Using SSIS Data Flows with Large Lookups?

  • Hi all - I'm not sure if this is the right forum for SSIS questions; if it isn't, please let me know which one I should be using.

    I have a task that requires me to take a raw data set, and process it into a set of tables for different information - we'll call them tables A, B, and C.

    One particular aspect of this task is that table B has a foreign key tying it to table A, and table C has foreign keys tying it to both tables A and B - though the AID on a table C record need not match the AID of the table B record it references. For example:

    AID  ARecord
    1    abc
    2    def
    3    ghi

    BID  AID  BRecord
    1    1    jkl
    2    1    mno
    3    3    pqr

    CID  BID  AID  CRecord
    1    1    1    stu
    2    3    2    vwx
    3    3    3    yz
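
    In DDL terms the relationship looks roughly like this (a minimal sketch; the table and column names follow the example above and the types are placeholders, not the real structure):

    CREATE TABLE dbo.TableA
    (
        AID     int IDENTITY(1,1) PRIMARY KEY,
        ARecord varchar(100) NOT NULL
    );

    CREATE TABLE dbo.TableB
    (
        BID     int IDENTITY(1,1) PRIMARY KEY,
        AID     int NOT NULL REFERENCES dbo.TableA (AID),
        BRecord varchar(100) NOT NULL
    );

    CREATE TABLE dbo.TableC
    (
        CID     int IDENTITY(1,1) PRIMARY KEY,
        BID     int NOT NULL REFERENCES dbo.TableB (BID),
        AID     int NOT NULL REFERENCES dbo.TableA (AID),  -- may differ from the AID on the referenced table B row
        CRecord varchar(100) NOT NULL
    );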

    The method I've taken to handle this is to use SSIS to build table A from the raw data set first. Once that is done, table B is built from the raw data set, with table A used as a lookup to supply the AID. Finally, table C is built, with tables A and B used as lookups for the AID and BID fields respectively.
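
    For reference, the table C step is effectively resolving, row by row, the surrogate keys that a set-based join would produce. A rough sketch, ignoring no-match handling (dbo.RawData and the join columns are placeholders; the real lookups may match on several columns):

    INSERT INTO dbo.TableC (BID, AID, CRecord)
    SELECT b.BID,             -- value the table B lookup supplies
           a.AID,             -- value the table A lookup supplies
           r.CRecord
    FROM dbo.RawData AS r
    JOIN dbo.TableA  AS a ON a.ARecord = r.ARecord
    JOIN dbo.TableB  AS b ON b.BRecord = r.BRecord;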

    Now, this works perfectly well when the data sets are small enough. The machine the process runs on has 16 GB of memory, so it can handle fairly intense queries. However, the lookups are using a ridiculous amount of memory - more than even this machine can handle.

    One raw data set contains roughly 10 million records. Processed into its respective tables, it yields about 8 million table A records and about 10 million table B records. When the task to build table C runs, the lookups against tables A and B take up about 7 GB of memory each, which causes the task to grind to a halt after processing around 2 million records.

    We initially used a full cache for the lookups; after some research I found that full cache requires enough memory to hold the entire lookup table, and if that memory isn't available the task fails. We tried partial cache next, but then the process took far too long to execute, because the per-row lookups against the database were far too slow.
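
    Something I still need to verify is how much the lookup reference queries themselves contribute: a full-cache lookup caches every column its reference query returns, and 7 GB across roughly 8-10 million rows works out to several hundred bytes per cached row, which suggests more than the match column and the ID is being cached. A trimmed reference query returning only those columns (names below are placeholders), plus an index on the match column to help the partial-cache case, would look something like this:

    -- Lookup reference query for table A: return only what the lookup needs.
    -- ARecord stands in for the real match column(s).
    SELECT AID, ARecord
    FROM dbo.TableA;

    -- Same idea for table B.
    SELECT BID, BRecord
    FROM dbo.TableB;

    -- For partial cache, the per-row probes need an index on the match column.
    CREATE INDEX IX_TableA_ARecord ON dbo.TableA (ARecord);
    CREATE INDEX IX_TableB_BRecord ON dbo.TableB (BRecord);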

    The approach I plan to try next is to split the table C build into four passes by dividing the raw data source into four parts. Each pass would push far fewer records through the lookups against tables A and B, and the whole process should then complete successfully.

    However, this approach is a bit cumbersome, because it requires me to duplicate the task four times and change the respective source queries. Furthermore, every time the raw data source grows, I'll need to rethink the split so that the tasks stay manageable.
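
    For illustration, each of the four copies would use a source query along these lines (RawID stands in for whatever key column the raw data actually carries, and the boundaries would need adjusting as the source grows):

    -- Source query for the first quarter of the table C build.
    -- The other three copies change the RawID range accordingly.
    SELECT *
    FROM dbo.RawData
    WHERE RawID >= 1 AND RawID < 2500000;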

    Is there a better approach to handling this requirement, or is my approach the best way to handle the task?

  • Hi,

    What is the structure of the data? Can you post a sample?

    I'm not clear on why you need lookup tables - are you pulling some data item from one table into the other?

    Allister

    Well, there were two approaches I could have taken here. I could read a given row and write it to tables A, B, and C - i.e., do the processing for all three tables at the same time and perform three table writes in one task - or I could do the writes to table A, then B, then C as three separate tasks.

    The reason I chose the latter approach is that writing to all three tables at once forces everything to be done row by row - I write the row to table A, get the identity value back from that write, pass it to the write to table B, get that identity back, and then pass both values to the table C write. Further, I would either have to do duplicate elimination on every write rather than once at the very end, or do a clean-up on the other tables after eliminating duplicates in one, since they are tied together.
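
    To illustrate, the per-row sequence would have to look something like this (variable and column names follow the earlier A/B/C example and are illustrative only):

    DECLARE @ARecord varchar(100) = 'abc',
            @BRecord varchar(100) = 'jkl',
            @CRecord varchar(100) = 'stu';
    DECLARE @NewAID int, @NewBID int;

    INSERT INTO dbo.TableA (ARecord) VALUES (@ARecord);
    SET @NewAID = SCOPE_IDENTITY();   -- identity handed on to the table B write

    INSERT INTO dbo.TableB (AID, BRecord) VALUES (@NewAID, @BRecord);
    SET @NewBID = SCOPE_IDENTITY();   -- identity handed on to the table C write

    -- In the real data, the AID written to table C may reference a different table A row.
    INSERT INTO dbo.TableC (BID, AID, CRecord) VALUES (@NewBID, @NewAID, @CRecord);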

    The data in question is raw client information, which is being mapped into contact, company, and address tables respectively. I can't really provide an example of the data since it's somewhat sensitive, but I can provide the table structure if that would be useful.

  • I understand the sensitivity issue, but without knowing anything about the source data structure I can’t really offer a lot of help. Perhaps you can describe a few lines? Destination table structures would help too!

    However, in general terms: I would go for the former rather than the latter, as processing the source data into the three destination tables at once should give a significant performance boost, which is particularly important for very large data sets. Writing to the three tables simultaneously should also reduce the amount of information you need to hold in memory (RAM) to complete the task.

  • Not sure if you replied before I edited my post to explain why I favored the approach of writing the tables as a whole one at a time, instead of writing the records row by row to all three tables at the same time.

    You are certainly right that using one data flow to handle all three writes at the same time would use far less memory, but I fear it would take far longer to process, due to the need to do duplicate elimination at every insert.

    Here's an example of the table structure that's being used, slightly edited.

    You may be faced with no other option than to import the data into the three tables in one data flow, then perform your duplicate elimination in a second data flow.
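
    In T-SQL terms, that second pass could amount to something like this (a sketch only, assuming ARecord alone defines a duplicate in table A); the same pattern then applies to tables B and C, plus whatever orphan clean-up the foreign keys require:

    WITH Dupes AS
    (
        SELECT AID,
               ROW_NUMBER() OVER (PARTITION BY ARecord ORDER BY AID) AS rn
        FROM dbo.TableA
    )
    DELETE FROM Dupes
    WHERE rn > 1;   -- keeps the first row of each duplicate group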
