How to load data while creating missing lookups as well in a data flow?

  • Dear all,

    I am trying to load some (simple) data in a table but I need to store a lookup id instead of the description I receive.

    However, some of the lookup records may not exist so I need to create them before doing my final insert.

    It all seems relatively simple but I am stuck with the solution I tried, which is

    - grab the incoming data

    - use a multicast to create a branch in which I'll create the missing rows

    - in the main flow, "wait" for the creation of the missing rows to finish

    - do my final lookup

    - insert into the target table

    In the "Create new lookup rows", I use

    - a sort to do a distinct on th elookup descriptions

    - lookup to join on that description

    - conditional split to extract those with missing lookup

    - insert the new rows in the lookup table

    It all works fine... apart that I have no idea how to force the "main" flow to wait for the creation of the missing lookup rows.

    I kind of understand I can't create a flow out of a destination but...

    Have I got into a complete dead end?

    How am I supposed to handle this (simple) kind of situation?

    Thanks

    Eric

  • What you are trying to do is called "surrogate key generation" and it is pretty standard process in SSIS. Check the following articles:

    Generating Surrogate Keys

    Surrogate key generation in SSIS

    Generating Surrogate-Keys for Type 1 and Type 2 dimensions using SSIS[/url]

    ---
    SSIS Tasks Components Scripts Services | http://www.cozyroc.com/

  • Thanks

  • Hum... I understand the ideas but it sounds terribly complicated for my little problem...

    And I am not convinced that there can't be concurrency issues creeping in.

    In my case, I don't have a current indicator because it's not a dimension and it's simply a "fire and forget" kind of lookup table.

    I am tempted to do it all in tsql since it will take roughly two statements and two scans...

    One insert based on an LEFT JOIN to generate my missing keys followed by an insert based on an INNER JOIN to insert my final records.

    I will have to copy the initial data in my database to start with but never mind... :doze:

    My volume are either small or very small so speed is not an issue.

    Having said that, these articles are very interesting and I might have to do it for proper dimension tables soon, with much bigger volumes.

    Many thanks for the links 😀

  • I looked over the first two articles and felt that they were simplistic at best and problematic at any level.

    The third article I am still digesting..

    However, you can do this a couple of different ways..

    In a data warehouse you often have a dimension package that loads a dimension.. This is usually done before you try and load the fact data..

    Another way, which I am less a fan of, is to capture lookup failures and build the keys in real-time. For SQL 2005 this is more complex, the lookup transform only has 2 outputs, lookup success and error, the SQL 2008 has 3, lookup success, lookup fail, and error. Since this is a 2005 question you have to configure the error output to handle the lookup failure. The problem you will have is that the lookup doesn't really pass just one record to the output, it passes a buffer, and that buffer will almost certainly have more than one record. So your logic needs to be smart enough to verify that it hasn't already added the new record, because you could have more than one record in the buffer that could trigger the insert for the same record. You could then do an additional lookup or just write the new id into the buffer and then use a UNION to bring them back together.

    I try not to use that method except for the most remote case. The adding of rows to the dimension should be substantially less frequent than writing to the fact.

    CEWII

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply