Blog Post

Existence Check Deconstructed -- SSIS Lookup Transformation

,

There are countless ways to mix and match components in SSIS to simulate what SQL Server does in a query.  Here I am going to show you how to replicate an existence check using SSIS components.  The only time I use this type of configuration is when I have cross-server joins.  I find it more efficient for SSIS to hop to different servers in discrete components than to rely on SQL Server to do linked server calls.  Here is an example of a simple select statement that does a cross-server join to validate the data.

SELECT isbn13,title,author

   FROM dbo.elist e

WHERE EXISTS (SELECT 1

       FROM [otherserver].store.dbo.products p

  WHERE e.isbn13=p.isbn13)


To extract this data in SSIS you could simply put this in an  OLEDB Source within DataFlow component like so:

 

There are two reasons I don't care for this method.  First, your package will only have the datasource for the primary server, so you will not know from looking at the package that it also has a dependency on the secondary server ([otherserver]).  Second, you are relying on the efficiency of the linked servers to do the work.  SSIS has a much cleaner mechanism; Lookup Transformation.  Here's how it would look if we took apart the above query and used SSIS components to do the same work:

 

Let's break this image down and see how we created a deconstructed SQL Server existence check in SSIS.  In our OLEDB Source we removed the existence check and simply pull from the source table:

 

Next we drop a Lookup Transformation and in that we put our subselect:

Now at this point each component is pointing to it's own Server.Database source and each of those sources will be listed in the Connection Manager tab.  This design pattern is self-documenting since there are no servers hiding inside components.  This is one of the benefits of using this pattern.  Now we have to connect the two components.

 After connecting them we now need to tell SSIS this is an existence check.  The great thing about this component is that it can act as both  as an exists or a not exists.  We do that in two steps.  The first step is to tell the component to pass all rows that don't match to a special output (it's highlited in yellow).  If we didn't do that, the default is to fail the component and while there are many reasons you would want that, we don't here.

Before we move on to the final step I want to take a moment to explain a little more about this component.  You will notice there are 3 choices at the top; Full Cache, Partial Cache, No Cache.  The default choice is Full Cache and that tells SSIS to select all records from the table for the chosen columns and store them in memory for that component during the initialization phase of the package.  The only two caveats are the records must be unique (use a distinct if you have to) and the total number of records must fit in memory.  If you've jumped ahead a little you'll be thinking "I don't want to use this on very large tables."  That's not totally accurate, as you can also use the partial cache or no cache.  Partial Cache gives SSIS a limited amount of memory and does not prefil it.  What happens is when the first record comes down from the source the component checks it's internal cache and if there is no match it goes out to the database (using the provided query) and tries to retrieve a record to match the key.  If there is a match, that record is stored in the cache and the next time that key is in the data stream the component will not have to go to the database.  You can set up the maximum amount of memory the component is allowed in this tab:

This works out well if you have a large set of values you are checking AGAINST, but the source table is small as the Lookup Transform will only cache the matches and only make a call to the database for each non-existent row.  This means if you have 10,000 records in your source table, but only 100 unique isbn13s, you will only make 100 calls through the lookup (assuming your cache is big enough to hold all 100 keys).  If your 10,000 records were all unique, then you would make 10,000 calls to the lookup table, which would be similar to the No Cache. 

With No cache, the component does a lookup for each incoming record and only stores the last retrieved key.  This can act identical to the Partial cache if you have unique values in the source table.  Even with all the potential gotchas, this combination can be more efficient than a cross-server join.

Finally, you connect the output of the lookup component to the destination.  This is where you can you the Lookup Transformation as either an Exists or Not Exists.  Here is the final product one more time:

 

We are looking to replicate an exists and to do that we use the Lookup Match Output.  This output will be for records that have a match in the output.  If you wanted a not exists you would simply use the Lookup No Match Output.  You can use them both if you wanted, getting a list of records that do exist and records that do not exist.

Have at it and see if this gives your SSIS packages the boost you were looking for.

Original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating