May 29, 2008 at 8:26 am
Hi all,
I am making baby steps with SSIS these days... So my question will sound pretty basic... But here it is:
Let's say you want to insert into TableA on ServerA all the records from TableB on ServerB that are not already in TableA. If you used a linked server it would look like this (random example used here...):
use [dbA]
insert into dbo.TableA (ProductID, Description, Price)
select b.ProductID, b.Description, b.Price
from ServerB.dbB.dbo.TableB b
where b.ProductID not in (
select a.ProductID
from dbo.TableA a
)
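As a side note on the T-SQL itself: NOT IN can silently return no rows if TableA.ProductID ever contains a NULL, so the same anti-join is often safer written with NOT EXISTS. A sketch using the same hypothetical table names:

```sql
use [dbA]

-- Same anti-join, written with NOT EXISTS so a NULL ProductID in the
-- destination cannot make the whole NOT IN predicate evaluate to UNKNOWN.
insert into dbo.TableA (ProductID, Description, Price)
select b.ProductID, b.Description, b.Price
from ServerB.dbB.dbo.TableB b
where not exists (
    select 1
    from dbo.TableA a
    where a.ProductID = b.ProductID
);
```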
But in this case let's say I don't want to use linked server and I want to use SSIS instead.
I would define two OLE DB connection managers, one for "ServerA.dbA" and one for "ServerB.dbB".
Then I could use a Data Flow task and create one OLE DB Source where I select the products in TableA, and another where I select the products in TableB. The "insert" part would be quite straightforward as well, since all I would have to do is define an OLE DB Destination. But my problem is how to compare the two result sets from TableA and TableB so that I can "feed" my OLE DB Destination only the records that are not already in the destination table.
Thanks in advance!
claude
May 29, 2008 at 10:29 am
You can do this a few ways.
You can use a Lookup component in your data flow. Take the lookup failure output (the red arrow) and use it as the flow of rows to insert into your destination.
You could use a Merge Join component. In this case you would need two OLE DB Source components, you would have to sort the data, and then you would use the Merge Join with a LEFT OUTER JOIN. Then use a Conditional Split to keep the rows that came back with a NULL from the destination table and insert those records.
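In set terms, the Merge Join + Conditional Split pattern is doing the equivalent of this T-SQL (a sketch using the hypothetical table names from the original post, as if both tables were reachable from one server):

```sql
-- LEFT OUTER JOIN from source to destination, then keep only the rows
-- where the destination side came back NULL (i.e. not already loaded) --
-- which is exactly what the Conditional Split does on the joined flow.
select src.ProductID, src.Description, src.Price
from dbB.dbo.TableB as src
left outer join dbA.dbo.TableA as dest
    on src.ProductID = dest.ProductID
where dest.ProductID is null;
```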
Next, you could download the free TableDifference component:
http://www.sqlbi.com/Projects/TableDifference/tabid/74/language/en-US/Default.aspx
There are instructions on the website.
You could use the Slowly Changing Dimension data flow component. This basically compares your data flow to a table and creates an insert output and an update output, to insert new records and update existing ones that have changed.
Finally, you could stage the data in a table in the destination database and use a T-SQL statement similar to the one you have already written.
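For the staging option, the pattern might look like this (a sketch, assuming a staging table dbo.TableB_Staging on the destination server, loaded by a simple OLE DB Source -> OLE DB Destination data flow):

```sql
-- After the data flow has copied ServerB's rows into the local staging
-- table, a plain T-SQL anti-join finishes the load on the destination side.
insert into dbo.TableA (ProductID, Description, Price)
select s.ProductID, s.Description, s.Price
from dbo.TableB_Staging s
where not exists (
    select 1
    from dbo.TableA a
    where a.ProductID = s.ProductID
);

truncate table dbo.TableB_Staging;  -- clean up for the next run
```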
All of these options are legitimate and would be used in different situations depending on the number of records, the types of changes you are looking for (case sensitivity, for instance), and so on. If you are just doing some testing to learn how to use SSIS, I would suggest you try all of these options and compare them for performance, ease of use, and how intuitive they are to maintain in the data flow.
May 29, 2008 at 10:31 am
thanks! very useful answer
May 29, 2008 at 10:41 am
No problem - the "insert if missing" and "update if changed" ideas are really common in data warehousing. Look for information on maintaining slowly changing dimensions; a Type 1 SCD, being the simplest standard type, would be of most interest to you.
May 29, 2008 at 10:47 am
OT - Hey Michael, have you had a chance to test the upsert functionality (update or insert --> the new MERGE statement) in 08 yet? I'd be interested to hear any real-world performance stats.
Steve.
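For reference, the basic shape of the new MERGE upsert is something like this (an untested sketch against the tables from the original post, assuming the source rows have first been staged locally - MERGE needs both sides reachable in one statement):

```sql
-- SQL Server 2008 MERGE: insert new products, update changed ones
-- (a Type 1 SCD pattern); dbo.TableB_Staging is a hypothetical staged copy.
merge dbo.TableA as tgt
using dbo.TableB_Staging as src
    on tgt.ProductID = src.ProductID
when matched and (tgt.Description <> src.Description
                  or tgt.Price <> src.Price) then
    update set Description = src.Description,
               Price = src.Price
when not matched by target then
    insert (ProductID, Description, Price)
    values (src.ProductID, src.Description, src.Price);
```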
May 29, 2008 at 11:04 am
I have, but not in an environment to really test any performance.
May 29, 2008 at 1:34 pm
Actually, I was wondering... based on your experience, which method performs best for cases where you have to extract thousands of rows (tens of thousands in some instances) on a nightly basis, apply multiple transformations (remove duplicates, standardize state and country names, change the format of a few things, etc.), and load all that into a database located on a different server? I am sure the answer will be some sort of "it depends" - it always is 🙂
FYI: writing the ETL in T-SQL using linked servers would be straightforward, but I can't, and shortly we will also have external sources such as Excel files via FTP, so SSIS seemed like a better way to go... we are working on an ODS.
May 29, 2008 at 1:53 pm
Ok, here's my 2 cents.
Using a Lookup component is straightforward and easy to maintain, and it will work for sources of pretty much any size (no problems with 10 million rows) but only for small lookup sources (a few thousand rows). The lookup source must either be cached or queried once per row - if it cannot be cached reasonably, the component is slow.
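For the scenario above, for example, the lookup reference query could be trimmed to just the key column so the cache stays as small as possible (a sketch with the original post's table names):

```sql
-- Lookup reference query: only the join key, never SELECT * on the
-- destination table - the whole result set gets held in the lookup cache.
select ProductID
from dbo.TableA;
```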
A Merge Join works great for very large data sets. It is not as efficient as a T-SQL join on already-indexed tables, but if you have two different sources and you want to join them together, it can be as efficient as - or more efficient than - putting the data into two staging tables, indexing them, and then using T-SQL to join them. A good tip is to sort the data with your source queries and go into the Advanced Editor for the source component to manually set its sort properties, rather than using the Sort component in SSIS.
The TableDifference component is a Merge Join and a Conditional Split in one component. There is basically no performance difference. It is easily maintained and very intuitive. It also has the option of comparing case-insensitively, which the Merge Join will not do (SSIS is CASE AND ACCENT SENSITIVE in joins, comparative operations in expressions, and sorts).
The Slowly Changing Dimension data flow component is wizard-driven and really easy to use. It is horribly slow for anything more than a couple thousand records, but a 4-year-old could operate the thing, and I like anything easy to maintain. I have not tried the SQL 2008 version of this component, so the new one may perform better.
Staging the data is often the fastest-performing option, but it tends to lead to a lot of staging tables and extra data sitting around somewhere. It is also often less intuitive to break the operation up that way.
As a final note: use the tools for what they are good at. SQL Server / T-SQL is good at set-based operations. Everything in an SSIS data flow is a row-by-row operation - but it is really good at it. Try to split the operations up, use the tool best suited to what you are trying to do, and you will be OK.