April 8, 2008 at 8:43 am
I'm attempting to write an SSIS package that
1. updates a table in the target database from the result set of a staging table joined to multiple reference tables in the target database, where the staging table's key values already exist in the target table.
2. inserts new data into the target database table from the result set of the staging table joined to multiple reference tables in the target database, where the staging table's key values do not yet exist in the target table.
Data volumes would be up to 1 million rows in stage table b.
1. update target table a
from stage table b joined to N number of reference tables in target database
where stage table data already exists in target table
2. insert into target table a
from stage table b joined to N number of reference tables in target database
where stage table data does not exist in target table
I was able to define the OLE DB Source for the Insert logic in the Data Flow Task. What transformation can be used to perform the Update logic?
April 8, 2008 at 10:55 am
I assume you used an OLE DB Destination for the insert. You could use an OLE DB Command for the update.
1 million rows for this type of operation may be a bit much.
You may want to consider staging your data and using T-SQL to do the insert and update.
I like to handle dimension-update operations like this entirely in SSIS, but it has its limits.
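Something along these lines would do it set-based, one statement per operation (just a sketch; the table and column names and the BusinessKey join are made up for illustration, and the joins to your reference tables still need to be added):

-- Update rows whose key values already exist in the target
UPDATE t
SET    t.Col1 = s.Col1,
       t.Col2 = s.Col2
FROM   dbo.TargetTableA AS t
       INNER JOIN dbo.StageTableB AS s
              ON s.BusinessKey = t.BusinessKey;

-- Insert the rows from the stage table that are not in the target yet
INSERT INTO dbo.TargetTableA (BusinessKey, Col1, Col2)
SELECT s.BusinessKey, s.Col1, s.Col2
FROM   dbo.StageTableB AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.TargetTableA AS t
                   WHERE  t.BusinessKey = s.BusinessKey);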
April 14, 2010 at 2:36 am
Not sure if you're familiar with it, but Microsoft has a best-practice example for data warehouse / ETL called Project Real.
Worth a read.
MCITP: Business Intelligence Developer (2005)
April 14, 2010 at 3:08 am
Certainly don't use the OLE DB Command for the updates, as it will issue one million transactions against your DB if you have one million updates.
This is the approach that I use:
0. Set DelayValidation of the package and data flow tasks to TRUE and set RetainSameConnection of the connections to staging and the DWH to TRUE.
1. Create a global temp table with the exact same structure as the destination table. Just make sure to drop the identity property, if there is one.
2. Start a transaction with an Execute SQL Task: "BEGIN TRAN" on the connection to the DWH.
3. Use a data flow task to distinguish between updates and inserts. Do this by using an OLE DB Source on your staging table. Read all rows, perform any lookups against other tables to get surrogate keys, and then finally join with your destination table in the DWH on the business key.
4. Use a Conditional Split after the source. If the SK of the DWH table is NULL, then the row is an insert. If it is not, then a match was found between staging and the DWH, which means it is an update.
5. The insert output goes to an OLE DB Destination with fast load, which writes to the destination table. Just make sure you don't map the SK of the DWH table, as identity inserts are not allowed.
6. The update output goes to another OLE DB Destination, which writes to the temp table. Here you do want to map the SK (that's why there should be no identity property on the temp table).
7. After the dataflow, place an Execute SQL Task. This will update the destination table. Just join the destination table and the temp table on the SK. Since this is a (small) integer, this goes really fast. (A rough sketch of steps 1 and 7 follows at the end of this post.)
8. Commit the transaction with an Execute SQL Task: "COMMIT".
The transaction is optional, but it makes sure that there are only updates if the inserts were successful and vice versa.
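A rough T-SQL sketch of steps 1 and 7 (the table and column names here are made up; swap in your own destination table, temp table and surrogate key):

-- Step 1: global temp table mirroring the destination, minus the identity property on the SK
CREATE TABLE ##Updates
(
    CustomerSK    int            NOT NULL,  -- surrogate key column, deliberately NOT declared as IDENTITY
    CustomerName  nvarchar(100)  NULL,
    City          nvarchar(100)  NULL
);

-- Step 7: Execute SQL Task after the dataflow, joining destination and temp table on the SK
UPDATE d
SET    d.CustomerName = u.CustomerName,
       d.City         = u.City
FROM   dbo.DimCustomer AS d
       INNER JOIN ##Updates AS u
              ON u.CustomerSK = d.CustomerSK;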
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
April 14, 2010 at 3:11 am
Oh yeah, forgot to mention:
If you have SQL Server 2008, just use the MERGE statement; it's much, much easier :-D.
If you have 2005 and you're allowed to use third-party components, you can also use the Kimball SCD component from CodePlex.
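For the MERGE route, the whole thing collapses into one statement, roughly like this (a sketch only, with made-up table and column names; match it to your own schema and business key):

MERGE dbo.TargetTableA AS t
USING dbo.StageTableB  AS s
      ON s.BusinessKey = t.BusinessKey
WHEN MATCHED THEN
    UPDATE SET t.Col1 = s.Col1,
               t.Col2 = s.Col2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Col1, Col2)
    VALUES (s.BusinessKey, s.Col1, s.Col2);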
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP