SSIS Lookup - large Upserts

  • Using SSIS 2012

    I have a source table in the staging database, stg.fact, and it needs to be merged into the warehouse table whs.Fact. stg.fact is not a delta feed; it is basically an intra-day refresh. Both tables have a last updated date, so it's easy to see which rows have changed. I am only interested in new (insert) or changed (update) data; there are no deletions. As there could be millions of rows to insert or update, this needs to be efficient. I expect whs.Fact to grow beyond 150 million rows. When I have done this before I started with SQL MERGE, and that was not performant once I got to this size.
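
    Just to illustrate what I mean by spotting the changes (the LastUpdated column and the watermark value are only stand-ins for my actual schema):

        DECLARE @LastLoadTime datetime = '2014-08-18';   -- assumed watermark from the previous load

        SELECT *
        FROM   stg.fact
        WHERE  LastUpdated > @LastLoadTime;   -- rows inserted or changed since the last load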

    My original plan was to do a lookup, mark the inserts and updates, and deal with them separately. However, when I set up the Lookup transformation I wanted the reference data set to use a package variable in the SQL command. This does not seem possible with the Lookup in 2012! I am currently looking at the Merge Join transformation.

    Is there a better way to do this?

    Many Thanks

  • This is possible, but only when using partial caching (which might not be as performant). Is there a reason to use the variable?

    What about just joining the staging table and the fact table and doing separate INSERT and UPDATE statements?
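
    Something along these lines, for example (the key and measure column names are just placeholders for your schema):

        -- Sketch only: BusinessKey / Measure1 / LastUpdated are placeholder columns
        UPDATE w
        SET    w.Measure1    = s.Measure1,
               w.LastUpdated = s.LastUpdated
        FROM   whs.Fact AS w
        INNER JOIN stg.fact AS s
                ON s.BusinessKey = w.BusinessKey
        WHERE  s.LastUpdated > w.LastUpdated;

        INSERT INTO whs.Fact (BusinessKey, Measure1, LastUpdated)
        SELECT s.BusinessKey, s.Measure1, s.LastUpdated
        FROM   stg.fact AS s
        WHERE  NOT EXISTS (SELECT 1
                           FROM   whs.Fact AS w
                           WHERE  w.BusinessKey = s.BusinessKey);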

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

    From what I have read, partial cache may not be a good option. I am considering Merge Join, which would make the inserts and updates against a limited set of data. However, I suspect this may not be quick.

    I am very cautious over this as I want it to work properly first time.

    Thanks

    E

  • Ells (8/18/2014)


    From what I have read, partial cache may not be a good option. I am considering Merge Join, which would make the inserts and updates against a limited set of data. However, I suspect this may not be quick.

    I am very cautious over this as I want it to work properly first time.

    Thanks

    E

    The MERGE JOIN needs sorted data, so this can give a performance hit as well.
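
    If you do go that route, you can at least push the sort into the source query instead of using a Sort transformation, something like this (BusinessKey is just a placeholder for your join key):

        -- Source query for each Merge Join input, sorted on the join key at the server
        -- (then mark the source output as IsSorted with SortKeyPosition = 1 in the Advanced Editor)
        SELECT BusinessKey, Measure1, LastUpdated
        FROM   whs.Fact
        ORDER  BY BusinessKey;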

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I am inclined to surrender and go for the lookup. Perhaps what I wanted is not so easy. I just thought it would be nice to do the lookup against as small a selection as possible.

    Will put thinking cap on again.

    Thanks

    E

    🙂

    The solution I have gone for (thanks to Koen) is to use a Lookup transformation with an expression that dynamically limits the number of rows required. The expression is set on the Data Flow task and supplies the SQL command of the Lookup from a variable.

    I know that the data coming through my Data Flow task is only changed data, based on the last updated datestamp (so it is as efficient as possible). I then get the earliest date the data applies to (the Date Key). The Lookup then presents a slice of the reference set through the expression, which returns all data with a Date Key >= the earliest Date Key in the changes.
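
    In other words, the expression on the Data Flow task builds the Lookup's SQL command from the variable, so the reference query that actually runs ends up looking something like this (the column names and the value are just examples from my setup):

        -- Effective Lookup reference query; the DateKey value comes from the package
        -- variable via the expression on the Data Flow task's [Lookup].[SqlCommand] property
        SELECT BusinessKey, DateKey, LastUpdated
        FROM   whs.Fact
        WHERE  DateKey >= 20140818;   -- earliest Date Key present in the changed rows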

    The Lookup output feeds either the insert or the update OLE DB commands. With this data set I foresee a couple of million inserts on the new day, then a regular small number of updates throughout the day, but I will wait and see.

    Many Thanks Peeps

    E

    🙂

  • Ells (8/20/2014)


    The Lookup output feeds either the insert or the update OLE DB commands. With this data set I foresee a couple of million inserts on the new day, then a regular small number of updates throughout the day, but I will wait and see.

    Personally, I would insert the updates into a staging table and do a single T-SQL UPDATE statement.
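
    That is, send the matched rows to a work table (I'll call it stg.Fact_Updates here) and then run one set-based statement rather than the row-by-row OLE DB Command:

        -- Sketch only: stg.Fact_Updates holds the rows routed down the update path
        UPDATE w
        SET    w.Measure1    = u.Measure1,
               w.LastUpdated = u.LastUpdated
        FROM   whs.Fact AS w
        INNER JOIN stg.Fact_Updates AS u
                ON u.BusinessKey = w.BusinessKey;

    One UPDATE over the whole batch avoids firing a statement once per row.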

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

    Yes, on my searches I did come across this as best practice, and I will be looking at it as an enhancement. I was more focused on the lookup.

    So that's two things I have learnt; better take it easy now.

    Many Thanks

    E

    Just some food for thought, but I was recently reworking the ETL process that builds my fact table, which runs every night. Incremental loading has done wonders for time efficiency (same as you, just inserting new rows and updating existing ones), and using T-SQL to make the updates also works extremely fast. Also, have you thought about using partitions on your fact table? That helped me out a lot as well.
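
    As a rough illustration (the boundaries and names here are made up), partitioning on the Date Key means the daily inserts and the intra-day updates only touch the most recent partitions:

        -- Sketch only: monthly partitions on an int DateKey (yyyymmdd)
        CREATE PARTITION FUNCTION pfFactDate (int)
            AS RANGE RIGHT FOR VALUES (20140601, 20140701, 20140801);

        CREATE PARTITION SCHEME psFactDate
            AS PARTITION pfFactDate ALL TO ([PRIMARY]);

        -- whs.Fact is then created ON psFactDate (DateKey)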
