Merging two large tables (>100m Rows)

  • SQL 2012

    I have a source table in the staging database stg.fact and it needs to be merged into the warehouse table whs.Fact.

    stg.fact is not a delta feed it is basically an intra-day refresh.

    Both tables have a last updated date so its easy to see which have changed.

    It will be new (insert) or changed (update) data that I am interested in, there are no deletions.

    As this could be in the millions of rows that are inserts or updates then this needs to be efficient.

    I expect whs.Fact to go to >150 million rows.

    When I have done this before I started with T-SQL Merge statement and that was not performant once I got to this size.

    My original option was to do this is SSIS with a lookup task that marks the inserts and updates and deal with them seperately. However I set up the lookup tranformation the reference data set will have a package variable in the SQL commnd. This does not seem possible with the lookup in 2012! Currently looking at Merge Join transformation and any clever basic T-SQL that could help as this will need to be fast, and thats where I think that T-SQL may be the better route.

    Both tables will have >100,000,000 rows

    Both tables have the last updated date

    The Tables are in different databases but on the same SQL Instance

    Each table holds 5 integer columns, one Varchar, one datatime

    Last time I used Merge it was a wider table with lots of columns so dont know if this would be an option.

    Much appreciated.

    E :hehe:

  • Seems like an (almost) duplicate post from this one:

    http://www.sqlservercentral.com/Forums/Topic1604321-364-1.aspx

    There is a lot of text in your message, but actually not a single question. 😉

    Personally I'd try with a simple lookup using full cache (so no variable mapping) or just with "clever" T-SQL 🙂

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Yes and yet No,

    I was hoping for different solutions from different people

    For the SSIS peeps:

    Would a lookup task be perfomant on that size data set. I am not sure of its limits. I was hoping to use the lookup on a subset of the new and existinjg data sets. So all the stuff I know has changed in the new feed, and the minimum data (earliest changed date) they relate to, Then get the new data that is after the earliest change date. This would allow me to pull back say 15 million rows rather than 100 million. I cant see a way of using a package variable in the lookup reference SQL?

    For the SQL peeps:

    Merge has failed in the past for me when data gets large would this be an option here ? How would you do it?

    Many Thanks

    E

  • Ells (8/18/2014)


    For the SSIS peeps:

    Would a lookup task be perfomant on that size data set. I am not sure of its limits. I was hoping to use the lookup on a subset of the new and existinjg data sets. So all the stuff I know has changed in the new feed, and the minimum data (earliest changed date) they relate to, Then get the new data that is after the earliest change date. This would allow me to pull back say 15 million rows rather than 100 million. I cant see a way of using a package variable in the lookup reference SQL?

    Try using an expression on the lookup component.

    Go to the control flow and go to the properties of the data flow. Open up the expressions and there you can select the SQL Command for the Lookup component. There you can write an expression that limits the rows.

    Ells (8/18/2014)


    For the SQL peeps:

    Merge has failed in the past for me when data gets large would this be an option here ? How would you do it?

    Just two simple statements: an INSERT and an UPDATE. With proper indexes of course.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • How many rows will be added / changed between uploads / syncs?

    I suggest considering Change Tracking or some other automated method of determining changes rather than trying to do it yourself. If the numbers are limited, you might pull them into a separate table, clustered identically to the main table, to make the update & insert as fast as possible.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Thanks guys,

    Will look at. The expresssion on the lookup. Previously I have resorted to insert and update statements. I gained performance and control and error trapping.

    Scott,

    I gave used CDC before and found this to be a good way forward. However I can not get any change tracking on this data. It belongs to someone else and buy the time that gets implemented this project will have finished.

    I think that there will be two types of changes in the table. After midnight the new data will come through so there will be a couple of million new rows, then through the day thee will be a small amount of change. I am guessing maybe a couple of thousand. So it's only the big chunk of new data that will be a problem and as that is insert it should be ok.

    Thanks peeps

    E

  • Have added the resolution here.

    Many Thanks

    E

    :w00t:

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply