Anatomy of an Incremental Load

  • pleasehelpme,

    I hope I made it clear in the article that this is a demonstration of principle and not production-ready code.

    There's a bunch of other stuff you can do (and will need to do) to design an SSIS application to load hundreds of millions of rows.

    With that volume of data I don't use design patterns of any sort. Everything is developed around the specific conditions that impact SSIS performance - from the network topology to the shape of the data.

    kenambrose,

    Yep.

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • Hi Andy,

    Thanks for the quick response.

    In the lookup, I am matching all of the columns in the composite key. However, in the conditional split for New Inserts I wasn't sure whether to use:

    ISNULL(contractid)&&ISNULL(start_date)&&ISNULL(EndDate)&&ISNULL(effective_date)

    OR

    ISNULL(contractid) || ISNULL(start_date) || ISNULL(EndDate) || ISNULL(effective_date)

    That is, should I use "&&" or "||"?

    Thanks,

    Please advise

    dbdeli

  • Hi dbdeli,

    Since you're checking for a Null and all columns returned will be Null for a new row (a row where the Left Outer Join fails to find a match), you can simply pick any one of them for the test (given your join is defined as ON all four columns with AND operations).
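
    For example, with the columns you listed, the New Inserts output condition can be as simple as:

    ISNULL(contractid)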

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • I've implemented the Anatomy of an Incremental Load the same way it was illustrated. The SSIS ETL package loads a dimension and two fact tables, and it ran successfully for two months but started failing last week. It runs perfectly on the development server but fails in production.

    Some troubleshooting showed that it only fails when trying to load the final fact table. I followed some recommendations that were posted, but I kept getting similar errors:

    [7131]: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x8007000E.

    Error: 0xC020204A at fact_contract_count incremental load, contract count exception [7131]: Unable to retrieve column information from the data source. Make sure your target table in the database is available.

    Error: 0xC004701A at fact_contract_count incremental load, DTS.Pipeline: component "contract count exception" (7131) failed the pre-execute phase and returned error code 0xC020204A.

    Error: 0xC0047031 at fact_contract_count incremental load, DTS.Pipeline: The Data Flow task failed to create a required thread and cannot begin running. The usually occurs when there is an out-of-memory state.

    Error: 0xC004700E at fact_contract_count incremental load, DTS.Pipeline: The Data Flow task engine failed at startup because it cannot create one or more required threads.

    I completely disabled the email and error event handlers and I keep getting the same error or similar errors.

    I also tried setting the AlwaysUseDefaultCodePage custom property of the OLE DB Source component to True, and did the same for the destination, but I get similar errors. In addition, I even set DelayValidation to True, but that didn't help.

  • Hi,

    Nice article, it helped me a lot, so thanks!

    But 🙁 I think there is something missing between "Write: Changed Rows" and the "Conclusion". Even the image numbers aren't correct. Could someone look into it?

    Thanks!

  • Beware that this approach doesn't seem to work well with big tables. If you have many millions of rows and a couple dozen columns, you will run out of memory, because the Lookup transformation pre-caches the whole dataset. It is possible to configure partial cache or no cache, but in that case execution speed degrades to a crawl.

    On the other hand if a table is under a million records and you don't have that many columns that can change in an incremental load, then this approach does work very reliably.

  • Andy Leonard (2/11/2008): As I mentioned in the article, "NULL-handling in SSIS Expression Language could darn-near rate its own article!"

    We ended up doing something like this:

    (ISNULL([Target.RebateDay] != [Old.RebateDay]) ? !(ISNULL([Target.RebateDay]) && ISNULL([Old.RebateDay])) : [Target.RebateDay] != [Old.RebateDay])

    We have to repeat this for every nullable column, which is a tad ugly. Thankfully, in our case there are not many nullable columns.

    Your approach works with strings, but a) there are other data types and b) what if values change from null to empty string and back on a regular basis on a large percentage of rows?

    PS. I haven't read the whole thread, my apologies if this or the issue above has already been addressed.

  • Andy:

    Well done. Enjoyed the article since I am a Microsoft BI developer/data architect and have gone through much of what you have written.

    May I suggest, for an article, how the source DB is prepared for incremental loads? Datetime modified, datetime created, SIDs or GIDs, etc.

    Great work! Love this forum!

    Jim O'Toole

    casinc815@hotmail.com

  • This is a good example that I would consider for use with a non-SQL Server based implementation. However, for SQL Server to SQL Server implementations, this could be accomplished much more easily with replication.

  • Excellent article, thanks Andy. It's also very well timed for my company. We're having some trouble convincing people that trunc & reload is NOT the industry best practice.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Touché

    It would only be used in the development of a warehousing project. After the initial load of all data, incremental should be the choice.

    I have also run into the opinion that trunc and reload would be preferred. That example is companies managing 50 GB or less of data. Maybe that will work fine. But if you move gigs daily (nightly), then incremental is truly the way to go.

    Jim

  • Nice article Andy.

    It makes me want to revisit a recent package where I needed to find the records that were Added/Changed/Deleted. Unfortunately, I no longer have access to the data. One table had over 89 million rows. It would have been a good test of how well the lookup performs against a large table.



    Alvin Ramard
    Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

  • Nice article. This seems better than the out-of-the-box SCD.

    I am intrigued by the comment about SQL 2008 being able to detect changed rows at the source. Is it assuming a SQL Server 2008 source?

    Another idea would be to compare the rows at the source using a distributed query before loading the stage table(s) for the update/insert/delete.
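
    For example (a rough sketch only - the linked server name ORA_SRC and the table/column names below are illustrative, not from the article):

    SELECT s.ContractKey, s.Val1, s.Val2
    FROM OPENQUERY(ORA_SRC, 'SELECT ContractKey, Val1, Val2 FROM SRC_SCHEMA.CONTRACTS') AS s
    EXCEPT
    SELECT p.ContractKey, p.Val1, p.Val2
    FROM dbo.Contracts AS p;

    The EXCEPT returns the source rows that are new or changed relative to the destination.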

    For me, my main source is Oracle. I use several methods; it depends on the size, the context, and the performance.

    - For small tables where I do not care about deletes, the out-of-the-box SCD is good

    - For small to medium tables where I need deletes, a solution like Andy's looks good

    - For large tables I do it in two ways: one way involves putting the production table offline for 1 ms, the other involves slowing down access to the prod table.

    Updates and deletes can be slow, and doing all of it in place may require proper handling of locks.

    Solution 1

    Load the data, i.e. 3 months, into a staging table

    Insert into the staging table the production data older than 3 months

    Drop the production table

    Rename the staging table to production

    Rebuild indexes
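
    A rough T-SQL sketch of the swap (the object names and the LoadDate column are illustrative only, not a production script):

    -- 1. The last 3 months are loaded from the source into dbo.Fact_Stage (via SSIS)
    -- 2. Copy the older production history into the staging table
    INSERT INTO dbo.Fact_Stage
    SELECT *
    FROM dbo.Fact
    WHERE LoadDate < DATEADD(MONTH, -3, GETDATE());

    -- 3. Swap the tables (the brief "offline" moment)
    DROP TABLE dbo.Fact;
    EXEC sp_rename 'dbo.Fact_Stage', 'Fact';

    -- 4. Rebuild the indexes on the renamed table
    ALTER INDEX ALL ON dbo.Fact REBUILD;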

    -

    Solution 2 (I wish I knew how to hash a natural key; I never found a way that guarantees 100% uniqueness and keeps it small)

    Load the data into stage

    -- These statements run in procedures called by SSIS; use a transaction, TRY...CATCH, and add the row back to an update queue if deadlocks are a concern.

    DELETE FROM Prod
    WHERE [key] NOT IN (SELECT s.[key] FROM Stage s);

    ----

    UPDATE p
    SET p.val1 = s.val1,
        p.val2 = s.val2
    FROM Prod p
    JOIN Stage s ON p.[key] = s.[key]
    WHERE p.val1 != s.val1
       OR p.val2 != s.val2;

    ----

    INSERT INTO Prod (val1, val2)
    SELECT s.val1, s.val2
    FROM Stage s
    LEFT JOIN Prod p ON s.[key] = p.[key]  -- if this join is too slow, use a subquery
    WHERE p.[key] IS NULL;

    BI Guy

  • Andy,

    Great post.

    What if the source were DB2 and the destination were SQL Server? Would you use linked servers for the initial query? Or is there another way?

    Thanks.

  • maccenon, the editor corrected the missing information recently (thanks Steve!). The entire article should now display.

    zespri, this article is intended to demonstrate introductory principles of incremental loads. I actually have a script I execute against SQL Server sources that builds the Changed_Rows conditional expression dynamically. You are correct - the value substituted for NULL in the TRUE argument of the ternary conditional cannot appear in the data itself, or you run the risk of a logical collision. Data profiling will mitigate this, but when loading human-entered data, all bets are off.
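
    A rough sketch of that idea follows (this is not the actual script; the table name, the Dest_ column prefix, and the "~NULL~" placeholder are assumptions for illustration):

    DECLARE @expr NVARCHAR(MAX);
    SET @expr = N'';

    SELECT @expr = @expr
        + CASE WHEN @expr = N'' THEN N'(' ELSE N' || (' END
        + N'(ISNULL(' + COLUMN_NAME + N') ? "~NULL~" : (DT_WSTR,4000)' + COLUMN_NAME + N')'
        + N' != '
        + N'(ISNULL(Dest_' + COLUMN_NAME + N') ? "~NULL~" : (DT_WSTR,4000)Dest_' + COLUMN_NAME + N'))'
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'Contact'          -- source table name is an assumption
      AND COLUMN_NAME <> 'ContactID';     -- exclude the business key

    PRINT @expr;  -- paste the result into the Changed Rows output of the Conditional Split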

    casinc815, thanks - excellent suggestion. Perhaps I could include information on the kinds of data profiling I do during ETL design.

    Shon, thanks. You are correct - the load demonstrated in this article could be accomplished using replication. However, I've seen replication used where SSIS should be used more times than I've seen the inverse - especially custom replication.

    Grant, thanks bro! Whack 'n Load does have the advantage of either failing or delivering an exact copy. That's an important distinction to make. When you filter, you run the risk of having an unwanted row slip through or be unintentionally blocked.

    casinc815, I've had a few opportunities to convert destructive loads to incremental. In one extreme case on which I merely consulted, the client was able to scale the ETL process 40x within the same maintenance/load window. That was cool!

    Alvin, durnit! Here's hoping you get the chance. As others have pointed out, this example doesn't exhaust all you'd need to do to gain performance from your load. You would want to pay particular attention to the lookup strategy, and I'd seriously consider implementing a Change Data Capture (CDC) mechanism.

    Philippe, thanks. I know very little about Oracle but I've heard Oracle has Change Data Capture capabilities. I've built my own CDC in the past, usually implemented with triggers and default constraints. I've seen some implemented in a stored procedure API. The bottom line is you need something doing automatic updates/inserts in which you have complete confidence. I should probably write about this as well - I've used bit columns, int columns, char(1) columns, and LastUpdated/ModifiedDate DateTime columns. The idea is to be able to detect rows that were changed at the source, so you don't have to load unchanged records into the Source Adapter in SSIS only to throw them away at the Filter Conditional Split. The underlying assumption is that you will be able to alter the schema of the system of record, and this is not an option in many cases.
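
    For example, the LastUpdated flavor might look something like this (a minimal sketch with illustrative table and key names, not a full CDC implementation):

    ALTER TABLE dbo.Contact
      ADD LastUpdated DATETIME NOT NULL
          CONSTRAINT DF_Contact_LastUpdated DEFAULT (GETDATE());
    GO

    CREATE TRIGGER dbo.trg_Contact_LastUpdated
    ON dbo.Contact
    AFTER UPDATE
    AS
    BEGIN
      SET NOCOUNT ON;
      UPDATE c
         SET LastUpdated = GETDATE()
        FROM dbo.Contact c
        JOIN inserted i ON i.ContactID = c.ContactID;  -- ContactID assumed to be the key
    END
    GO

    -- The SSIS source query then pulls only the rows changed since the last successful load:
    -- SELECT ... FROM dbo.Contact WHERE LastUpdated > ?  -- ? = last load time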

    Jay, thanks. It depends [(R)Andy Warren]. If the source is DB2 and the dataset is small enough and won't scale rapidly (that's a lot of conditions...), I use the OLEDB source as shown here. More times than not, I stage data from DB2 and Oracle in a staging SQL Server database. I make the staging database schema as close to the source system schema as possible, and I remove all the constraints from the staging tables. This makes for fast loads that interfere with the source system as little as possible. Plus, I now have the data in a SQL Server database which is, for me, more familiar territory. From there I can load the destination database incrementally. I'd likely stage the data incrementally as well - or transactionally, depending on the requirements of the system.

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

