Improving SSIS Update Performance

  • Comments posted to this topic are about the item Improving SSIS Update Performance

    Regards
    Brian Ellul
    ----------------------------------------------------------------------------------
    Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
    - Albert Einstein -

  • Fantastic article,

    Thanks

I've been using the UPSERT component from Pragmatic Works; it would be interesting to see that in the comparison.

Nice article; it's good to see some alternatives to the "insert into a staging table and then do an update" approach.

However, it's a pity we can't review the UPDATE statement used. The article says the UPDATE is performed using the surrogate key, so I can imagine this partitions nicely within the conditional split and you avoid locking issues on the destination table.
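
    Something like this, I'd imagine (the split expressions and object names are just my guesses):

    -- Guessed conditional split expressions, one per output/thread:
    --   SurrogateKey % 4 == 0, SurrogateKey % 4 == 1, and so on
    -- Each OLE DB Command then fires a keyed UPDATE for its own rows:
    UPDATE dbo.DestTable
    SET    Col1 = ?,            -- ? parameters mapped from the data flow columns
           Col2 = ?
    WHERE  SurrogateKey = ?;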

I can think of other scenarios, though, where the UPDATE statement isn't that straightforward and where multiple threads may interfere with each other.

    Anyway, very nice article and thanks for the effort.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Hi Brian,

First, I would like to thank you for the article.

I think this week or last week there was also an interesting discussion of the performance of the MERGE command, which is another tool to be considered.

Even though it is very nice and comfortable to design packages with the visual components, connectors and dropdown lists (almost only mouse clicks), most of the time scripts are needed to achieve better performance. And that is always the problem with the DFT components and row-by-row processing.

I know SQL scripts are more difficult to document, and the SSIS package becomes harder to understand (when someone else has to modify something), but I always try to use the MERGE command or the temporary-table solution.
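
    For example, a minimal MERGE sketch (table and column names are only placeholders):

    MERGE dbo.DestTable AS tgt
    USING staging.SourceTable AS src
        ON tgt.BusinessKey = src.BusinessKey
    WHEN MATCHED THEN
        UPDATE SET tgt.Col1 = src.Col1,
                   tgt.Col2 = src.Col2
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, Col1, Col2)
        VALUES (src.BusinessKey, src.Col1, src.Col2);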

    Kind Regards,

    Paul Hernández
• Thanks, nice article. A few comments:

1. Threads can interfere with each other if their records fall on the same pages, and this will be the reality unless the split is done on the surrogate key.

2. I would go even further and implement a differential update: first compare the data to find which records really need to be updated, then update only those records (see the sketch below). This technique minimizes the interaction time. I have a similar set of daily processes that need to do a simple update/insert across about 100,000,000 records in total, but in reality only a few thousand change. The initial process was built exactly the way you describe in the article; when the differential load was implemented, the interaction time (i.e. the unavailability of the table for the application) dropped from 5-10 minutes to a few seconds. I understand this requires additional work, but I think in most cases it is the proper way to go.
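
    A minimal sketch of the differential approach, assuming a staging table and a business key (all names here are placeholders):

    -- 1. Work out which staged rows actually differ from the target
    SELECT s.BusinessKey, s.Col1, s.Col2
    INTO   #Changed
    FROM   staging.SourceTable AS s
    JOIN   dbo.DestTable       AS t ON t.BusinessKey = s.BusinessKey
    WHERE  EXISTS (SELECT s.Col1, s.Col2
                   EXCEPT
                   SELECT t.Col1, t.Col2);   -- NULL-safe "is different" check

    -- 2. Touch only the rows that changed, keeping the lock time on the target short
    UPDATE t
    SET    t.Col1 = c.Col1,
           t.Col2 = c.Col2
    FROM   dbo.DestTable AS t
    JOIN   #Changed      AS c ON c.BusinessKey = t.BusinessKey;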

  • My favorite data loading aids:

    1. The Data Loading Performance Guide White Paper

    2. SSIS Operational and Tuning Guide White Paper

    There are so many factors that can affect performance, but the simple concept of adding threads is well illustrated here and may give someone another weapon in their ETL arsenal. Thank you for the contribution.

  • "ETL load"

    extract transform load load

• Nice article. Your numbering system needs alignment (it starts off zero-based and then moves to one-based in the results table).

    I would be interested to see whether the final 'staging table' option would run even faster if it was indexed just before issuing the UPDATE.

Another technique I have seen used is INSERTing all rows (inserts and updates) directly into a flat view of the table (CREATE VIEW AS SELECT * FROM table) which has an INSTEAD OF INSERT trigger that performs a MERGE. The good thing about this is that, by tuning the batch commit size on the SSIS destination, you can achieve a near-continuous data flow. Might be worth a test ...
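
    Roughly like this (object names invented for the example); the SSIS destination then points at the view with a suitable batch commit size:

    CREATE VIEW dbo.DestTable_Load
    AS
    SELECT BusinessKey, Col1, Col2 FROM dbo.DestTable;
    GO
    CREATE TRIGGER trg_DestTable_Load
    ON dbo.DestTable_Load
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Each SSIS batch arrives here as "inserted" and is merged into the base table
        MERGE dbo.DestTable AS tgt
        USING inserted      AS src ON tgt.BusinessKey = src.BusinessKey
        WHEN MATCHED THEN
            UPDATE SET tgt.Col1 = src.Col1,
                       tgt.Col2 = src.Col2
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (BusinessKey, Col1, Col2)
            VALUES (src.BusinessKey, src.Col1, src.Col2);
    END;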

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Gosh... still more than 2 minutes using the "fastest" option. How many rows were in the source?

Heh... and I have to admit this is the first time I've ever seen fully intentional RBAR in an SSIS package. If all the other packages contain the same type of problem, it may very well be worth spending a couple of extra minutes on each to fix them. I'd also likely use a different tool. For example, I can only guess that the reason someone wrote the original package as RBAR was to allow "healthy" rows to be imported even if "erroneous" rows exist and are rejected by keys on the destination table. BULK INSERT and BCP both allow "bad" rows to be sequestered in a separate file while the good rows are all loaded. With a BCP format file used correctly (helpful, though not required) and depending, of course, on the number of FKs and other constraints on the target table, both BULK INSERT and BCP are capable of loading millions of rows per minute, including some data validation in the process.
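
    For example, something along these lines (paths and names are made up):

    BULK INSERT dbo.DestTable
    FROM 'C:\Loads\Source.dat'
    WITH (
        FORMATFILE = 'C:\Loads\Source.fmt',  -- optional, but helps with validation
        ERRORFILE  = 'C:\Loads\Source.err',  -- "bad" rows are sequestered here
        MAXERRORS  = 100000,                 -- keep loading the good rows
        BATCHSIZE  = 100000,
        TABLOCK
    );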

    And nice article. Thanks for taking the time to write it and publish it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

• Great article. Lots to think about. Any thought to testing deleting the rows to be updated so that everything is then an insert? I'm not necessarily keen on that method, but it would be nice to see how it performs alongside the other methods you have, or to hear other opinions of that method.

• Oh... I almost forgot. Check the destination table for triggers. If it has one of those bloody "generic audit triggers" (GAT) on it, that can cause huge ETL performance problems even if it's written as a CLR. We had such a thing at work, and when you updated just 4 columns on 10,000 rows of a wide table, it took the "GAT" 4 minutes to do its job. I replaced all of the GATs with hardcoded triggers that I wrote code to generate, and delays like that dropped from 4 minutes to < 800 ms.

    --Jeff Moden



  • RonKyle (10/31/2013)


Great article. Lots to think about. Any thought to testing deleting the rows to be updated so that everything is then an insert? I'm not necessarily keen on that method, but it would be nice to see how it performs alongside the other methods you have, or to hear other opinions of that method.

Deletes can be as costly as inserts (sometimes even more costly) because they affect all indexes, triggers, indexed views and, possibly, some FK checks. Insert/Update will almost always run faster than Delete/Insert.

    --Jeff Moden



• Dunno if it's worth mentioning, but the downside to the temp table + update option is the lack of SSIS-level, row-by-row error reporting (which makes sense, given you're operating set-based). In cases like that, my general approach is to create a preload table that SSIS dumps into, one that *has the same schema as the target table*. That way, most common errors (think NULL column violations, column widths exceeded, etc.) are caught while populating the preload table and can be handled the usual way. Then, when doing the update, you can be confident that the staged data will load correctly.
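
    Roughly (names are placeholders): SSIS dumps into dbo.DestTable_Preload, which mirrors the target's schema, and the package then runs a single set-based statement such as:

    UPDATE t
    SET    t.Col1 = p.Col1,
           t.Col2 = p.Col2
    FROM   dbo.DestTable         AS t
    JOIN   dbo.DestTable_Preload AS p ON p.BusinessKey = t.BusinessKey;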

  • If you are looking for performance in a bulk-update scenario, you might want to look into using a View with an Instead Of trigger. That way you can "bulk insert" to the view and have the trigger do the update for you. That allows you to take advantage of SSIS's batching in order to do smaller commits, assuming that makes sense in your case.
