How to performance-tune this MERGE statement for our DW ETL process?

  • Hi all,

    I have a relatively simple (logic-wise) MERGE statement that we run every day to update the data warehouse's dimension/fact tables from the raw ODS table, but it's taking so long that I'd like some new ideas on how to tune it better.

    The general syntax of the MERGE statement is simply:

    MERGE ZooskDataWarehouse.dbo.DimUser AS target
    USING OdsUsers AS source
        ON target.UserID = source.UserID
    WHEN MATCHED AND target.LastUpdated < source.LastUpdated THEN
        UPDATE SET xxxxxx
    WHEN NOT MATCHED BY TARGET THEN
        INSERT XXXXX ;

    Basically, each day production (the web site) sends DW (the data warehouse) all the users who have updated their accounts. I then merge all of that data into our ODS table (which stores all the users' data in its raw form).

    Then I merge the ODS into both DimUser (the user dimension) and FactUser (the user fact table).

    Quite simple in logic. But the problem is the volume.

    We have about 120 million rows of user data (so 120MM rows each for the ODS, DIM, and FACT user tables).

    Each day, about 7 million rows of user data change. Most of them are modifications to existing users that go through the update path.

    The SQL itself takes about an hour to run, even on fast CPUs and SSD RAIDed disks, and that is actually only about 1/3 of the time of the overall process.

    How can I make it more efficient?

    What should I set as my clustered index? (Right now I only have the clustered index on UserID, and a nonclustered index on UserID and LastUpdated.)
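
    To be concrete, the current index layout is roughly the following (a sketch from memory; the actual index names differ):

    -- Assumed current layout on the dimension table (names are illustrative)
    CREATE CLUSTERED INDEX CIX_DimUser_UserID
        ON dbo.DimUser (UserID);

    CREATE NONCLUSTERED INDEX IX_DimUser_UserID_LastUpdated
        ON dbo.DimUser (UserID, LastUpdated);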

    Should I set up partitioned tables? (The problem is that all users have a chance of being updated, so you can't really partition by time.)

    I already have the database set up with multiple files/filegroups.

    During the ETL, I can see that CPU/memory utilization is very low. There is disk activity, but I can't say it's maxed out either, so I suspect there's a lot of room for improvement in terms of efficiency.

    Another thought is to put a WHERE clause on the SOURCE part of the MERGE, for example (SELECT * FROM ODSUsers WHERE LastUpdated > GETDATE() - 1).

    Logically it makes sense, because then I am only merging the newly updated users into the dimension. However, it actually turns out to be slow, mostly because that WHERE clause is so inefficient to evaluate.

    I also thought about putting the clustered index on LastUpdated so that select would be quick, but the index maintenance becomes very inefficient because of all the data shuffling when rows are updated.

    Any suggestions on how I can do better? Thanks.

    Without the query plan it's hard to give firm advice.

    I would definitely try reducing the rows going into the MERGE statement:

    ;WITH UpdatedOdsUsers AS
    (
        SELECT * FROM OdsUsers
        WHERE LastUpdated > DATEADD(dd, -1, GETDATE())
    )
    MERGE ZooskDataWarehouse.dbo.DimUser AS target
    USING UpdatedOdsUsers AS source
        ON target.UserID = source.UserID
    WHEN MATCHED AND target.LastUpdated < source.LastUpdated THEN
        UPDATE SET xxxxxx
    WHEN NOT MATCHED BY TARGET THEN
        INSERT XXXXX ;

    You may want to split the MERGE back into a two-step process, as it's easier to tackle the performance of each step individually.

    MERGE is handy, but hard to tune.

    -- Optimize the insert by having nonclustered indexes on UserId on both
    -- tables. By inserting default values, IO reads are greatly reduced.
    INSERT DimUser (UserId, c1, c2, c3)
    SELECT UserId, '', '', 0  -- defaults!
    FROM OdsUser
    WHERE UserId NOT IN (SELECT UserId FROM DimUser);

    UPDATE du
    SET xxxxxx
    FROM DimUser du
    INNER JOIN OdsUser ou ON ou.UserId = du.UserId
    WHERE du.LastUpdated < ou.LastUpdated
      AND ou.LastUpdated > DATEADD(dd, -1, GETDATE());

    How many other indexes are on these tables? It may be quicker to drop/recreate the non-clustered indexes vs leaving them intact.
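
    If dropping them does turn out to be quicker, one way to do it without losing the index definitions is to disable them before the load and rebuild afterwards (index name is just a placeholder; repeat for each nonclustered index you decide to take offline for the load):

    ALTER INDEX IX_DimUser_SomeColumn ON dbo.DimUser DISABLE;

    -- ... run the INSERT / UPDATE (or the MERGE) here ...

    ALTER INDEX IX_DimUser_SomeColumn ON dbo.DimUser REBUILD;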

    If you can post the plan, it would really help.

    thanks!

    120 million rows is quite a large number. Have you ever thought of partitioning the table?

    Say, for example, you have around 8 partitions. Then you could start 8 sessions, with each session inserting roughly 900K of the changed rows into its own partition based on the partition key.

    Even if you drop indexes to load faster, recreating indexes on partitioned tables should be fast.

    But again, see whether partitioning suits your situation and provides the performance you are looking for.
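
    A rough sketch of what range-partitioning DimUser by UserID into 8 partitions could look like (boundary values, names, and filegroup placement are illustrative only):

    -- Illustrative only: 7 boundaries = 8 partitions of ~15MM users each
    CREATE PARTITION FUNCTION pfUserID (int)
        AS RANGE RIGHT FOR VALUES (15000000, 30000000, 45000000,
                                   60000000, 75000000, 90000000, 105000000);

    CREATE PARTITION SCHEME psUserID
        AS PARTITION pfUserID ALL TO ([PRIMARY]);

    -- Rebuild the existing clustered index onto the partition scheme
    CREATE CLUSTERED INDEX CIX_DimUser_UserID
        ON dbo.DimUser (UserID)
        WITH (DROP_EXISTING = ON)
        ON psUserID (UserID);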

    GulliMeel

    Finding top n Worst Performing queries
    Improve the performance of Merge Join (special case)
    How to Post Performance Problem - Gail Shaw

    You'd need to show the actual execution plan in order to make any guesses at what's needed to tune the query. One suggestion: see if there are ways to limit the initial data set to only the data that has been touched over the preceding 24 hours. Then the MERGE will be working with less data, and that'll help. Otherwise, no guesses without some idea of the structure and how the query is executing.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • The problem very likely has nothing to do with the number of rows being compared. It's probably the number of rows being updated.

    I don't have the definitive test handy to show you, but I did run one in the past. On my test box, it took scant seconds (about 4) to update a million rows in a much larger table. When I doubled the number of rows, it only doubled the amount of time, just as expected. But when I tripled the number of rows, instead of it only taking 3 times longer (12 seconds in this case), it took more than 20 minutes.

    I call these things "tipping points" and every server has them. The size may vary, but they all have them.

    My recommendation would be to break up the MERGE so that you're only working with a million rows at a time. 7 updates of a million rows each, in a total of maybe 28 seconds (a guess, for sure), is going to be a lot better than your hour-long run.
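
    For example, something along these lines (a sketch only; the column lists are elided just as in the original post, and the batch-by-UserID logic and variable names are purely illustrative):

    -- Process the changed users in ~1 million row batches, keyed on UserID
    DECLARE @BatchSize int = 1000000,
            @MinUserID int = 0,
            @MaxUserID int;

    WHILE 1 = 1
    BEGIN
        -- Find the upper UserID bound for the next batch of changed rows
        SELECT @MaxUserID = MAX(UserID)
        FROM (
            SELECT TOP (@BatchSize) UserID
            FROM OdsUsers
            WHERE UserID > @MinUserID
              AND LastUpdated > DATEADD(dd, -1, GETDATE())
            ORDER BY UserID
        ) AS batch;

        IF @MaxUserID IS NULL BREAK;  -- no more changed rows

        MERGE ZooskDataWarehouse.dbo.DimUser AS target
        USING (SELECT *
               FROM OdsUsers
               WHERE UserID > @MinUserID
                 AND UserID <= @MaxUserID
                 AND LastUpdated > DATEADD(dd, -1, GETDATE())) AS source
            ON target.UserID = source.UserID
        WHEN MATCHED AND target.LastUpdated < source.LastUpdated THEN
            UPDATE SET xxxxxx
        WHEN NOT MATCHED BY TARGET THEN
            INSERT XXXXX ;

        SET @MinUserID = @MaxUserID;
    END;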

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You say ETL... are you using SSIS for the ETL?

    If so, you can sometimes outperform merge by using the constructs within SSIS to mimic a merge.

    Use a lookup to see if the record already exists... if it doesn't, do an insert....

    Do a second lookup to see if the record has changed... if it has... do an update....

    else just toss out the record.

    With my larger DW tables (250 million rows) that seems to give better performance than the MERGE. I suspect memory pressure, but our DW server is always under memory pressure these days (looking forward to our nice new upgrade).
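
    Shown as a rough T-SQL analogue (the real implementation uses SSIS Lookup components in the data flow, not these queries; names follow the earlier posts):

    -- 1) First lookup: no match on UserID -> insert the new row
    INSERT dbo.DimUser (UserID /* , ...columns elided as in the posts above */)
    SELECT ou.UserID /* , ... */
    FROM OdsUsers ou
    WHERE NOT EXISTS (SELECT 1 FROM dbo.DimUser du WHERE du.UserID = ou.UserID);

    -- 2) Second lookup: match found and the row has changed -> update it
    UPDATE du
    SET xxxxxx
    FROM dbo.DimUser du
    INNER JOIN OdsUsers ou ON ou.UserID = du.UserID
    WHERE du.LastUpdated < ou.LastUpdated;

    -- 3) Rows that exist and haven't changed fall through both paths and are discarded.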



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • mtassin (5/18/2012)


    With my larger DW tables (250 Million rows) that seems to give better performance than the merge...

    A race would be interesting.

    --Jeff Moden


