Merge Performance

  • Hi all,

I'm merging two tables. Each table currently contains approximately 1 million records (the tables will always be a similar size). The MERGE is taking around 25-30 minutes, which seems far too long to me.

    Each table has the following indexes:

    - nonclustered, unique, primary key (nvarchar(60)) - used for the MERGE join.

    - clustered, unique (int)

    Other indexes have been disabled.
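
For reference, the relevant structure looks roughly like this (hypothetical DDL — the real table has more columns, and the names match what comes up later in the thread):

[Code="sql"]
-- Hypothetical DDL matching the description above; other columns omitted.
CREATE TABLE dbo.Product
(
    ProductID    nvarchar(60) NOT NULL
        CONSTRAINT PK_Product PRIMARY KEY NONCLUSTERED,
    OurProductID int          NOT NULL
    -- plus the remaining data columns
);

CREATE UNIQUE CLUSTERED INDEX IX_Product_OurProductID
    ON dbo.Product (OurProductID);
[/code]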

There are four indexed views based (partly) on the target table. I mention this because they appear in the execution plan with a warning (at least one of the views is empty, so it has no statistics), although this doesn't appear to have any impact on the operation.

    The attached execution plan is from a MERGE operation when the contents of both tables were exactly the same, i.e. no INSERT or DELETE operations were performed.

    Can you suggest any areas for improvement, or even a better alternative to the MERGE statement?

    If you need additional information please let me know.

    Paul

I use the following. In our scenario we are merging a single site into a table that contains all sites, and I find that filtering through a CTE restricts the query plan to the relevant rows, helping performance dramatically: instead of merging 100 changes into 100,000 rows, it's merging 100 into 1,000.

I use the snippet below as a header to the MERGE statement. The idea isn't mine, of course; it came from a blog somewhere on optimising the MERGE statement.

[Code="sql"]
--CR 04/07/2012 Optimised Merge with a CTE filter
WITH FilteredSite AS
(
    SELECT *
    FROM activity
    WHERE Site_Id = @Site_Id
)
MERGE FilteredSite AS target
...
[/code]
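
Fleshed out, the whole pattern looks something like this (a sketch only — the source table, join column, and column lists are assumptions, not our actual code):

[Code="sql"]
-- Sketch; incoming_activity, Activity_Id and Qty are hypothetical names.
WITH FilteredSite AS
(
    SELECT *
    FROM activity
    WHERE Site_Id = @Site_Id
)
MERGE FilteredSite AS target
USING incoming_activity AS source
    ON (target.Activity_Id = source.Activity_Id)
WHEN MATCHED THEN
    UPDATE SET Qty = source.Qty
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Site_Id, Activity_Id, Qty)
    VALUES (source.Site_Id, source.Activity_Id, source.Qty)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
[/code]

The important part is that the WHEN NOT MATCHED BY SOURCE branch only sees rows inside the filter, so the MERGE never touches the other sites.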

  • Another thing you may want to consider is adding in some sort of check to see if the columns actually have different values. I use CHECKSUM for this....

[Code="sql"]
WHEN MATCHED AND
(
    CHECKSUM([Source].whatever) <> CHECKSUM([Target].whatever)
)
THEN UPDATE.....
[/code]

    That way you're only updating rows that have changed and not simply all that match.

    🙂


    I'm on LinkedIn

  • yayomayn (7/19/2012)


    Another thing you may want to consider is adding in some sort of check to see if the columns actually have different values. I use CHECKSUM for this....

[Code="sql"]
WHEN MATCHED AND
(
    CHECKSUM([Source].whatever) <> CHECKSUM([Target].whatever)
)
THEN UPDATE.....
[/code]

    That way you're only updating rows that have changed and not simply all that match.

    🙂

I do the same thing. It's a huge performance increase when most of the data doesn't change frequently but might change periodically. I tried testing equality on each column in source and target, but NULL issues definitely make CHECKSUM the way to go there.
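
To illustrate the NULL problem (made-up column name): with the default ANSI_NULLS setting, an inequality is never true when either side is NULL, so a change to or from NULL slips through a plain column-by-column test.

[Code="sql"]
-- SomeCol is a placeholder. NULL <> 'x' evaluates to UNKNOWN, not TRUE,
-- so this silently skips rows where either side is NULL:
WHEN MATCHED AND [Source].SomeCol <> [Target].SomeCol THEN UPDATE.....
-- A NULL-safe per-column test needs sentinel wrapping, which is fragile
-- if the sentinel could be a real value:
WHEN MATCHED AND ISNULL([Source].SomeCol, '') <> ISNULL([Target].SomeCol, '') THEN UPDATE.....
[/code]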

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks for your prompt responses.

@colin - I implemented this and the time came down to around 16 mins, which is a big improvement, thanks.

    @yayomayn - I'm concerned by what it says on MSDN about CHECKSUM:

    "there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change."

    They propose using HASHBYTES instead. I'm not sure how to use HASHBYTES, so I'll have to do some research.

    Paul

  • bpw (7/19/2012)


    Thanks for your prompt responses.

@colin - I implemented this and the time came down to around 16 mins, which is a big improvement, thanks.

    @yayomayn - I'm concerned by what it says on MSDN about CHECKSUM:

    "there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change."

    They propose using HASHBYTES instead. I'm not sure how to use HASHBYTES, so I'll have to do some research.

    Paul

HASHBYTES is pretty much the same; you just have to choose which hash algorithm to use. It's pretty easy.
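
For example, something along these lines (a sketch only: the columns are placeholders, SHA1 is just one algorithm choice, and CONCAT needs SQL Server 2012 or later):

[Code="sql"]
-- Placeholders throughout. CONCAT treats NULLs as empty strings, and the '|'
-- separator stops ('ab','c') hashing the same as ('a','bc'). Note that the
-- HASHBYTES input is capped at 8000 bytes before SQL Server 2016.
WHEN MATCHED AND
    HASHBYTES('SHA1', CONCAT([Source].Col1, '|', [Source].Col2))
    <> HASHBYTES('SHA1', CONCAT([Target].Col1, '|', [Target].Col2))
THEN UPDATE.....
[/code]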

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • bpw (7/19/2012)

    Each table has the following indexes:

    - nonclustered, unique, primary key (nvarchar(60)) - used for the MERGE join.

    - clustered, unique (int)

    Paul

    >> ON (t.ProductID = s.ProductID) <<

    So is ProductID the nvarchar(60)?? [Btw, does it really need to be nvarchar?]

    Which column is the "clustered" int?

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Hi Scott

    Yes, ProductID is the nvarchar(60) and no, it doesn't need to be nvarchar. I've been reluctant to change it because of the implications for other tables/SPs - would it make a difference?

The column 'OurProductID' is the clustered int.

    Since posting, I've been experimenting with batching, as suggested by Colin. I've found that batching with TOP (as suggested on MSDN) makes no difference.

    I've also tested using the statement:

[Code="sql"]
WHEN MATCHED AND NOT EXISTS (SELECT s.* INTERSECT SELECT t.*) THEN UPDATE
[/code]

    This improved things slightly (surprised me), and I'm now down to around 10 minutes for the merge.
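
(As I understand it, this works because INTERSECT compares rows with set semantics, under which two NULLs count as equal — so it's a NULL-safe "has anything changed?" test without wrapping every column. A quick demonstration:)

[Code="sql"]
-- Returns one row: INTERSECT treats NULL as equal to NULL,
-- whereas the predicate NULL = NULL evaluates to UNKNOWN.
SELECT CAST(NULL AS int) AS v
INTERSECT
SELECT CAST(NULL AS int) AS v;
[/code]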

    Is this a reasonable time, or might it be improved? It still seems a long time to me.

    Paul

I suspect the MERGE time could be reduced by clustering on ProductID, but with it being nvarchar(60), that's not particularly appealing.

If you can go to just varchar(60), and the average length is 30 or less, and it matches other uses of the table better than the ident does [there is almost always a better clustered key than an identity column], I would try clustering the table on ProductID.
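
Roughly this, if you go that route (hypothetical constraint/index names — and it means a full rebuild, so pick a quiet window):

[Code="sql"]
-- Hypothetical names. Rebuild the PK as the clustered index and move the
-- unique int key to a nonclustered index.
ALTER TABLE dbo.Product DROP CONSTRAINT PK_Product;
DROP INDEX IX_Product_OurProductID ON dbo.Product;

ALTER TABLE dbo.Product
    ADD CONSTRAINT PK_Product PRIMARY KEY CLUSTERED (ProductID);
CREATE UNIQUE NONCLUSTERED INDEX IX_Product_OurProductID
    ON dbo.Product (OurProductID);
[/code]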

    For additional index info, please post results of these queries:

[Code="sql"]
USE <your_db_name>

SELECT
    GETDATE() AS capture_date,
    DB_NAME(mid.database_id) AS Db_Name,
    OBJECT_NAME(mid.object_id) AS Table_Name,
    mid.*, igs.*
FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)
LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON
    mig.index_handle = mid.index_handle
LEFT OUTER JOIN sys.dm_db_missing_index_group_stats igs WITH (NOLOCK) ON
    igs.group_handle = mig.index_group_handle
WHERE
    1 = 1
    AND mid.database_id = DB_ID()
    AND mid.object_id IN (OBJECT_ID('<table_name>'))
ORDER BY
    Db_Name, Table_Name, equality_columns, inequality_columns
[/code]

[Code="sql"]
SELECT
    DB_NAME() AS db_name, OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name, i.name AS index_name,
    (SELECT DATEDIFF(DAY, create_date, GETDATE()) FROM sys.databases WHERE name = 'tempdb') AS sql_up_days,
    dps.row_count,
    --ius.user_seeks + ius.user_scans AS total_reads,
    ius.*
FROM sys.indexes i WITH (NOLOCK)
LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
    dps.object_id = i.object_id AND
    dps.index_id = i.index_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON
    ius.database_id = DB_ID() AND
    ius.object_id = i.object_id AND
    ius.index_id = i.index_id
WHERE
    i.object_id > 100 AND
    i.is_hypothetical = 0
    AND OBJECT_NAME(i.object_id) IN ('<table_name>')
ORDER BY
    1, 2, CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END, 3
[/code]

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

My target table is 44 million rows and has an int PK, but it is clustered on the Site and the PK, with the site first. I typically merge 107k rows per site, averaging about 2 mins.

Check your disk IO to the transaction log; mine's a SAN that's delivering 40 MB a sec.
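
(For illustration, a composite clustered key like that might be declared as follows — hypothetical names:)

[Code="sql"]
-- Site_Id leads so one site's rows are physically contiguous; adding the
-- int PK column makes the key unique. Names are hypothetical.
CREATE UNIQUE CLUSTERED INDEX IX_Activity_Site_Id
    ON dbo.Activity (Site_Id, Activity_Id);
[/code]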

  • I wanted to update this thread to let you know what I've implemented and to thank you for your ideas and input.

    @scott - you got me thinking about the keys on the Product table. Although the nvarchar column needs to be a unique key, I'm still not sure why it was chosen as the PK. After assessing the impact of moving the PK to the int field, I realised the change would be relatively straightforward, so we've implemented this. We're now merging on the int columns.

@colin - as you suggested, I've implemented the merge on a 'filtered' basis using the parent table PK. I'm using a VIEW rather than a CTE, though, because I'm frightened by the warning in this article:

    "Use the WITH <common table expression> clause to filter out rows from the source or target tables. This method is similar to specifying additional search criteria in the ON clause and may produce incorrect results. We recommend that you avoid using this method or test thoroughly before implementing it."

    Using this filtered method has also allowed us to exclude some records based on the parent table record status.
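
The view is roughly this shape (hypothetical names — a view over a single table remains updatable, so it can be the MERGE target directly):

[Code="sql"]
-- Hypothetical names. The EXISTS filter keeps the view updatable while
-- excluding products whose parent record isn't active.
CREATE VIEW dbo.ProductsToMerge
AS
SELECT p.*
FROM dbo.Product AS p
WHERE EXISTS
(
    SELECT 1
    FROM dbo.ProductParent AS pp
    WHERE pp.ParentID = p.ParentID
      AND pp.Status = 1
);
GO

MERGE dbo.ProductsToMerge AS target
USING dbo.ProductStaging AS source
    ON (target.OurProductID = source.OurProductID)
...
[/code]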

    The end result is that the merge is now taking less than 3mins, which is a huge improvement.

    I've also made it easy to switch between three merge methods: filtered (as described), unfiltered (the two tables are merged without filtering and all records are compared), and unfiltered in batches, as described in the above article. As the records grow we'll be able to switch between methods to make sure we're still using the most efficient approach.
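
The batched variant is roughly the MSDN TOP-in-a-loop pattern (a sketch with hypothetical names; note that TOP limits the rows modified, not the join work, which may be why it made little difference for us):

[Code="sql"]
-- Each pass modifies at most 50,000 rows. Rows brought in sync stop failing
-- the INTERSECT test, and inserts/deletes shrink the differences, so
-- @@ROWCOUNT eventually reaches 0 and the loop ends. Names are hypothetical.
DECLARE @rc int = 1;
WHILE @rc > 0
BEGIN
    MERGE TOP (50000) dbo.Product AS t
    USING dbo.ProductStaging AS s
        ON (t.OurProductID = s.OurProductID)
    WHEN MATCHED AND NOT EXISTS (SELECT s.* INTERSECT SELECT t.*) THEN
        UPDATE SET ProductName = s.ProductName   -- plus the other columns
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (OurProductID, ProductID, ProductName)
        VALUES (s.OurProductID, s.ProductID, s.ProductName)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;
    SET @rc = @@ROWCOUNT;
END;
[/code]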

    Thank you again for your interest and input into this - I'm very grateful.

    Paul
