Massive Merge

  • Hi peeps,

    I am looking at using a MERGE on a table with circa 100 million rows, and will be using all three parts of the merge, i.e. insert, update and delete.

    The problems I am encountering are the time taken and the fact that the drive holding the log files runs out of space. I also suspect the lack of space hurts performance.

    Is there an easy way for the MERGE to be run in chunks? My grasp of transaction log behaviour is patchy, but as we are in SIMPLE recovery mode the log space should become re-usable once a transaction is committed, so if we can do the MERGE in multiple transactions the file should not need to grow as large.

    At the moment the log file bloats to > 50GB. Now if that MERGE were done in ten individual transactions, then once the first transaction completed its log space would be re-used by the second transaction, and so on.

    Or have I got this wrong?

    Cheers

    E

    PS: I have indexed the tables to try to improve performance.


  • You can try to "partition" your source query by adding WHERE clauses.

    Adding indexes might improve performance, but they take up more space, so it's a double-edged sword. 🙂

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

    As I have the MERGE in a stored proc, I was considering passing in a couple of parameters and using a WHERE clause in the SOURCE element of the MERGE to split it into chunks.

    Sounds like that bit is a sensible idea.
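
    Roughly what I have in mind is something like this -- just a rough sketch, and dbo.Target, dbo.Source and the Id/Col1/Col2 columns are all made-up names:

        CREATE PROCEDURE dbo.MergeBatch
            @FromId INT,    -- inclusive lower bound of the key range for this chunk
            @ToId   INT     -- exclusive upper bound
        AS
        BEGIN
            SET NOCOUNT ON;

            MERGE dbo.Target AS t
            USING (SELECT Id, Col1, Col2
                   FROM   dbo.Source
                   WHERE  Id >= @FromId AND Id < @ToId) AS s
               ON t.Id = s.Id
            WHEN MATCHED AND (t.Col1 <> s.Col1 OR t.Col2 <> s.Col2) THEN
                UPDATE SET Col1 = s.Col1, Col2 = s.Col2
            WHEN NOT MATCHED BY TARGET THEN
                INSERT (Id, Col1, Col2) VALUES (s.Id, s.Col1, s.Col2)
            -- the DELETE branch has to be limited to the same key range,
            -- otherwise every target row outside the current chunk would be deleted
            WHEN NOT MATCHED BY SOURCE AND t.Id >= @FromId AND t.Id < @ToId THEN
                DELETE;
        END;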

    Cheers

    E

    Batching large DML is a very common (and, as you have found, often REQUIRED) process. Just make sure that:

    1) you use ranges that are guaranteed not to overlap and that ensure all data that should be processed together falls in the same batch range

    2) use explicit transaction control

    3) use error handling to roll back, exit, and stop looping if an error occurs

    4) for optimal performance on a large set like this it is often best to drop all but the ONE index you need for index seeks on each batch, do the work, then rebuild all indexes. Beware: dropping the clustered index is a very expensive operation with some potentially unwanted side-effects, so you might want to keep that one. Obviously this assumes the table can be taken offline for other access that might need those dropped indexes.

    5) consider disabling auto-update-stats for the table of concern (and all its statistics) while doing batch operations. It is a waste of time for SQL Server to interrupt what you are doing to update stats just because you modified some rows.

    You can often do bulk operations like the above even while the system is completely online if you simply pick a batch size that keeps you on index seeks and doesn't escalate locking to the table level. A bare-bones skeleton of what I mean follows.
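
    Purely as a sketch -- the proc name dbo.MergeBatch, the table names and the chunk size are all illustrative, pick whatever fits your data:

        -- optional (item 5): stop auto stats updates on the target table for the duration
        -- EXEC sp_autostats 'dbo.Target', 'OFF';
        -- optional (item 4): disable non-essential nonclustered indexes first, rebuild afterwards
        -- ALTER INDEX IX_SomeIndex ON dbo.Target DISABLE;   -- hypothetical index name

        DECLARE @FromId    INT = 1,
                @ToId      INT,
                @ChunkSize INT = 500000,   -- size that stays on index seeks / below lock escalation
                @MaxId     INT;

        SELECT @MaxId = MAX(Id) FROM dbo.Source;

        WHILE @FromId <= @MaxId
        BEGIN
            SET @ToId = @FromId + @ChunkSize;

            BEGIN TRY
                BEGIN TRANSACTION;                                      -- item 2: one explicit transaction per batch

                EXEC dbo.MergeBatch @FromId = @FromId, @ToId = @ToId;   -- the chunked MERGE

                COMMIT TRANSACTION;
            END TRY
            BEGIN CATCH
                IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;                -- item 3: roll back and stop looping
                THROW;   -- SQL 2012+; use RAISERROR plus RETURN on older versions
            END CATCH;

            SET @FromId = @ToId;
        END;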

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Many thanks for that.

    Mark

  • DOH!! TOTALLY forgot to mention the transaction log! You need to manage that carefully or you can fill up the disk pretty easily (or at least wind up with a massive tlog). With explicit transaction control you can throw in BACKUP LOG commands as necessary (even doing dbcc sqlperf(logspace) or other means to see how full tlog is getting). If you are in SIMPLE recovery mode you should still do CHECKPOINTs every now and again to enable committed tran flushing.

    A WAITFOR DELAY on each loop is also common to prevent your actions from taking up too much server resources and/or to allow other activity to access your table or get some breathing room if you are doing this while the app is completely online. It also allows the CHECKPOINT to complete.
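
    For instance, right after the COMMIT of each batch you could do something along these lines (the delay and the monitoring are just suggestions, adjust to taste):

        CHECKPOINT;                  -- SIMPLE recovery: lets the committed portion of the log be reused
        WAITFOR DELAY '00:00:05';    -- breathing room for other activity and for the checkpoint itself

        -- optional: keep an eye on how full the log is getting
        DBCC SQLPERF(LOGSPACE);      -- one row per database: log size and % of log space used
        -- (in FULL recovery you would take BACKUP LOG backups instead)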

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (8/23/2013)


    DOH!! TOTALLY forgot to mention the transaction log! You need to manage that carefully or you can fill up the disk pretty easily (or at least wind up with a massive tlog). With explicit transaction control you can throw in BACKUP LOG commands as necessary (even doing dbcc sqlperf(logspace) or other means to see how full tlog is getting). If you are in SIMPLE recovery mode you should still do CHECKPOINTs every now and again to enable committed tran flushing.

    Yep, did not take me long to find the tlog bloats like crazy. Now off googling committed transaction flushing. I think that the checkpoints are the missing link in my plan.

    Cheers

    Mark

  • Ells (8/23/2013)


    TheSQLGuru (8/23/2013)


    DOH!! TOTALLY forgot to mention the transaction log! You need to manage that carefully or you can fill up the disk pretty easily (or at least wind up with a massive tlog). With explicit transaction control you can throw in BACKUP LOG commands as necessary (even doing dbcc sqlperf(logspace) or other means to see how full tlog is getting). If you are in SIMPLE recovery mode you should still do CHECKPOINTs every now and again to enable committed tran flushing.

    Yep, did not take me long to find the tlog bloats like crazy. Now off googling committed transaction flushing. I think that the checkpoints are the missing link in my plan.

    Cheers

    Mark

    The WAITFOR can be important too, as there is kind of a throttling mechanism on CHECKPOINT to keep it from overwhelming the system.

    Also tlog VLFs cannot be flushed if they contain UNcommitted transactions, so if you have a busy system (or very large VLFs) you could be in trouble...
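
    If you want to see that for yourself, DBCC LOGINFO lists the VLFs of the current database (the database name below is hypothetical):

        USE YourDb;       -- hypothetical database name
        DBCC LOGINFO;     -- one row per VLF; Status = 2 means still in use, 0 means reusable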

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (8/22/2013)


    2) use explicit transaction control

    3) use error handling to roll back, exit, and stop looping if an error occurs

    Note that you cannot wrap the whole thing into a single transaction, because then you will have the same effect as when you ran everything in a single statement.

    Basically, when one batch is done, it should be committed. But since someone could pull the plug while the operation is running, you need some method to make sure that you do not cause a mess if the operation is restarted. One such mechanism is sketched below.
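
    For what it's worth, one way to do that (the control table dbo.MergeProgress and the dbo.MergeBatch proc are invented for the example) is to record how far you have got in the same transaction as each batch:

        -- one-off: a tiny table that remembers how far we have got
        CREATE TABLE dbo.MergeProgress (LastIdDone INT NOT NULL);
        INSERT INTO dbo.MergeProgress (LastIdDone) VALUES (0);

        -- per batch, inside the loop:
        DECLARE @FromId INT, @ToId INT, @ChunkSize INT = 500000;

        SELECT @FromId = LastIdDone + 1 FROM dbo.MergeProgress;   -- resume point after a restart
        SET @ToId = @FromId + @ChunkSize;

        BEGIN TRANSACTION;
            EXEC dbo.MergeBatch @FromId = @FromId, @ToId = @ToId;
            UPDATE dbo.MergeProgress SET LastIdDone = @ToId - 1;   -- committed together with the batch,
                                                                   -- so a restart neither redoes nor skips rows
        COMMIT TRANSACTION;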

    Erland Sommarskog, SQL Server MVP, www.sommarskog.se
