Deleting Millions of Rows from Multiple Tables

  • george sibbald (5/1/2009)


    GSquared (5/1/2009)


    I think I'd go with something at the beginning of the loop in the proc that checks the log and backs it up as needed.

    That way, you don't have the concurrency issues that you might otherwise run into.

    Could you expand on that, Gus? Do you mean avoid backing up the log at the same time you are writing to it?

    You do need to be careful where you set the percentage full, but it's a damn good failsafe option.

    Yeah, that's pretty much what I'm referring to.

    Steve: Yes, SQL and the OS will handle it, but at a physical level, whatever drive(s) your log is on, you're going to be thrashing the heads around like crazy if you're backing up, clearing, and writing the log file all at the same time.

    Normally, that wouldn't bother me. But when you're specifically deleting millions of rows of related data from multiple tables, and doing so repeatedly, it just seems to me that being conservative on the thrashing makes sense. Maybe it's overkill on my part, but I'm tending to think of the backup and log-clearing on this one as part of the overall transactional overhead.

    It would probably be worthwhile to test both ideas on a few dozen iterations, then go with whichever will get the most rows handled per minute.
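
    A minimal sketch of that kind of in-loop check, for illustration only - the 75% threshold, database name, and backup path below are placeholders, not anything from this thread:

    -- At the top of each delete iteration: check log space for the current database
    -- and back up the log if it has passed the chosen threshold.
    CREATE TABLE #LogSpace
    (
        DatabaseName    sysname,
        LogSizeMB       float,
        LogSpaceUsedPct float,
        Status          int
    );
    INSERT INTO #LogSpace
        EXEC ('DBCC SQLPERF(LOGSPACE)');

    IF EXISTS (SELECT 1 FROM #LogSpace
               WHERE DatabaseName = DB_NAME()
                 AND LogSpaceUsedPct > 75.0)                    -- placeholder threshold
    BEGIN
        BACKUP LOG [MyDatabase]                                 -- placeholder database
            TO DISK = N'X:\SQLBackups\MyDatabase_log.trn';      -- placeholder path
    END

    DROP TABLE #LogSpace;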

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Is there a way to set up a general alert for when any database on the system reaches a certain log % full?

    I'd love to do this... but I've got 2000 databases per SQL instance. I'd prefer not to have to set up 2000 alerts and 2000 startable jobs.

    Rather, set up a single alert so that when any database hits 75% capacity on the t-log, it just backs them all up 🙂

    Just not thrilled with setting up 2000 alerts.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • mtassin (5/5/2009)


    Is there a way to set up a general alert for when any database on the system reaches a certain log % full?

    I'd love to do this... but I've got 2000 databases per SQL instance. I'd prefer not to have to set up 2000 alerts and 2000 startable jobs.

    Rather, set up a single alert so that when any database hits 75% capacity on the t-log, it just backs them all up 🙂

    Just not thrilled with setting up 2000 alerts.

    If you're going to do that, why bother with an alert? Why not just fire off a backup command at that point?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I would just run the backup, as Gus noted, but I'd log it somewhere as well, maybe for review the next day. Good to know these things are happening, and it might be a trigger to order more disk space, grow the size of the log/data files, etc.

  • There is a _Total instance for the counter, but I'm not at all sure what it is actually counting. :unsure:

    I would feed the results of DBCC SQLPERF(LOGSPACE) into a table, query the table, and back up any database log reporting > 75% space used.

    This would be a last resort, if the normal backup strategy was still allowing the occasional log bloat (during reindexing, say).
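
    A rough sketch of that approach (again, the 75% threshold and the backup path are placeholders); it skips databases in simple recovery, whose logs can't be backed up:

    CREATE TABLE #LogSpace
    (
        DatabaseName    sysname,
        LogSizeMB       float,
        LogSpaceUsedPct float,
        Status          int
    );
    INSERT INTO #LogSpace
        EXEC ('DBCC SQLPERF(LOGSPACE)');

    DECLARE @db sysname, @sql nvarchar(1000);

    DECLARE db_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT DatabaseName
        FROM #LogSpace
        WHERE LogSpaceUsedPct > 75.0
          AND DATABASEPROPERTYEX(DatabaseName, 'Recovery') <> 'SIMPLE';

    OPEN db_cur;
    FETCH NEXT FROM db_cur INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'BACKUP LOG ' + QUOTENAME(@db)
                 + N' TO DISK = N''X:\SQLBackups\' + @db + N'_log.trn'';';   -- placeholder path
        EXEC (@sql);
        FETCH NEXT FROM db_cur INTO @db;
    END

    CLOSE db_cur;
    DEALLOCATE db_cur;
    DROP TABLE #LogSpace;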

    ---------------------------------------------------------------------

  • I would try my best to avoid the cursor method as well. If the vendor stored procedure is not encrypted, a .NET procedure, etc., you should be able to open it and see what they are doing. Then convert it to a set-based delete rather than one driven by a parameter. Then, like some of the others said, you can use a temp table (or the table you already have) to join on the IDs you want to delete. Just set up a WHERE condition to limit the number removed per batch, using an ID range or something similar.
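
    A minimal sketch of that pattern, using hypothetical names (dbo.ChildTable, #IdsToDelete) and TOP as one possible way to cap each batch:

    -- Delete child rows in batches of 5000 by joining to the list of ids to remove,
    -- repeating until nothing is left to delete.
    WHILE 1 = 1
    BEGIN
        DELETE TOP (5000) c
        FROM dbo.ChildTable AS c
        INNER JOIN #IdsToDelete AS i
            ON i.Id = c.ParentId;

        IF @@ROWCOUNT = 0
            BREAK;

        -- a log backup or CHECKPOINT could go here between batches
    END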

  • Steve Jones - Editor (5/5/2009)


    I would just run the backup, as Gus noted, but I'd log it somewhere as well, maybe for review the next day. Good to know these things are happening, and it might be a trigger to order more disk space, grow the size of the log/data files, etc.

    Why not just query the backup tables in msdb for that?
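
    For illustration, something along these lines pulls recent log backups out of msdb's history tables:

    SELECT bs.database_name,
           bs.backup_start_date,
           bs.backup_finish_date,
           bs.backup_size / 1048576.0 AS backup_size_mb
    FROM msdb.dbo.backupset AS bs
    WHERE bs.type = 'L'                                        -- 'L' = transaction log backup
      AND bs.backup_finish_date >= DATEADD(day, -1, GETDATE())
    ORDER BY bs.backup_finish_date DESC;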

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • You could do that, but what if something else happened there? I'll admit it's not a bad solution, and it's not likely something will happen, but I prefer things like this to get logged by the process that does them, and then you have a report that runs against the logs.

    I've avoided querying msdb since I set limits on how much gets logged there, and I never wanted to trust those tables for backup/restore data. I prefer to get the data from the actual files on the file system (if they're gone, they're useless, no matter what msdb says) and read the files to get the header/details.
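
    Reading the file directly would look something like this (the path is a placeholder):

    -- Pull the header and file details straight from the backup file on disk.
    RESTORE HEADERONLY   FROM DISK = N'X:\SQLBackups\MyDatabase_log.trn';
    RESTORE FILELISTONLY FROM DISK = N'X:\SQLBackups\MyDatabase_log.trn';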

  • george sibbald (5/5/2009)


    There is a _Total instance for the counter, but I'm not at all sure what it is actually counting. :unsure:

    I would feed the results of DBCC SQLPERF(LOGSPACE) into a table, query the table, and back up any database log reporting > 75% space used.

    This would be a last resort, if the normal backup strategy was still allowing the occasional log bloat (during reindexing, say).

    I may have to go that route... mostly it's to handle exactly that. Most of our logs don't grow, and backing them up takes about 3 minutes tops (usually less than 1:15). But I'd love to have an insurance policy set up.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • dmc (5/5/2009)


    I would try my best to avoid the cursor method as well. If the vendor stored procedure is not encrypted, a .NET procedure, etc., you should be able to open it and see what they are doing. Then convert it to a set-based delete rather than one driven by a parameter. Then, like some of the others said, you can use a temp table (or the table you already have) to join on the IDs you want to delete. Just set up a WHERE condition to limit the number removed per batch, using an ID range or something similar.

    Using a loop to avoid using a cursor doesn't actually gain you anything. They have the same drawbacks.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • You are correct. I am not saying to do a WHILE loop or something similar. I was going with the idea of taking the code the vendor provided, so you make sure you hit all the needed tables and handle everything properly. Too often I find the vendor does something a little "odd" that you would not do in a normal database task, so reviewing their code ensures you meet the needs of the application tied to the db. Then delete as a set-based action, based on a join to the table of IDs to remove.

    You could do all 4000 at once, or do subsets; I would probably do a smaller set like 250-500, or maybe even 1000. Review the results, ensure the app runs, etc. Then back up your logs, shrink your logs, check disk space, and make sure everything is A-OK. Then run it for another subset and repeat. There does not seem to be a need to automate a loop process with automatic log-shrink code, etc., unless this is a scheduled periodic process. My impression is that this is more of a once-in-a-great-while archive/purge process.

  • Reviewing the code does make some sense. You might be able to make something more efficient.

    But if the proc does what's needed, why not just use it as-is? Setting something up to run through the items you need and run the proc for each one should only take a couple of minutes. Adding in any needed logging and backup operations would add a minute or two to that. Then you just run it and let it go.

    I'm lazy. Something like that, for a one-off solution, makes sense to me. Why spend hours (potentially) reviewing their proc, validating it, writing a replacement, testing the replacement, and then setting up a loop to run through the replacement? More work, same end result, so far as I can see.
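
    A rough sketch of that wrapper, with hypothetical names for the vendor proc (dbo.usp_VendorPurge), the id list (dbo.IdsToPurge), and a logging table (dbo.PurgeLog):

    DECLARE @Id int;

    DECLARE purge_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT Id FROM dbo.IdsToPurge;

    OPEN purge_cur;
    FETCH NEXT FROM purge_cur INTO @Id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC dbo.usp_VendorPurge @Id;              -- the vendor's own delete logic, as-is

        INSERT INTO dbo.PurgeLog (Id, PurgedAt)    -- simple audit trail of what ran
        VALUES (@Id, GETDATE());

        -- the log-space check from earlier in the thread could sit here

        FETCH NEXT FROM purge_cur INTO @Id;
    END

    CLOSE purge_cur;
    DEALLOCATE purge_cur;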

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Agreed. I guess it's difficult to say what to do unless you are intimate with the situation. Maybe the code is as simple as some deletes driven by a parameter in a where clause; then would it not make sense to convert it to set-based in a matter of minutes and run it that way rather than through a cursor? Maybe the code in the sp is very complex, possibly encrypted, etc., and then it would of course not make sense to try to re-write it. In that case it would make more sense to loop it with a cursor or the like, as that would be time better spent.

    Maybe the solution is not to worry about log size, but rather to take the db out of service during a normal maintenance period, put it in simple recovery mode, and run your deletes. Or maybe it's too mission critical and maintenance time is not allowed.

    I think my point stands and is valid as an option: avoid the cursor and make it a faster (or most likely faster) set-based operation. Is it the right way? Maybe not. But the solution is as valid as the cursor-based solutions, given the information available, and that's why I provided it, as another alternative for the original poster to evaluate and base his/her decision upon. I have deleted millions of rows from various databases, and how I do it is always different. The criteria of uptime, integration points, database recovery model, server speed, etc. all come into the decision.

    I can't say what the best way is, but I find that, as a general rule, if I can make something set-based rather than cursor/loop-based, I am better off for going that route.

  • GSquared (5/5/2009)


    Reviewing the code does make some sense. You might be able to make something more efficient.

    But if the proc does what's needed, why not just use it as-is? Setting something up to run through the items you need and run the proc for each one should only take a couple of minutes. Adding in any needed logging and backup operations would add a minute or two to that. Then you just run it and let it go.

    I'm lazy. Something like that, for a one-off solution, makes sense to me. Why spend hours (potentially) reviewing their proc, validating it, writing a replacement, testing the replacement, and then setting up a loop to run through the replacement? More work, same end result, so far as I can see.

    I like the way you think.

    Yeah, I think I would go with this approach - use their code and bracket it with some log file backups to avoid blowing up the disks.

    As far as a clustered index is concerned - this is where I part with the poster. I don't see a problem with putting a clustered index on a vendor-supplied table, and in this case it may save you a lot of hassle during this process. Look at it this way: you can always temporarily throw it on the table for this process and drop it after you're done. Of course, a good spot check with the vendor may be in order, as well as a quick run-through in a test area to make sure you're not shooting yourself in the foot...
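
    Something along these lines, with hypothetical names, if the table really has no clustered index:

    -- Add a clustered index just for the purge, then drop it once the deletes are done.
    CREATE CLUSTERED INDEX IX_ChildTable_ParentId
        ON dbo.ChildTable (ParentId);

    -- ... run the batched deletes ...

    DROP INDEX IX_ChildTable_ParentId
        ON dbo.ChildTable;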

  • dmc (5/5/2009)


    Agreed. I guess it's difficult to say what to do unless you are intimate with the situation. Maybe the code is as simple as some deletes driven by a parameter in a where clause; then would it not make sense to convert it to set-based in a matter of minutes and run it that way rather than through a cursor? Maybe the code in the sp is very complex, possibly encrypted, etc., and then it would of course not make sense to try to re-write it. In that case it would make more sense to loop it with a cursor or the like, as that would be time better spent.

    Maybe the solution is not to worry about log size, but rather to take the db out of service during a normal maintenance period, put it in simple recovery mode, and run your deletes. Or maybe it's too mission critical and maintenance time is not allowed.

    I think my point stands and is valid as an option: avoid the cursor and make it a faster (or most likely faster) set-based operation. Is it the right way? Maybe not. But the solution is as valid as the cursor-based solutions, given the information available, and that's why I provided it, as another alternative for the original poster to evaluate and base his/her decision upon. I have deleted millions of rows from various databases, and how I do it is always different. The criteria of uptime, integration points, database recovery model, server speed, etc. all come into the decision.

    I can't say what the best way is, but I find that, as a general rule, if I can make something set-based rather than cursor/loop-based, I am better off for going that route.

    Set-based is, generally speaking, better than loops (of whatever sort).

    Considering that this is SQL Server, it wouldn't even matter if the proc were encrypted, I'd be able to figure out a set-based way to do the archival/delete process. Build it from the table schemas if I had to.

    The problem with that is that, when handling millions of rows in multiple tables, it's often more efficient to NOT do a full-on set-based solution. Not because SQL Server can't handle that, but because it can do horrible things to your hardware.

    I've had cases where deleting as few as a million rows would take literally days, and would expand both tempdb and the log files by dozens of gigabytes, but where deleting a thousand iterations of a thousand rows each would take only a few minutes, have no measurable impact on tempdb, and the log files could be handled by either Simple Recovery, or intermittent log backups.

    In that case, there is zero advantage to using "something else" vs using a cursor, so far as I can tell.

    How you loop the process is immaterial. Whether you loop it or not is the real question.

    If you're going to do the whole thing as a single large transaction, which is what "set based" pretty much boils down to, then working out how to do that, either from the schema or from a vivisection of the proc, would definitely be in order.

    If you're going to have to break it up into smaller chunks, for whatever reason, why reinvent the wheel at all? Take the existing proc, fire it into a cursor, add whatever log file handling you need, fire it off, and get onto your next job or take a break, and let it run. Monitor as needed.

    Building more than that, in this kind of case, is pretty much, in my opinion, just coding for the sake of coding. I don't see the ROI in it.

    On the other hand, if you do vivisect the proc, and find that it's cursoring its way through the sub-tables, one row at a time, then stay the heck away from that piece of junk and build your own solution, definitely! Then use a cursor to fire that proc, just as you would for the extant proc.

    If that's what you're talking about, then we agree. If you're talking about avoiding a cursor in order to build some other "loop-controlled set-based solution", which is what it sounds like, then I don't see the value to that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
