November 30, 2007 at 7:29 am
I have a staging area where I have to delete massive amounts of data. Currently I am deleting it the traditional way: a DELETE statement executed at the top level, with the rest taken care of by cascading deletes through the rest of the schema. I would like to optimize this.
I know TRUNCATE TABLE is minimally logged (it logs page deallocations rather than individual rows), but my deletes are row based, not for the whole table, so that isn't an option for me. Is there a way to make DELETE act like TRUNCATE TABLE and bypass the transaction log for the delete to speed up the operation?
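For reference, the setup looks roughly like this (table, column, and variable names are made up, just to show the cascading arrangement):

-- Parent staging table
CREATE TABLE dbo.StageBatch
(
    BatchID INT NOT NULL PRIMARY KEY
);

-- Child table referencing the parent with ON DELETE CASCADE
CREATE TABLE dbo.StageDetail
(
    DetailID INT NOT NULL PRIMARY KEY,
    BatchID  INT NOT NULL
        REFERENCES dbo.StageBatch (BatchID) ON DELETE CASCADE
);

-- Current cleanup: delete at the top level and let the cascade handle the children
DECLARE @BatchID INT;
SET @BatchID = 42;  -- example value

DELETE FROM dbo.StageBatch
WHERE BatchID = @BatchID;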
November 30, 2007 at 8:23 am
Chris Gierlack (11/30/2007)
I have a staging area where I have to delete massive amounts of data. Currently I am deleting it the traditional way: a DELETE statement executed at the top level, with the rest taken care of by cascading deletes through the rest of the schema. I would like to optimize this. I know TRUNCATE TABLE is minimally logged, but my deletes are row based, not for the whole table, so that isn't an option for me. Is there a way to make DELETE act like TRUNCATE TABLE and bypass the transaction log for the delete to speed up the operation?
Unfortunately, with DELETE there is not much you can do. Every deleted row results in an entry in the tlog, in any recovery model. If the number of rows you keep is smaller than the number you delete, you may want to consider copying the rows you want to keep into a new table and then swapping it for the original.
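Something along these lines; the table and column names are only placeholders, and you would need to recreate any indexes, constraints, and permissions on the new table after the swap:

DECLARE @OwnerToPurge INT;
SET @OwnerToPurge = 42;  -- example value

-- Copy only the rows you want to keep (SELECT INTO is minimally logged
-- under the SIMPLE or BULK_LOGGED recovery model)
SELECT *
INTO dbo.StageDetail_keep
FROM dbo.StageDetail
WHERE OwnerID <> @OwnerToPurge;

-- Swap the new table in for the old one
DROP TABLE dbo.StageDetail;
EXEC sp_rename 'dbo.StageDetail_keep', 'StageDetail';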
Regards,
Andras
November 30, 2007 at 8:27 am
Another thought: do not forget about the indexes 🙂 If you are deleting a large number of rows, you may want to disable the indexes that are not needed for the delete, and rebuild them afterwards.
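For example (the index name is made up; note that you cannot disable the clustered index without making the table inaccessible):

-- Disable a nonclustered index that the delete does not need
ALTER INDEX IX_StageDetail_OwnerID ON dbo.StageDetail DISABLE;

-- ... run the big delete here ...

-- Rebuild it afterwards (this also re-enables it)
ALTER INDEX IX_StageDetail_OwnerID ON dbo.StageDetail REBUILD;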
Regards,
Andras
November 30, 2007 at 8:35 am
Within my cascading deletes, will the clustered indexes that back the foreign key relationships speed up the delete, or would dropping the clustered indexes improve it?
November 30, 2007 at 9:02 am
My guess is the clustered index is faster, but it's a guess. It provides a more ordered way for the optimizer to find the rows and delete them, but it probably isn't significantly faster. Be an interesting test.... (hmm)
If you can do this while things are quiet or there isn't a lot of other activity, set the db to simple recovery mode and run small batches of deletes. This will keep the T-log from growing too crazily. When you're done, set it back to full and run a full backup immediately.
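Something like this, assuming the database is called StagingDB and you can afford to break the log chain while the purge runs:

-- Switch to SIMPLE recovery so log space gets reused between batches
ALTER DATABASE StagingDB SET RECOVERY SIMPLE;

-- ... run the deletes in small batches here ...

-- Switch back and re-establish the backup chain right away
ALTER DATABASE StagingDB SET RECOVERY FULL;
BACKUP DATABASE StagingDB TO DISK = 'D:\Backups\StagingDB.bak' WITH INIT;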
November 30, 2007 at 10:53 am
Remember that the foreign key constraints in your cascading deletes make them slower. When you delete a row from a parent table, its PK value has to be checked against every referencing foreign key (and cascaded where needed) to make sure you did not violate the integrity.
If you have a big process and are dropping indexes, you may also want to drop the foreign keys.
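Roughly like this (constraint and table names are examples); re-adding the constraint WITH CHECK makes SQL Server re-validate the existing rows so the constraint stays trusted:

-- Drop the constraint before the big purge
ALTER TABLE dbo.StageDetail DROP CONSTRAINT FK_StageDetail_StageBatch;

-- ... delete manually, children first, then the parent ...

-- Put it back afterwards
ALTER TABLE dbo.StageDetail WITH CHECK
    ADD CONSTRAINT FK_StageDetail_StageBatch
    FOREIGN KEY (BatchID) REFERENCES dbo.StageBatch (BatchID)
    ON DELETE CASCADE;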
November 30, 2007 at 12:02 pm
Good point.....this is all being done in a staging area so the indexes/foreign keys that would apply to my OLTP environment don't necessarily apply here....
November 30, 2007 at 2:26 pm
Heh... You do DELETEs? In a "staging area"? I'd just mark the rows with a flag of some sort and tell my procs to ignore those rows.
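Something like this (the flag column and names are whatever suits your schema):

DECLARE @OwnerToPurge INT;
SET @OwnerToPurge = 42;  -- example value

-- Mark instead of delete
UPDATE dbo.StageDetail
SET IsPurged = 1
WHERE OwnerID = @OwnerToPurge;

-- ... and the procs just filter the flag out
SELECT DetailID, BatchID
FROM dbo.StageDetail
WHERE IsPurged = 0;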
--Jeff Moden
Change is inevitable... Change for the better is not.
November 30, 2007 at 2:54 pm
Yeah, I already do that....but when you're dealing with hundreds of millions of records and finite resources, you have to delete them sometime.....and when the garbage man comes around to remove the records flagged for deletion, you don't want the server running ragged for hours at a time.....
November 30, 2007 at 3:35 pm
Ummmm... OK... Guess I don't understand... why are you keeping any rows for a long period of time in a "staging area"... all of that data should be fair game for TRUNCATE. Keeping rows in a staging area seems to defeat the purpose of a staging area.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 1, 2007 at 4:21 pm
I'm not keeping the rows. There is a steady stream of large amounts of data from different sources coming into the staging area. The end user applies some transformations to their data and I cannot delete it until they are done. It is here that I have the issue....I have to remove the staged data that is only logically separated within the same set of staging tables so I don't affect another user's data.
If each user could have their own staging area and/or set of tables, I would be using truncate or even drop/create and there would be no issues.... but as it is, I have to carve out the data with delete statements and doing this is expensive....
It's like SQL needs an equivalent of Bulk Insert for bulk removal. Bulk insert is so fast with minimal transaction logging.....if there were the same thing for delete, my life would be that much easier. Alas, deleting by row/primary key is not nearly the same operation as a bulk insert, so I realize it just isn't going to happen, but I thought someone out there might have found a way to delete without the overhead....
I can run a job at night that does this and have my processes ignore the data flagged for deletion but this won't hold up as the database gets into the TB range.....
December 1, 2007 at 5:16 pm
but when you're dealing with hundreds of millions of records and finite resources, you have to delete them sometime.....and when the garbage man comes around to remove the records flagged for deletion, you don't want the server running ragged for hours at a time.....
Ok... you have "Hundreds of millions of records"... In the staging area? And, how many of them do you need to delete in a 24 hour period?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 2, 2007 at 5:03 am
Get rid of the FKs - do your cascaded deletions manually. You could also try to do the deletes in smaller lots using a loop and SET ROWCOUNT - it's usually faster (never formally tested by me though!) to delete 1,000,000 records by deleting 10,000 records 100 times than by deleting all 1,000,000 at once.
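Roughly like this (untested, and the names are placeholders):

DECLARE @OwnerToPurge INT;
SET @OwnerToPurge = 42;  -- example value

SET ROWCOUNT 10000;   -- limit each DELETE to 10,000 rows

WHILE 1 = 1
BEGIN
    DELETE FROM dbo.StageDetail
    WHERE OwnerID = @OwnerToPurge;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
END

SET ROWCOUNT 0;   -- always reset, or it affects later statements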
You could also try to rejig your schema as a permanent solution? 🙂 (I'm sure you've thought of this but are hoping to avoid the pain - don't blame you for posing the question).
December 2, 2007 at 7:55 am
I've done some testing, and if you put an exclusive lock on the table you're deleting from and leave only the PK (doesn't matter if clustered or not), it takes 15 seconds to find and delete 500,000 rows from a million-row table... and only 3 seconds with no indexes whatsoever. I agree... cascading deletes will probably be slower than specific deletes.
Transaction log doesn't grow by much during the 500,000 row delete test when the recovery model is SIMPLE. You can always do a BACKUP LOG WITH TRUNCATE_ONLY in between deletes.
If you can't get away with any of that, then Ian's suggestion of building a loop to delete smaller chunks is the way to go... that's why I asked for an estimate on how many rows to delete per day... you could build an "all day crawler" that deletes some rows, waits ten seconds for other processes, deletes, waits, etc. If you use exclusive table locks during the deletes, the deletes will happen very quickly with virtually no chance of blocking conflicts... the crawler will wait until it can establish a lock, do its deed, and get out in a hurry. Others may see only a minor blip of wait every once in a while.
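The crawler could look something like this; the batch size, the delay, and the IsPurged flag are just placeholders to show the idea:

-- Delete flagged rows in small exclusive-lock bites, pausing between bites
WHILE EXISTS (SELECT 1 FROM dbo.StageDetail WHERE IsPurged = 1)
BEGIN
    DELETE TOP (10000)
    FROM dbo.StageDetail WITH (TABLOCKX)
    WHERE IsPurged = 1;

    WAITFOR DELAY '00:00:10';   -- let other processes get some work in
END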
I've not heard of a "staging area" that is quite so complex and difficult to do deletes from. Certainly, I've not heard of one that ends up with hundreds of millions of rows "stuck" in it that need to be deleted... I'd suggest taking a step back and consider a redesign... should be able to do truncates followed by drops in a staging area.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 4, 2007 at 6:07 pm
Thanks for the suggestions. I've put together several scenarios and will be testing them over the next couple of weeks. In an isolated environment the tablock works great, but I'm a little worried about what will happen when several sources are using the tables at once while the deletes are running. I need to set up some performance tests with this method and see if I can blend your and Ian's suggestions together. I like deleting in chunks and on set intervals. If I can find a happy medium where it's cleaning up off hours and still keeping the db size manageable versus incoming traffic, then I'll be golden.
You're right, this is not a typical staging area....but that's definitely what it is. Imagine you have thousands of customers that have been using local data stores all over the world for years and now will be converting to a centralized solution and of course they want all of their data to come with them. The data can't come directly into the OLTP db for many reasons so you're forced to deal with it in the staging area and then move it when the customer is ready....
It's just a good thing there is bulk insert or the other side of the process would be a nightmare too....