September 13, 2016 at 7:27 am
The Dixie Flatline (9/13/2016)
you might be able to use the false positives against an EXCEPT for final determination
Jeff, how am I going to know if it's a false positive or not, without doing a column by column comparison?
By the way, on a dedicated production-strength development server, EXCEPT took an hour and a half to handle about 2.5 million rows. The rows in question consisted of 150 columns mixed between character data, medium to long varchars, dates, and integers, with many of the columns being sparse.
Thanks, Bob...
To answer the question: If your current process returns the rows that it thinks are duplicated, just use EXCEPT to play those rows against the target to make sure that they are not false positives.
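Something along these lines, purely as an illustration (the temp table and column names are placeholders, not your actual objects): any flagged row that survives the EXCEPT has no exact match in the target, so it's a false positive.
-- Placeholder names only; list every column in both SELECTs
SELECT sd.Col1, sd.Col2, sd.Col3
FROM #SuspectedDupes sd
EXCEPT
SELECT t.Col1, t.Col2, t.Col3
FROM dbo.TargetTable t;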
On your test observation (thank you VERY much), how does that compare to the current hash method, remembering that you need to include the time it takes to build the hashes?
--Jeff Moden
Change is inevitable... Change for the better is not.
September 13, 2016 at 7:28 am
Yes, and it's a little annoying to keep hearing "if you'd designed your clustering key for this problem, it wouldn't be a problem." 🙂
You only get one clustering key, so you have to make it count. That comes from evaluating which applications are the most frequently used or the most time-critical.
Partitioning is a separate matter, although it influences the clustered key.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
September 13, 2016 at 7:40 am
Eric M Russell (9/12/2016)
4 million rows is actually not a lot, but it depends on the speed of your storage system and whether the table will need to remain online and usable while the deletion is in progress. Essentially, you want to delete the rows in batches to reduce the size of active transaction logging. The following example deletes from a table in 100,000-row batches and waits for 1 minute between each delete operation. This wait window gives the storage system additional time to keep up and prevents extended periods of blocking by allowing other user processes an opportunity to start and complete. When there are no more rows to be deleted (the last delete operation affected no rows), it breaks out of the loop. The number of rows per batch and the wait time between batches can be adjusted as needed.
WHILE 1 = 1
BEGIN;
DELETE TOP(100000) FROM MyTable WHERE <delete criteria>;
IF @@ROWCOUNT = 0 BREAK;
WAITFOR DELAY '00:01:00';
END;
Also, it helps a lot if the delete criteria are indexed.
Batch deletes like this can be crippled by physical reads. You've probably set the batch size based on performance against cached data. The first one or two batches might well have the data to be deleted in cache and will be suitably fast. After that, the deletes can be dragged to the ground by physical reads - the rate-limiting step of the process isn't how fast rows can be deleted, but how fast they can be read from disk. Cache priming can make a very significant difference to batch delete times. Uncommitted reads of the data to be deleted (covering all indexes, not just the CI), blackholed into variables, take next to no time. We have one observation of a batched delete which would have taken six weeks being cut down to a mere few hours using this method.
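As a rough sketch of the pattern only (the object names below are placeholders): a cheap uncommitted read of the doomed rows, blackholed into a variable, immediately followed by the real delete against the now-cached pages.
-- Sketch only: dbo.BigTable, its ID key, and the batch bounds are placeholders
DECLARE @FirstID INT = 1, @LastID INT = 10000, @blackhole INT;
-- 1. Non-blocking read of the rows about to be deleted, discarded into a variable
SELECT @blackhole = bt.ID
FROM dbo.BigTable bt WITH (NOLOCK)
WHERE bt.ID BETWEEN @FirstID AND @LastID;
-- 2. The delete itself now runs almost entirely against logical reads
DELETE bt
FROM dbo.BigTable bt
WHERE bt.ID BETWEEN @FirstID AND @LastID;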
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 13, 2016 at 7:42 am
coool_sweet (9/9/2016)
I have a table which has 5 million records and I need to delete 4 million in production. What is the best approach to do it?
A little feedback would be much appreciated. What kind of table is this and what are you trying to delete? What I mean by that is, is it an audit table where you're trying to "just" delete all rows from a certain date on back? If not, then please describe what the table is used for and the business rules for deleting rows because, at this point, we're just guessing and none of the proposed solutions may be correct for you.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 13, 2016 at 7:46 am
Jeff, I've got someone trying to get that information now. But by my calculations, it's doing 444 rows per second. That seems slow.
Curiously, the execution plan is choosing nested loop joins, even though the rows in both tables have identical primary keys.
The word just came back that they are doing comparable volume in about 5 minutes outside the database.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
September 13, 2016 at 7:53 am
ChrisM@Work (9/13/2016)
Eric M Russell (9/12/2016)
4 million rows is actually not a lot, but it depends on the speed of your storage system and whether the table will need to remain online and usable while the deletion is in progress. Essentially, you want to delete the rows in batches to reduce the size of active transaction logging. The following example deletes from a table in 100,000-row batches and waits for 1 minute between each delete operation. This wait window gives the storage system additional time to keep up and prevents extended periods of blocking by allowing other user processes an opportunity to start and complete. When there are no more rows to be deleted (the last delete operation affected no rows), it breaks out of the loop. The number of rows per batch and the wait time between batches can be adjusted as needed.
WHILE 1 = 1
BEGIN;
DELETE TOP(100000) FROM MyTable WHERE <delete criteria>;
IF @@ROWCOUNT = 0 BREAK;
WAITFOR DELAY '00:01:00';
END;
Also, it helps a lot if the delete criteria are indexed.
Batch deletes like this can be crippled by physical reads. You've probably set the batch size based on performance against cached data. The first one or two batches might well have the data to be deleted in cache and will be suitably fast. After that, the deletes can be dragged to the ground by physical reads - the rate-limiting step of the process isn't how fast rows can be deleted, but how fast they can be read from disk. Cache priming can make a very significant difference to batch delete times. Uncommitted reads of the data to be deleted (covering all indexes, not just the CI), blackholed into variables, take next to no time. We have one observation of a batched delete which would have taken six weeks being cut down to a mere few hours using this method.
You can also modify that code to skip the most expensive iteration of them all: the last one, where nothing is found but a full scan still has to be done looking for rows. If you're trying to delete in batches of 100K rows (for example), then any delete that affects fewer than 100K rows (the batch size) is actually the last iteration.
For the given code above, the change to skip that last iteration is pretty simple...
WHILE 1 = 1
BEGIN;
DELETE TOP(100000) FROM MyTable WHERE <delete criteria>;
IF @@ROWCOUNT < 100000 BREAK;
WAITFOR DELAY '00:01:00';
END;
It won't help with the problem that you've identified, though.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 13, 2016 at 8:09 am
The Dixie Flatline (9/13/2016)
Jeff, I've got someone trying to get that information now. But by my calculations, it's doing 444 rows per second. That seems slow. Curiously, the execution plan is choosing nested loop joins, even though the rows in both tables have identical primary keys.
The word just came back that they are doing comparable volume in about 5 minutes outside the database.
Thanks, Bob. I really appreciate the comparison.
Apparently, there's some tipping point (common with a lot of things in SQL Server) for EXCEPT. I don't know if, in your case, it's based on the number of columns, the type of columns, the number of rows, or some combination thereof. And, yeah... I agree... 444 rows per second is horrible. The 129K row/40 column thing I previously mentioned (9 seconds using EXCEPT) turns out to be a full row comparison of all rows between the two 129K-row tables at a rate of more than 14,000 rows per second. Even that is a bit slow, but it suffices for what we needed to do... especially since they haven't even tried other methods at work, yet. 😀
How many rows are you good folks comparing? And is the PK anywhere near the "front" of the row like it is in the tables I compared?
--Jeff Moden
Change is inevitable... Change for the better is not.
September 13, 2016 at 8:15 am
ChrisM@Work (9/13/2016)
Eric M Russell (9/12/2016)
4 million rows is actually not a lot, but it depends on the speed of your storage system and whether the table will need to remain online and usable while the deletion is in progress. Essentially, you want to delete the rows in batches to reduce the size of active transaction logging. The following example deletes from a table in 100,000-row batches and waits for 1 minute between each delete operation. This wait window gives the storage system additional time to keep up and prevents extended periods of blocking by allowing other user processes an opportunity to start and complete. When there are no more rows to be deleted (the last delete operation affected no rows), it breaks out of the loop. The number of rows per batch and the wait time between batches can be adjusted as needed.
WHILE 1 = 1
BEGIN;
DELETE TOP(100000) FROM MyTable WHERE <delete criteria>;
IF @@ROWCOUNT = 0 BREAK;
WAITFOR DELAY '00:01:00';
END;
Also, it helps a lot if the delete criteria are indexed.
Batch deletes like this can be crippled by physical reads. You've probably set the batch size based on performance against cached data. The first one or two batches might well have the data to be deleted in cache and will be suitably fast. After that, the deletes can be dragged to the ground by physical reads - the rate-limiting step of the process isn't how fast rows can be deleted, but how fast they can be read from disk. Cache priming can make a very significant difference to batch delete times. Uncommitted reads of the data to be deleted (covering all indexes, not just the CI), blackholed into variables, take next to no time. We have one observation of a batched delete which would have taken six weeks being cut down to a mere few hours using this method.
So, enlighten us by posting your alternate version of the above.
I've done this many times in the past, and the biggest bottleneck is the transaction log, assuming that the qualifier for row selection is indexed. In my case it's typically more like deleting 10 million rows from a 100 million row table, so I set my batch size to 1 million. What this does is drop a 2-day operation down to 2 hours or less. For the specific case where we are dealing with a relatively small 4 million row table, a straight delete may suffice. In any event, I would recommend disabling any non-clustered indexes that don't qualify the row selection, and rebuilding them afterward (a disabled index has to be rebuilt to be re-enabled).
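Roughly like this (object names are placeholders); disable the indexes that don't support the delete criteria, batch the delete, then rebuild to bring them back online:
-- Placeholder names; the pattern is disable, delete in batches, rebuild
ALTER INDEX IX_MyTable_SomeColumn ON dbo.MyTable DISABLE;
WHILE 1 = 1
BEGIN;
    DELETE TOP (1000000) FROM dbo.MyTable WHERE <delete criteria>;
    IF @@ROWCOUNT < 1000000 BREAK;
END;
ALTER INDEX IX_MyTable_SomeColumn ON dbo.MyTable REBUILD;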
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
September 13, 2016 at 8:29 am
ChrisM@Work (9/13/2016)
Cache priming can make a very significant difference to batch delete times.
Curious... isn't that just moving the problem to something else? You still have to do the reads sometime. If you cache prime each iteration, does it really help performance and resource usage? Not asking to be contrary. It's an interesting thought that a combination of SELECT and DELETE might be faster than DELETE alone.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 13, 2016 at 9:09 am
Jeff Moden (9/13/2016)
ChrisM@Work (9/13/2016)
Cache priming can make a very significant difference to batch delete times. Curious... isn't that just moving the problem to something else? You still have to do the reads sometime. If you cache prime each iteration, does it really help performance and resource usage? Not asking to be contrary. It's an interesting thought that a combination of SELECT and DELETE might be faster than DELETE alone.
We noticed that batch deletes were taking about 25 seconds for the first couple of batches then whooshed out to around five minutes after that. The deletes were against an average width table with about 800M rows, from which we wanted to delete 200M.
When we investigated the process, we noticed that the early batches were all from cache (logical reads) and the poor performance coincided with a switch to physical reads.
From an email at the time:
"Working on this stuff yesterday…batch delete runs (from BigTable table) went horribly wrong after a few passes because the data required for the delete was not in data cache. Ed noticed this so we toyed with the idea of manually dragging the data off disk and into cache with a non-blocking read so that it was in cache for the blocking delete. This method appears to work so long as the non-blocking reads are specific enough that they only lift the data we want. Lift too much and something gives. After a little tinkering, we have the query below. The non-blocking read takes about 25 seconds and the blocking delete takes about half a second, for 10k row chunks.
Some good evidence that this little trick appears to work is the STATISTICS IO output from the DELETE statement, which is consistently
Table 'BigTable'. Scan count 0, logical reads 162443, physical reads 0"
Our delete code looks like this:
DELETE bt
FROM @Temp s
INNER LOOP JOIN dbo.BigTable bt ON s.ID = bt.ID
We played around with the rowcount somewhat then put the process into production, deleting 200M rows in about 6 hours with no impact on the users. We didn't have to change our log backup frequency either - just keep an eye on the file sizes 😉
Edit: Jeff, your PM mailbox is full.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 13, 2016 at 9:59 am
Eric M Russell (9/13/2016)
ChrisM@Work (9/13/2016)
Eric M Russell (9/12/2016)
4 million rows is actually not a lot, but it depends on the speed of your storage system and whether the table will need to remain online and usable while the deletion is in progress. Essentially, you want to delete the rows in batches to reduce the size of active transaction logging. The following example deletes from a table in 100,000-row batches and waits for 1 minute between each delete operation. This wait window gives the storage system additional time to keep up and prevents extended periods of blocking by allowing other user processes an opportunity to start and complete. When there are no more rows to be deleted (the last delete operation affected no rows), it breaks out of the loop. The number of rows per batch and the wait time between batches can be adjusted as needed.
WHILE 1 = 1
BEGIN;
DELETE TOP(100000) FROM MyTable WHERE <delete criteria>;
IF @@ROWCOUNT = 0 BREAK;
WAITFOR DELAY '00:01:00';
END;
Also, it helps a lot if the delete criteria are indexed.
Batch deletes like this can be crippled by physical reads. You've probably set the batch size based on performance against cached data. The first one or two batches might well have the data to be deleted in cache and will be suitably fast. After that, the deletes can be dragged to the ground by physical reads - the rate-limiting step of the process isn't how fast rows can be deleted, but how fast they can be read from disk. Cache priming can make a very significant difference to batch delete times. Uncommitted reads of the data to be deleted (covering all indexes, not just the CI), blackholed into variables, take next to no time. We have one observation of a batched delete which would have taken six weeks being cut down to a mere few hours using this method.
So, enlighten us by posting your alternate version of the above.
I've done this many times in the past, and the biggest bottleneck is the transaction log, assuming that the qualifier for row selection is indexed. In my case it's typically more like deleting 10 million rows from a 100 million row table, so I set my batch size to 1 million. What this does is drop a 2-day operation down to 2 hours or less. For the specific case where we are dealing with a relatively small 4 million row table, a straight delete may suffice. In any event, I would recommend disabling any non-clustered indexes that don't qualify the row selection, and rebuilding them afterward (a disabled index has to be rebuilt to be re-enabled).
Eric, this is the actual proof of concept code we used, obfuscated:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DECLARE @id INT; -- blackhole variable
-- 1. Collect the IDs of the rows to delete
DECLARE @Temp TABLE (ID INT NOT NULL PRIMARY KEY, CompositeID BIGINT, CoffeeClubID INT, EndDate DATETIME, StartDate DATETIME)
INSERT INTO @Temp
SELECT TOP 10000 ID, CompositeID, CoffeeClubID, EndDate, StartDate
FROM dbo.CoffeeClubMember
ORDER BY ID DESC
-- 2. Draw selected rows into cache
SELECT @id = ccm1.ID
FROM @Temp s
INNER JOIN dbo.CoffeeClubMember ccm1 WITH (INDEX (IX_CoffeeClubMember_CompositeIDEndDateStartDate_InclCoffeeClubID))
ON s.CompositeID = ccm1.CompositeID AND (s.EndDate IS NULL OR s.EndDate = ccm1.EndDate) AND (s.StartDate IS NULL OR s.StartDate = ccm1.StartDate) AND s.ID = ccm1.ID
INNER JOIN dbo.CoffeeClubMember ccm2 WITH (INDEX (UIX_CoffeeClubMember_CoffeeClubIDCompositeIDStartEndDate))
ON s.CoffeeClubID = ccm2.CoffeeClubID AND s.CompositeID = ccm2.CompositeID AND s.ID = ccm2.ID
INNER JOIN dbo.CoffeeClubMember ccm3 WITH (INDEX (PK_CoffeeClubMember))
ON s.ID = ccm3.ID
-- 3. Run the delete
DELETE ccm
FROM @Temp s
INNER LOOP JOIN dbo.CoffeeClubMember ccm ON s.ID = ccm.ID
In practice we may have merged steps 1 and 2.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 13, 2016 at 10:06 am
ChrisM@Work (9/13/2016)
Jeff Moden (9/13/2016)
ChrisM@Work (9/13/2016)
Cache priming can make a very significant difference to batch delete times. Curious... isn't that just moving the problem to something else? You still have to do the reads sometime. If you cache prime each iteration, does it really help performance and resource usage? Not asking to be contrary. It's an interesting thought that a combination of SELECT and DELETE might be faster than DELETE alone.
We noticed that batch deletes were taking about 25 seconds for the first couple of batches then whooshed out to around five minutes after that. The deletes were against an average width table with about 800M rows, from which we wanted to delete 200M.
When we investigated the process, we noticed that the early batches were all from cache (logical reads) and the poor performance coincided with a switch to physical reads.
From an email at the time:
"Working on this stuff yesterday…batch delete runs (from BigTable table) went horribly wrong after a few passes because the data required for the delete was not in data cache. Ed noticed this so we toyed with the idea of manually dragging the data off disk and into cache with a non-blocking read so that it was in cache for the blocking delete. This method appears to work so long as the non-blocking reads are specific enough that they only lift the data we want. Lift too much and something gives. After a little tinkering, we have the query below. The non-blocking read takes about 25 seconds and the blocking delete takes about half a second, for 10k row chunks.
Some good evidence that this little trick appears to work is the STATISTICS IO output from the DELETE statement, which is consistently
Table 'BigTable'. Scan count 0, logical reads 162443, physical reads 0"
Our delete code looks like this:
DELETE bt
FROM @Temp s
INNER LOOP JOIN dbo.BigTable bt ON s.ID = bt.ID
We played around with the rowcount somewhat then put the process into production, deleting 200M rows in about 6 hours with no impact on the users. We didn't have to change our log backup frequency either - just keep an eye on the file sizes 😉
Edit: Jeff, your PM mailbox is full.
Is the rowcount setting within the queries what keeps slowing these down? Having the query keep churning until it finds *exactly* 100K records (or whatever your batch size happens to be) tends to dramatically increase the execution time.
I've always favored walking the clustered index with a moving range of clustering keys instead of the TOP syntax. Usually end up firing the query more often, but with dramatically faster times. Note: the clustered key doesn't have to support the deletion criteria, just has to be navigable.
Something like:
declare @startid integer;
declare @batchsize integer;
declare @endid integer;

set @batchsize = 100000;
set @startid = 1;

select @startid = min(column_1),
       @endid   = max(column_1)
from t1;

while @startid < @endid
begin
    delete from t1
    where
        column_1 between @startid and (@startid + @batchsize - 1) -- walking the clustering keys
        and fun % 3 = 0;  -- put the actual deletion criteria here

    set @startid = @startid + @batchsize;
end
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
September 13, 2016 at 10:15 am
Matt Miller (#4) (9/13/2016)
ChrisM@Work (9/13/2016)
Jeff Moden (9/13/2016)
ChrisM@Work (9/13/2016)
Cache priming can make a very significant difference to batch delete times. Curious... isn't that just moving the problem to something else? You still have to do the reads sometime. If you cache prime each iteration, does it really help performance and resource usage? Not asking to be contrary. It's an interesting thought that a combination of SELECT and DELETE might be faster than DELETE alone.
We noticed that batch deletes were taking about 25 seconds for the first couple of batches then whooshed out to around five minutes after that. The deletes were against an average width table with about 800M rows, from which we wanted to delete 200M.
When we investigated the process, we noticed that the early batches were all from cache (logical reads) and the poor performance coincided with a switch to physical reads.
From an email at the time:
"Working on this stuff yesterday…batch delete runs (from BigTable table) went horribly wrong after a few passes because the data required for the delete was not in data cache. Ed noticed this so we toyed with the idea of manually dragging the data off disk and into cache with a non-blocking read so that it was in cache for the blocking delete. This method appears to work so long as the non-blocking reads are specific enough that they only lift the data we want. Lift too much and something gives. After a little tinkering, we have the query below. The non-blocking read takes about 25 seconds and the blocking delete takes about half a second, for 10k row chunks.
Some good evidence that this little trick appears to work is the STATISTICS IO output from the DELETE statement, which is consistently
Table 'BigTable'. Scan count 0, logical reads 162443, physical reads 0"
Our delete code looks like this:
DELETE bt
FROM @Temp s
INNER LOOP JOIN dbo.BigTable bt ON s.ID = bt.ID
We played around with the rowcount somewhat then put the process into production, deleting 200M rows in about 6 hours with no impact on the users. We didn't have to change our log backup frequency either - just keep an eye on the file sizes 😉
Edit: Jeff, your PM mailbox is full.
Is the rowcount setting within the queries what keeps slowing these down? Having the query keep churning until it finds *exactly* 100K records (or whatever your batch size happens to be) tends to dramatically increase the execution time.
I've always favored walking the clustered index with a moving range of clustering keys instead of the TOP syntax. Usually end up firing the query more often, but with dramatically faster times. Note: the clustered key doesn't have to support the deletion criteria, just has to be navigable.
Something like:
declare @startid integer;
declare @batchsize integer;
declare @endid integer;

set @batchsize = 100000;
set @startid = 1;

select @startid = min(column_1),
       @endid   = max(column_1)
from t1;

while @startid < @endid
begin
    delete from t1
    where
        column_1 between @startid and (@startid + @batchsize - 1) -- walking the clustering keys
        and fun % 3 = 0;  -- put the actual deletion criteria here

    set @startid = @startid + @batchsize;
end
Matt, we used a table variable for a "moving range of clustering keys", at least in the POC case shown above. The statistic of one row would probably be a disaster for a hash match, but for a nested loops join it really didn't matter - the hint gave us exactly the plan we wanted.
I'd love to see someone else trying this out rather than beigely agreeing that batch deletes are always choked by tlog activity.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 15, 2016 at 11:07 am
Here's another good way to handle massive deletes... No down time for the affected table and works especially well if the table is schema-bound by another object...
Partition SWITCH...
USE tempdb;
GO
--================================================================
-- Create an "original table" and populate it...
CREATE TABLE dbo.OriginalTable (
SomeNumber INT NOT NULL,
CONSTRAINT pk_OriginalTable_SomeNumber PRIMARY KEY CLUSTERED (SomeNumber)
);
WITH -- insert 10M rows...
n (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
cte_Tally (n) AS (
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM n n1, n n2, n n3, n n4, n n5, n n6, n n7
)
INSERT dbo.OriginalTable (SomeNumber)
SELECT t.n FROM cte_Tally t;
--================================================================
-- Do a minimally logged insert of the last 1M rows into a new table and then add the PK/clustered index
SELECT TOP 1000000
ot.SomeNumber
INTO dbo.NewTable
FROM
dbo.OriginalTable ot
ORDER BY
ot.SomeNumber DESC;
ALTER TABLE dbo.NewTable ADD CONSTRAINT pk_NewTable_SomeValue PRIMARY KEY CLUSTERED (SomeNumber);
--================================================================
-- Create an "empty table" to accommodate the 1st SWITCH.
CREATE TABLE dbo.EmptyTable (
SomeNumber INT NOT NULL,
CONSTRAINT pk_EmptyTable_SomeNumber PRIMARY KEY CLUSTERED (SomeNumber)
);
--------------------
-- switch the OriginalTable to the EmptyTable...
ALTER TABLE dbo.OriginalTable SWITCH TO dbo.EmptyTable;
--------------------
-- switch the NewTable to the OriginalTable...
ALTER TABLE dbo.NewTable SWITCH TO dbo.OriginalTable;
--================================================================
-- Cleanup after yourself...
DROP TABLE dbo.NewTable;
DROP TABLE dbo.EmptyTable;
--================================================================
-- Check results
SELECT
MinVal = MIN(ot.SomeNumber),
MaxVal = MAX(ot.SomeNumber),
Row_Count = COUNT(*)
FROM
dbo.OriginalTable ot
September 15, 2016 at 1:56 pm
Late to the party, but I have had very good success with value-driven-index-seek based deletes (where possible) too.
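For example (the names here are made up), something along these lines, where the WHERE clause is supported by an index so each batch is located with a seek rather than a scan:
-- Sketch only: assumes an index on dbo.MyTable(StatusCode)
WHILE 1 = 1
BEGIN;
    DELETE TOP (100000)
    FROM dbo.MyTable
    WHERE StatusCode = 'X';   -- the value that drives the index seek
    IF @@ROWCOUNT < 100000 BREAK;
END;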
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service