How to batch delete from a large table?

  • Hi,

    I have a table with about 466 million rows. In this table there is an int column called WeeksToRetain as well as an EventDate column containing the date the row was inserted. I am trying to delete all the rows that should be deleted according to WeeksToRetain. For example, if the EventDate is 5/07/15 with a 1 in the WeeksToRetain column, the row should be removed by 5/14/15. I am not sure what days SQL considers the beginning and end of the week.

    However, the core issue I am having is the sheer mass of deletions I must do and the resulting log growth, so I am trying to do the delete in batches. More specifically, I want to load a temporary table with a million rows, then use it to load a sub temporary table with 100,000 rows, join that smaller table to the table I want to delete from, and loop through 10 times to work through the million. I am having trouble writing this because it's pretty much my first time doing so and I'm getting confused in the logic. Any help could include another way of doing this altogether; here is what I wrote so far, and I'm stumped.

    The Logging.EventLog table, which is the table I'm trying to purge, has a clustered index on EventDate (ASC). I would like to run this in a scheduled job with enough time between executions for log backups to run. Any help is much appreciated!

    DECLARE @i int
    DECLARE @RowCount int
    DECLARE @NextBatchDate datetime

    CREATE TABLE #BatchProcess
    (
        EventDate datetime,
        ApplicationID int,
        EventTypeID int
    )

    CREATE TABLE #SubBatchProcess
    (
        EventDate datetime,
        ApplicationID int,
        EventTypeID int
    )

    -- Check whether any records in Logging.EventLog are past their weeks to retain
    -- and insert them into the #BatchProcess table.
    INSERT INTO #BatchProcess
    SELECT TOP (1000000)
        EventDate,
        ApplicationID,
        EventTypeID
        --DATEDIFF(ww, EventDate, GETDATE()) AS WeeksOld
    FROM Logging.EventLog a
    WHERE (WeeksToRetain > 0
           AND DATEADD(ww, WeeksToRetain, EventDate) < GETDATE())
       OR WeeksToRetain IS NULL
    ORDER BY EventDate DESC

    SET @RowCount = @@ROWCOUNT
    SET @i = 0

    -- Enter the loop only if rows were inserted in the previous step.
    WHILE (@RowCount > 0 AND @i < 11)
    BEGIN
        -- Select a smaller set into the #SubBatchProcess table.
        INSERT INTO #SubBatchProcess
        SELECT TOP (100000) EventDate, ApplicationID, EventTypeID
        FROM #BatchProcess b

        IF @@ROWCOUNT > 0 -- If the set count is greater than 0, join the set to Logging.EventLog and delete it.
        BEGIN
            DELETE le
            FROM Logging.EventLog le
            INNER JOIN #SubBatchProcess b
                ON le.EventDate = b.EventDate
               AND le.ApplicationID = b.ApplicationID
               AND le.EventTypeID = b.EventTypeID

            -- Walk the table.
            SET @NextBatchDate = (SELECT TOP (1) EventDate
                                  FROM #SubBatchProcess
                                  ORDER BY EventDate ASC);

            DELETE FROM #SubBatchProcess

            -- Insert the next 100,000 rows from the #BatchProcess table.
            INSERT INTO #SubBatchProcess
            SELECT TOP (100000) EventDate, ApplicationID, EventTypeID
            FROM #BatchProcess
            WHERE EventDate < @NextBatchDate
            ORDER BY EventDate DESC

            SET @RowCount = @@ROWCOUNT
            SET @i += 1
        END
    END -- WHILE

    DROP TABLE #SubBatchProcess
    DROP TABLE #BatchProcess
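
    As a side note on the week question above: DATEADD with the week (ww) datepart simply adds 7 days per week, so the retention check does not depend on which day SQL Server treats as the start of the week. A quick illustration using the date from the example:

    -- DATEADD(ww, ...) adds exactly 7 days per week; no week-boundary rules are involved.
    SELECT DATEADD(ww, 1, '20150507');   -- 2015-05-14, matching the 5/07/15 + 1 week example
    -- DATEDIFF(ww, ...), by contrast, counts week boundaries crossed rather than full
    -- 7-day periods, which is one reason to prefer DATEADD for the retention comparison.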

  • First, how many rows are you deleting versus how many are you keeping?

    Second, have you read this article: http://www.sqlservercentral.com/articles/T-SQL/67898/

  • I haven't, I will take a look.

    I am querying for a count right now, but it will take a long time. I estimate half, so 250 Million...

  • lmacdonald (6/16/2015)


    I haven't, I will take a look.

    I am querying for a count right now, but it will take a long time. I estimate half, so 250 Million...

    In addition to the counts: can there be downtime during the purge? I assume you have a PK on the table to identify each row uniquely. Also, how many indexes are defined on the table?

    There can be minimal downtime. This table is used to log application events, so it doesn't really affect end users, but there are SQL jobs that load this table that run about every hour. It would be fine if they got a few hours behind. 43 minutes in and I still have not gotten a count back. There is one clustered index on the table, on EventDate, and it's not unique. There is no PK, which is why I am joining on several columns, and also why it takes forever to query the table; there's probably a massive scan going on. I may consider building an index that will aid my delete, but the table being so large, creating it will surely lock the table.
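
    For illustration, a sketch of a narrow covering index that would support the purge predicate and the join, so those queries can read a smaller structure instead of the whole table. The index name and options are assumptions; ONLINE = ON requires Enterprise Edition, and without it the build holds locks for its duration, so the maintenance window is the place for it:

    -- Sketch only: supporting index for the purge; name and options are illustrative.
    CREATE NONCLUSTERED INDEX IX_EventLog_Purge
        ON Logging.EventLog (EventDate, WeeksToRetain)
        INCLUDE (ApplicationID, EventTypeID)
        WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);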

  • Is there a set of columns taken together that are unique?

  • Since this is a log table does that mean the event date is always increasing?

  • I think you're adding enormous amounts of overhead to this process. How about something like below? I'm not 100% sure on the logic for EventDate being NULL, so please verify that as well.

    For efficiency, you want to process the table in cluster key order. But, if finding 50,000 rows to DELETE takes too long, naturally you can adjust that number down.

    IF OBJECT_ID('tempdb.dbo.#EventDates') IS NOT NULL
        DROP TABLE #EventDates

    CREATE TABLE #EventDates (
        EventDate datetime NULL
    )

    DECLARE @RowCount int
    DECLARE @Total_Rows_Deleted int
    DECLARE @Max_Rows_To_Delete int
    DECLARE @EventDate datetime

    SET @RowCount = 0
    SET @Total_Rows_Deleted = 0
    SET @Max_Rows_To_Delete = 100000
    SET @EventDate = '19000101'

    WHILE @Total_Rows_Deleted < @Max_Rows_To_Delete
    BEGIN
        DELETE TOP (50000) /* change # rows as needed */
        FROM el
        OUTPUT DELETED.EventDate INTO #EventDates ( EventDate )
        FROM Logging.EventLog el
        WHERE
            el.EventDate >= @EventDate AND
            ((el.WeeksToRetain IS NULL) OR
             (el.WeeksToRetain > 0 AND DATEADD(DAY, el.WeeksToRetain * 7, el.EventDate) < GETDATE()))

        SET @RowCount = @@ROWCOUNT

        IF @RowCount = 0
            BREAK;

        SET @Total_Rows_Deleted = @Total_Rows_Deleted + @RowCount

        -- Resume the next batch from the highest EventDate deleted so far,
        -- so each pass keeps walking the clustered index forward.
        SELECT @EventDate = MAX(EventDate)
        FROM #EventDates

        TRUNCATE TABLE #EventDates
    END --WHILE

    SQL DBA, SQL Server MVP (07, 08, 09)
    "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." -- "Norm", on "Cheers".
    Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies."

    Yes, the EventDate keeps increasing; new entries are added with the most recent dates. As I said, there is no unique index; there is only the index I already mentioned. However, we have a maintenance night coming up in two days and I think I will try to add a good index by then. We have the database in a staging environment, and that version of the table has different indexes; there I was able to select/delete data much faster. Thank you for that query! I am done for the day, but I'll take a look tomorrow.

  • Don't delete rows in batches of 50,000. Anything >= 5,000 may exceed the default lock escalation threshold and take a table lock. And you don't need two temporary tables when one will do. Some advice:

    - Create an index on the temp table, and use a batch separator (GO) between populating it and using it. The reason is so that the query optimiser can see there are 250 million rows in there instead of assuming there is only 1 row.

    - Then execute your DELETE TOP (2000) batches. I've seen others use a variable like TOP (@Batch_Size), which also makes the optimizer assume 1 row, so don't do that (not that you have).

    Assuming you are working on a clustered index key (and, if you're doing it by date, you were able to extract everything into the temp table without causing any blocking, however that worked out), I've seen similar code run at similarly large scale on very busy databases without any application impact (e.g. an extra few seconds on queries, but no locking, blocking, or query timeouts). If you were doing anything out of the ordinary, though, like having CDC or replication in place, or if you ran out of log space in FULL recovery, it could become more troublesome.

    Luckily we were able to test first in lower environments and find these kinks before doing the real thing.
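
    For illustration, a minimal sketch of that pattern under the assumptions in this thread: the temp table and index names are placeholders, the (EventDate, ApplicationID, EventTypeID) join key comes from the earlier posts, and the batch size is only an example.

    -- Sketch only: one indexed temp table of purge candidates, then small DELETE batches.
    SELECT EventDate, ApplicationID, EventTypeID
    INTO #PurgeKeys                    -- hypothetical temp table name
    FROM Logging.EventLog
    WHERE WeeksToRetain IS NULL
       OR (WeeksToRetain > 0 AND DATEADD(ww, WeeksToRetain, EventDate) < GETDATE());

    CREATE CLUSTERED INDEX IX_PurgeKeys ON #PurgeKeys (EventDate, ApplicationID, EventTypeID);
    GO -- batch separator so the optimizer sees the temp table's real row count

    DECLARE @rc int = 1;
    WHILE @rc > 0
    BEGIN
        DELETE TOP (2000) le           -- stays under the ~5,000-lock escalation threshold
        FROM Logging.EventLog le
        INNER JOIN #PurgeKeys pk
            ON  le.EventDate     = pk.EventDate
            AND le.ApplicationID = pk.ApplicationID
            AND le.EventTypeID   = pk.EventTypeID;
        SET @rc = @@ROWCOUNT;
    END;

    Because each batch is its own small transaction, regular log backups can run between batches and keep log growth in check if the database is in FULL recovery.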

    Here is another approach. If you are deleting 50% of the records, as mentioned in the earlier post, it could be quicker to do something like this:

    - stop the jobs writing to your table

    - create a new table, identical structure

    - copy into the new table the records you want to keep

    - drop your table and rename the new one

    The advantage is that the table will not be fragmented and the log is not affected by one large delete operation.
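
    For illustration, a rough sketch of that copy-and-swap approach. The _New/_Old names are placeholders, the keep predicate is just the negation of the purge predicate from earlier in the thread, and any indexes, constraints, and permissions would have to be re-created on the new table before the swap:

    -- Sketch only: build a replacement table containing the rows to keep, then swap names.
    -- Stop the jobs that write to Logging.EventLog before starting.
    SELECT *
    INTO Logging.EventLog_New          -- placeholder name
    FROM Logging.EventLog
    WHERE WeeksToRetain IS NOT NULL
      AND NOT (WeeksToRetain > 0 AND DATEADD(ww, WeeksToRetain, EventDate) < GETDATE());

    -- Re-create the clustered index (and any other indexes) on Logging.EventLog_New here.

    EXEC sp_rename 'Logging.EventLog', 'EventLog_Old';
    EXEC sp_rename 'Logging.EventLog_New', 'EventLog';

    -- After verifying the new table, drop the old one:
    -- DROP TABLE Logging.EventLog_Old;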

    Thanks everyone. All good tips. I was incorrect about the indexes; I must have been looking at the wrong table or server, because there is actually a non-clustered index as well, and the clustered index contains 3 columns. I'm not sure why it's so hard to pull data from this database; even small SELECT statements on the table take forever, while on the staging environment they are very fast. Anyway, you can consider this thread closed; although not resolved, I have enough to keep me going for a while.

  • codykonior (6/16/2015)


    And you don't need two temporary tables when one will do.

    You don't need any temporary tables at all. The temporary tables just add unnecessary overhead.

    DECLARE @Row_Batch INT = 50000; -- change # rows as needed

    WHILE @@ROWCOUNT > 0 -- you can also add criteria here to prevent it from running at peak times
    BEGIN
        DELETE TOP (@Row_Batch)
        FROM el
        FROM Logging.EventLog el
        WHERE ISNULL(el.WeeksToRetain, 0) > 0
          AND DATEADD(WEEK, el.WeeksToRetain, el.EventDate) < SYSDATETIME()
    END --WHILE

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Andrei Hetel (6/16/2015)


    Here is another approach. If you are deleting 50% of the records, as mentioned in the earlier post, it could be quicker to do something like this:

    - stop the jobs writing to your table

    - create a new table, identical structure

    - copy into the new table the records you want to keep

    - drop your table and rename the new one

    The advantage is that the table will not be fragmented and the log is not affected by one large delete operation.

    You still have the log entries for the insert into the new table.

  • Lynn Pettis (6/17/2015)


    Andrei Hetel (6/16/2015)


    Here is another approach. If you are deleting 50% of the records, as mentioned in the earlier post, it could be quicker to do something like this:

    - stop the jobs writing to your table

    - create a new table, identical structure

    - copy into the new table the records you want to keep

    - drop your table and rename the new one

    The advantage is that the table will not be fragmented and the log is not affected by one large delete operation.

    You still have the log entries for the insert into the new table.

    Sorry, I forgot to mention the TABLOCK hint on the target table. That will reduce logging to a minimum.
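
    For illustration, a minimal sketch of that load. Minimal logging also assumes the database is in the SIMPLE or BULK_LOGGED recovery model (in FULL recovery the insert is still fully logged) and that the target is still an empty heap at load time; the table name is the placeholder from the earlier sketch:

    -- Sketch only: TABLOCK on the target enables a bulk-load (minimally logged) insert
    -- into the new table under SIMPLE or BULK_LOGGED recovery.
    INSERT INTO Logging.EventLog_New WITH (TABLOCK)
    SELECT *                           -- use explicit column lists in real code
    FROM Logging.EventLog
    WHERE WeeksToRetain IS NOT NULL
      AND NOT (WeeksToRetain > 0 AND DATEADD(ww, WeeksToRetain, EventDate) < GETDATE());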
