How to Delete Large Amounts of Data

  • Comments posted to this topic are about the item How to Delete Large Amounts of Data

  • This is good advice, as far as it goes, but I suggest there's more that could have been covered here, including

    a) Sample loop code, including a discussion of whether to use CHECKPOINT and/or BEGIN/COMMIT TRAN (a rough sketch follows this list).

    b) More emphasis on getting the SELECT part right, throughout the loop. I've seen cases where loops started off running quickly and then slowed down to a crawl because selecting the rows to be deleted was not fully thought through.
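
    For example, here is a rough sketch of the loop from point (a), with table and column names assumed from the article's log-table scenario and an arbitrary batch size:

    DECLARE @batch INT = 100000;

    WHILE 1 = 1
    BEGIN
        BEGIN TRAN;   -- keep each batch in its own short transaction

        DELETE TOP (@batch)
        FROM dbo.log
        WHERE creation_date < DATEADD(DAY, -60, GETDATE());

        IF @@ROWCOUNT = 0
        BEGIN
            COMMIT TRAN;
            BREAK;
        END;

        COMMIT TRAN;

        CHECKPOINT;   -- aids log reuse under SIMPLE recovery; under FULL, take frequent log backups instead
    END;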

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

    The DELETE WHERE id IN (SELECT TOP x) approach is still a very bad solution. You wrote that your first example does a table scan, so I assume there is no index on the creation_date column, which means your suggested solution would still do a table scan on every loop iteration.

    Much better would be to:

    DECLARE @id INT = (SELECT MIN(l.id) FROM dbo.log AS l WHERE l.creation_date >= DATEADD(DAY, -30, GETDATE()));

    WHILE 1 = 1
    BEGIN
        DELETE TOP (100000) FROM dbo.log WHERE id < @id;
        IF @@ROWCOUNT = 0 BREAK;
    END;

    This way it would do only a single table scan (when setting @id) and would be MUCH faster after this, since it could simply delete from clustered index (on the ID).

    Another solution would be to cluster the log table by creation_date (perhaps plus the ID, if you insert more than one entry per ~3 milliseconds). Since both values are ever increasing, this will not change the sort order of the index (exception: the change from summer to winter time, where the hour between 2 and 3 occurs twice if you are not using UTC).
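
    A hedged sketch of that change, assuming the current clustered primary key is on id and is named PK_log (both names are assumptions), and remembering that the CREATE step rewrites the whole table:

    ALTER TABLE dbo.log DROP CONSTRAINT PK_log;              -- assumed constraint name

    CREATE CLUSTERED INDEX CIX_log_creation_date_id
        ON dbo.log (creation_date, id);                      -- rebuilds the table in the new order

    ALTER TABLE dbo.log
        ADD CONSTRAINT PK_log PRIMARY KEY NONCLUSTERED (id); -- id stays unique, just not clustered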

    If your table had been partitioned (unlikely for a log table), you could have used TRUNCATE TABLE ... WITH (PARTITIONS (x)), which is the fastest way to delete ALL rows of a specific partition.
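
    For reference, the SQL Server 2016+ syntax looks like this (the partition numbers are assumed for illustration):

    TRUNCATE TABLE dbo.log WITH (PARTITIONS (2));         -- a single partition
    TRUNCATE TABLE dbo.log WITH (PARTITIONS (1 TO 12));   -- or a range of partitions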

    God is real, unless declared integer.

    Have you thought about how many rows are in the table to begin with? If it's 35M and you're deleting 30M, maybe a faster approach is to insert the 5M you're keeping into a new table, then truncate the original and copy them back. Just remember that if the id column is an identity, it would need to be reseeded to 35.1M-ish.
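
    A rough sketch of that approach, with table, column, and database names assumed for illustration:

    -- 1. Copy the ~5M rows you want to keep somewhere safe.
    SELECT id, creation_date, message
    INTO   dbo.log_keep
    FROM   dbo.log
    WHERE  creation_date >= DATEADD(DAY, -60, GETDATE());

    -- 2. Remove everything, instantly and minimally logged.
    TRUNCATE TABLE dbo.log;

    -- 3. Copy the keepers back, preserving their original id values.
    SET IDENTITY_INSERT dbo.log ON;

    INSERT INTO dbo.log (id, creation_date, message)
    SELECT id, creation_date, message
    FROM   dbo.log_keep;

    SET IDENTITY_INSERT dbo.log OFF;

    -- 4. Reseed so new rows continue after the highest surviving id
    --    (belt and braces; the explicit inserts above may already have advanced the seed).
    DECLARE @maxid INT = (SELECT MAX(id) FROM dbo.log);
    DBCC CHECKIDENT ('dbo.log', RESEED, @maxid);

    DROP TABLE dbo.log_keep;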

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Here is what I have done in the past:

    SELECT TOP (50000) * 
    INTO tmpEvt
    FROM [dbo].TblStatus order by ID DESC

    TRUNCATE TABLE [dbo].TblStatus

    SET IDENTITY_INSERT [MyDB].[dbo].TblStatus ON
    GO
    INSERT INTO [MyDB].[dbo].TblStatus ( [ID]
    ,[TimeStamp]
    ,[Segment]
    ,[Machine]
    ,[Line]
    ,[Pos]
    ,[PO_ID]
    ,[HF_ID]
    ,[Job]
    ,[StatusCode]
    ,[StatusText]
    )
    SELECT [ID]
    ,[TimeStamp]
    ,[Segment]
    ,[Machine]
    ,[Line]
    ,[Pos]
    ,[PO_ID]
    ,[HF_ID]
    ,[Job]
    ,[StatusCode]
    ,[StatusText]
    FROM [MyDB].dbo.tmpEvt
    ORDER BY ID DESC

    SET IDENTITY_INSERT [MyDB].[dbo].TblStatus OFF

    DROP TABLE [MyDB].dbo.tmpEvt

    DBCC SHRINKFILE (MyDB, 100);
    GO
  • Mike01 wrote:

    Have you thought about how many rows are in the table to begin with? If it's 35M and you're deleting 30M, maybe a faster approach is to insert the 5M you're keeping into a new table, then truncate the original and copy them back. Just remember that if the id column is an identity, it would need to be reseeded to 35.1M-ish.

    THIS! for the first delete.  And if anyone is referring to a log table by log table ID, they need to have a serious bit of re-education.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I suggest there's more that could have been covered here, for example:

    1. Create a temp table containing just the IDs of the records to be deleted.
    2. Delete the records by joining to the temp table on ID; that will be much faster.
    3. Get the count from the temp table and divide the delete into smaller chunks as needed (a rough sketch follows this list).
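
    A rough sketch of that approach (table and column names are assumed from the article):

    SELECT id
    INTO   #ids_to_delete
    FROM   dbo.log
    WHERE  creation_date < DATEADD(DAY, -60, GETDATE());

    CREATE CLUSTERED INDEX cix_ids ON #ids_to_delete (id);

    WHILE 1 = 1
    BEGIN
        DELETE TOP (100000) l
        FROM dbo.log AS l
        INNER JOIN #ids_to_delete AS d
                ON d.id = l.id;

        IF @@ROWCOUNT = 0 BREAK;
    END;

    DROP TABLE #ids_to_delete;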

    Any thoughts??

  • universe_force wrote:

    I suggest there's more that could have been covered here, for example:

    1. Create a temp table containing just the IDs of the records to be deleted.
    2. Delete the records by joining to the temp table on ID; that will be much faster.
    3. Get the count from the temp table and divide the delete into smaller chunks as needed.

    Any thoughts??


    Tons!

    First, see my post above where I agree with Mike01 on how to handle the initial DELETE.  Of course, even that needs a heavy dose of "It Depends".  For example, the OP never posted what the range of dates was for the deletes.  If the 30 million row table only contains the most recent 61 days, then the copy/truncate/copy method would be the least desirable method to use.  I suspect that's not the case, though, and such a process could complete in just a few seconds, which means it could even be used in a 24/7 shop and cause only a minor delay if done properly.

    Even if that's not possible, it would be a good idea to do what I have done with my larger logging tables: collect loggable rows in a separate staging table and insert them into the main table in bulk once a minute or so.  That process also deletes the relatively small number of rows it just transferred from the staging table.  It also allows you to temporarily disable the transfer process so that you can easily continue collecting loggable rows while doing the very high speed copy/truncate/copy method for the first major set of deletes.
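
    A minimal sketch of that staging pattern, with assumed names (dbo.log_stage as the staging table); the OUTPUT clause lets one statement move the rows into the main log table and remove them from staging:

    -- Run from a job every minute or so. Note that the OUTPUT ... INTO target
    -- cannot have enabled triggers or participate in a foreign key.
    DELETE FROM dbo.log_stage
    OUTPUT deleted.creation_date, deleted.message
    INTO   dbo.log (creation_date, message);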

    Another question would be, why in the world would one have such a log table with no index on the bloody date column?  That makes no sense to me.  That would be one of the first things I'd fix before even thinking about temporal deletes, because it's going to be needed for future deletes of this kind, and it amazes me that no one querying the log table has ever complained about the performance of queries that need a date range.

    Adding that index would also allow you (as it has for me) to create a "rolling 60 day delete" that would run every night without a whole lot of hoopla or other fanfare.
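
    The one-time fix is just an index on the date column (the index name is assumed), after which both the nightly rolling delete and any date-range queries can seek instead of scan:

    CREATE NONCLUSTERED INDEX IX_log_creation_date
        ON dbo.log (creation_date)
        WITH (ONLINE = ON);   -- ONLINE requires Enterprise edition or Azure SQL; omit it elsewhere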

    Of course, you could also set up monthly or weekly partitioning on the table, and then deletes would be measured in milliseconds by using SWITCH for partitioned tables, or by just dropping a file and rebuilding a view for partitioned views (which would be my favorite).
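
    A hedged sketch of the partitioned-table variant, assuming dbo.log is partitioned by month and that a structurally identical staging table dbo.log_switch_out exists on the same filegroup:

    -- Metadata-only operation; completes in milliseconds regardless of row count.
    ALTER TABLE dbo.log
        SWITCH PARTITION 2 TO dbo.log_switch_out;

    -- The old month's rows now live in the staging table and can be discarded cheaply.
    TRUNCATE TABLE dbo.log_switch_out;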

    The bottom line is that when you need to do such a massive delete, you should also fix the wealth of sins that led up to the problem to begin with, which would also make future fixes a whole lot easier and improve performance overall.

    Personally, deleting millions of rows through a "crawler" proc such as what the article suggests is the last thing I'd do.  You don't need the data anymore.  Why would you want to log all of that data in your transaction log file and backups?  It just makes no sense to me.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Not quite the same situation, but it shares some similarities: in the article below, Brent Ozar uses views to implement deletes from very large tables in order to avoid memory grant and sort spill issues.

    https://www.brentozar.com/archive/2018/04/how-to-delete-just-some-rows-from-a-really-big-table/
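
    The pattern from that article, roughly (names assumed): the TOP ... ORDER BY lives in the view, so repeated deletes against the view nibble away the oldest rows without a huge sort or memory grant.

    CREATE OR ALTER VIEW dbo.log_oldest
    AS
    SELECT TOP (1000) *
    FROM   dbo.log
    ORDER BY creation_date;
    GO

    -- Run repeatedly (e.g. in a loop or an Agent job) until @@ROWCOUNT hits 0.
    DELETE FROM dbo.log_oldest
    WHERE creation_date < DATEADD(DAY, -60, GETDATE());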

    Better?  Worse?  As always, it likely depends on your specific situation.  Just another tool to be aware of for your toolbelt.

    This is in response to Jim's aversion to cursors. As a lead DBA, I advised my team to avoid cursors unless necessary. There are times when set-based processing is not possible, such as calling a stored procedure for a large number of rows in a table. I found team members were writing WHILE loops that kept track of the rows of some table that needed stored procedure processing. I got into seeing whether I could optimize those loops for faster processing. But then it occurred to me: are these hand-coded loops really faster than a cursor? So I compared the same work done with a cursor, and the cursor was substantially faster. I totally agree that it is worth the effort to find and implement a set-based approach whenever possible, but when it is not, and you have to loop over data in the database, cursors are easier to write and run faster than hand-coded loops, especially when you use the STATIC or FAST_FORWARD options.
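
    For what it's worth, a minimal sketch of that kind of loop written as a cursor (the procedure and work-table names are hypothetical):

    DECLARE @order_id INT;

    DECLARE order_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT order_id
        FROM   dbo.orders_to_process;   -- hypothetical work table

    OPEN order_cur;
    FETCH NEXT FROM order_cur INTO @order_id;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC dbo.process_order @order_id = @order_id;   -- hypothetical procedure
        FETCH NEXT FROM order_cur INTO @order_id;
    END;

    CLOSE order_cur;
    DEALLOCATE order_cur;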

     


    Sincerely,
    Daniel

  • As much as I deplore the use of Cursors and While loops, I have to agree with JediSQL... sometimes there actually is no other way.  You probably won't ever see me write a Cursor, though.  Heh... I never "got good at it".

    I did once try a Cursor on a "Bin Sort" problem.  I also wrote a While loop solution for the same problem.  In that particular case, the While loop solution blew the Cursor solution out of the water both for performance and resource usage.

    There's a good possibility that was just because I stink at writing Cursors, but it was a reminder to me that more than one thing must be considered.  Of course, the first question is: is an RBAR solution truly the only solution to be had?  The other question is: which method of RBAR will be the fastest and least resource intensive?

    As always, the correct answer is "It Depends". 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jim,

    Only you can answer a few questions about your article that have been nagging at me because I'd like to do some extra testing by setting up some test data.

    You mention that you're going to delete all but 60 days of data and that amounts to ~30 million rows being deleted.

    1. How many rows were left in the 60 days?
    2. How many bytes of storage space did that constitute?
    3. How many indexes did you have, including the clustered index?
    4. What was the average row size in the clustered index and in each of the non-clustered indexes?
    5. I assume that, since it was a log table, it had no FKs to other tables, nor did any other tables have FKs pointing at the log table.  Can you verify that, please?
    6. What tool did you use to measure the duration of each of your tests?
    7. What was the recovery model being used?
    8. Did you measure any of the resources being used, including but not limited to the number of log file entries and the bytes used by those entries?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I've approached this type of situation, assuming I am deleting thousands of records from the target table that still would not make up a majority of the rows, by using a loop with the WAITFOR DELAY option. I basically break up the deletes into chunks and allow time for overhead operations to finish, so the next batch is likely to start fresh. I have not seen this take more than a minute for 50k - 100k rows. Of course, different environments can lead to different results.
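
    Roughly like this (the table and column names and the pause length are assumptions):

    WHILE 1 = 1
    BEGIN
        DELETE TOP (50000)
        FROM dbo.log
        WHERE creation_date < DATEADD(DAY, -60, GETDATE());

        IF @@ROWCOUNT = 0 BREAK;

        WAITFOR DELAY '00:00:05';   -- breathing room for log backups, replication, etc. between batches
    END;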

    ----------------------------------------------------

  • I believe that the fast forward cursor of SQL Server is a great tool to use if, and only if, multiple actions need to be taken on each record returned.  This use case is rare in my opinion.  If they are being used in an OLTP database to populate multiple tables with the same data because of a high degree of denormalization, consideration should be given to going back to conceptual modeling to create a normalized database.  If multiple cursors are being used to join multiple tables in procedural code instead of the join predicate in a select statement, performance will suffer.  It is critical to tune the select statement in a cursor to ensure the optimum performance.

    Cursors should not be declared always bad just like a baseball bat should not be declared always bad.  It is how the tool is used that is good or bad.  A bat is usually good when used in baseball and usually bad when committing a crime.  But a bat could be good if it is used to prevent a crime.

    9. How many loops were used to determine the duration, and is that clock time or CPU time?

    10. Was the load on the system constant, as in a stand-alone database on a stand-alone computer?
