How to Delete Large Amounts of Data

  • hmbacon wrote:

    I believe that the fast-forward cursor in SQL Server is a great tool to use if, and only if, multiple actions need to be taken on each record returned. 


    Opening Pandora's box on this one.  I can't remember the last time I used one.  Almost all code that needs one can be resolved with a set-based solution, but there are outliers.

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • I indicated that the need for a cursor is rare, but you need to be willing to use one.  If you need to update 6 tables based on the data in a 7th table, SQL Server does not have a multi-table update statement.  How would you do this with one set-based statement?

  • I would not fixate on sticking to one set-based statement as much as I would on transactions. Each loop iteration in a cursor can be thought of as a transaction, whereas I could update multiple tables with parallel transactions from one source (no exclusive lock needed).

    Such a use case would be updating multiple test environments to be in sync.

    ----------------------------------------------------

  • @hmbacon: you could either do 6 UPDATEs, each joining the base table, or - if the SELECT against the base table is slow or complex - write the IDs that need to be updated into a #tmp table and join the #tmp (see the sketch below).
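
    A minimal sketch of that #tmp approach, with hypothetical object names (dbo.BaseTable, dbo.Target1, dbo.Target2 and their columns are placeholders):

    -- Capture the keys (and any values needed) from the base table once.
    SELECT bt.id, bt.new_value
    INTO #to_update
    FROM dbo.BaseTable AS bt
    WHERE bt.needs_update = 1;

    -- Then one UPDATE per target table, each joining the small #tmp.
    UPDATE t1
    SET    t1.some_col = u.new_value
    FROM   dbo.Target1 AS t1
    JOIN   #to_update  AS u ON u.id = t1.base_id;

    UPDATE t2
    SET    t2.some_col = u.new_value
    FROM   dbo.Target2 AS t2
    JOIN   #to_update  AS u ON u.id = t2.base_id;

    -- ...repeat for the remaining target tables, then clean up.
    DROP TABLE #to_update;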

    @mike01:

    When you have to work with 100 million rows you could do it with one set-based statement that runs 3 hours (bad when it fails for whatever reason), or break it up into smaller chunks of maybe 500k rows. Of course it would cripple your system if you wrote a cursor that performs single-row actions, or batches of only 10 rows in this case, but the "smoother" running of a well-sized batch loop vs. one gigantic monolithic statement often makes it a viable solution.

    God is real, unless declared integer.

  • MMartin1 wrote:

    Each loop in a cursor can be thought of as a transaction

    "Can be thought of"? Are you saying that if a loop in a cursor contains multiple actions, and any one of those actions fails, all of the other actions are rolled back? The following code suggests not:

    DROP TABLE IF EXISTS #TranTest;

    CREATE TABLE #TranTest
    (
        n INT NOT NULL
    );

    -- Seed three rows.
    INSERT #TranTest (n)
    VALUES
        (1)
    ,   (2)
    ,   (3);

    DECLARE @n INT;

    DECLARE test_cursor CURSOR STATIC FORWARD_ONLY READ_ONLY LOCAL FOR
        SELECT tt.n
        FROM #TranTest tt;

    OPEN test_cursor;

    BEGIN TRY
        FETCH NEXT FROM test_cursor INTO @n;

        WHILE (@@FETCH_STATUS = 0)
        BEGIN
            --BEGIN TRAN;

            INSERT #TranTest (n)
            VALUES (@n + 10);

            -- Force an error on the second row.
            IF @n = 2
                PRINT 1 / 0;

            --COMMIT;

            FETCH NEXT FROM test_cursor INTO @n;
        END;

        CLOSE test_cursor;
        DEALLOCATE test_cursor;
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0
            ROLLBACK;

        SELECT *
        FROM #TranTest tt;

        CLOSE test_cursor;
        DEALLOCATE test_cursor;
    END CATCH;

    If you uncomment the BEGIN TRAN and COMMIT lines and run this, the value '12' does not appear in the final result set.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Thank you for that piece of information. However, my comment centered specifically on the post about inserting into multiple tables from one source, and how that can be done in parallel (each insert its own transaction). This is similar, though not identical, to using a cursor to achieve the same outcome. Hence the abstraction "can be thought of" in this context.

    ----------------------------------------------------

  • I still like this solution when I need to delete a lot of data:

    https://michaeljswart.com/2014/09/take-care-when-scripting-batches/
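
    Broadly (this is my paraphrase rather than the article's exact code), the refinement there is to drive each batch off the clustered key instead of letting DELETE TOP (N) rescan the table on every pass; the table and column names below are placeholders:

    DECLARE @batch BIGINT = 100000;     -- rows per key-range step
    DECLARE @low BIGINT, @high BIGINT;

    -- Establish the key range that contains the rows to purge.
    SELECT @low  = MIN(Id) - 1,
           @high = MAX(Id)
    FROM dbo.BigTable
    WHERE CreateDate < DATEADD(DAY, -60, GETDATE());

    WHILE @low < @high
    BEGIN
        -- Each pass touches only one slice of the clustered key.
        DELETE dbo.BigTable
        WHERE Id > @low
          AND Id <= @low + @batch
          AND CreateDate < DATEADD(DAY, -60, GETDATE());

        SET @low = @low + @batch;
    END;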

  • If I knew I needed a regular purge job based on record creation datetime then I would be looking at whether that datetime column should be a candidate for the clustered index.

    One of the things I miss having moved to other DB platforms is that SQL Server allows a clustered index to be something other than the primary key.

    If the clustered index is on any form of sequential column then there is the possibility of WHERE column BETWEEN.  The downside to that is if the sequential column values rotate.

    At a previous employer we used partition switching to do massive deletes.

    As Jeff mentioned earlier, it's worth working out whether it would be more efficient to copy the records you need, switch tables and truncate the original.

    Back in the days before partition switching we used an approach that had physical tables aligned to date/time segments.  There was a view over the top of the tables that did a UNION ALL so the month-end job created a new table, renamed the previous tables and did an sp_refreshview.  Sounds horribly clunky now but it was a reliable process.
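
    For what it's worth, a rough sketch of that view-over-segments pattern (dbo.Sales and the segment table names are hypothetical; this shows the mechanics, not the original code):

    -- One physical table per month segment; the view always references the same names.
    CREATE VIEW dbo.Sales
    AS
    SELECT * FROM dbo.Sales_M0      -- current month
    UNION ALL
    SELECT * FROM dbo.Sales_M1      -- last month
    UNION ALL
    SELECT * FROM dbo.Sales_M2;     -- two months ago
    GO

    -- Month-end job: drop or archive the oldest segment, shuffle the others down a slot,
    -- create a fresh empty current-month table, then refresh the view's metadata.
    DROP TABLE dbo.Sales_M2;
    EXEC sp_rename 'dbo.Sales_M1', 'Sales_M2';
    EXEC sp_rename 'dbo.Sales_M0', 'Sales_M1';
    -- CREATE TABLE dbo.Sales_M0 (... same schema as the other segments ...);
    EXEC sys.sp_refreshview @viewname = N'dbo.Sales';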

  • David.Poole wrote:

    At a previous employer we used partition switching to do massive deletes.

    when you already have partitions, you could simply do a

    TRUNCATE TABLE dbo.big_stuff WITH (PARTITIONS (1 TO $partition.pf_big_stuff_function(date_key)));

    Or, instead of using 1 TO, you could provide a single partition number, which you either get via the $partition syntax or, for example, from a cursor that joins sys.partition_functions and sys.partition_range_values (boundary_id maps onto the partition number).

    If there are some rows in that range that must not be deleted, e.g. for legal reasons, you could insert them into a #tmp table and write them back after the TRUNCATE (see the sketch below).
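
    A minimal sketch of that save-and-restore idea, reusing the names from the snippet above (partition 3, the legal_hold column and date_key are placeholders):

    -- Park the rows that have to survive.
    SELECT *
    INTO #keep
    FROM dbo.big_stuff
    WHERE $PARTITION.pf_big_stuff_function(date_key) = 3   -- the partition about to be emptied
      AND legal_hold = 1;                                  -- placeholder "must keep" condition

    -- Empty just that partition (a metadata operation, so it is fast and minimally logged).
    TRUNCATE TABLE dbo.big_stuff WITH (PARTITIONS (3));

    -- Put the protected rows back (assumes no IDENTITY column complications).
    INSERT dbo.big_stuff
    SELECT *
    FROM #keep;

    DROP TABLE #keep;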

    God is real, unless declared integer.

  • Your reply reminded me of a problem I ran into long ago. Although it was Oracle, my thoughts were similar to your questions. It was a big table that became unmanageable over time. I remember the whole IT department getting a collective case of PTSD when they needed to delete a large amount of data from it, and the only time window in the year for the deletion turned out to be almost too short. It would have been a disaster if the bank had to suspend cash operations at opening time.

    How could it happen that a bank used a wrongly designed database system? Did anyone ask the proper questions at the start of the database design? Didn’t anyone expect multimillion-row tables at a large bank?

    When I read this article, I thought the same way you did. OK, here are x million rows; we must delete 30 million of them. I understand this is an imaginary situation; they don’t need to care about why it happened or the other circumstances. But you can’t answer “How to Delete Large Amounts of Data” without those answers. I have heard the criticism a few times: you spent too long analysing the problem. Maybe they were right, but I’m sure I wouldn’t start with a simple DELETE, not even with 5,000 or 10,000 rows at a time.

  • We had the same problem today.

    I wrote a script that does the following:

    • Starts a transaction
    • Retrieves the last ID number of the table
    • Uses sp_rename to rename the PK, indexes and constraints on the table
    • Uses sp_rename to rename the table itself
    • Executes the CREATE TABLE script to create a new, empty log table
    • Reseeds the Log_ID to one more than the value retrieved before renaming
    • Ends the transaction (ROLLBACK for testing, COMMIT when it runs clean)

    The script cleaned up 36 million records in a fraction of a second.
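
    A rough T-SQL sketch of that rename-and-swap pattern, using a hypothetical dbo.EventLog table with an identity column Log_ID (all object names and the CREATE TABLE body are placeholders, not the actual script):

    BEGIN TRAN;

    -- Remember where the identity left off before anything is renamed.
    DECLARE @last_id BIGINT = IDENT_CURRENT('dbo.EventLog');
    DECLARE @new_seed BIGINT = @last_id + 1;

    -- Move the existing objects out of the way.
    EXEC sp_rename 'dbo.PK_EventLog', 'PK_EventLog_Old', 'OBJECT';   -- repeat for other constraints; use 'INDEX' for non-constraint indexes
    EXEC sp_rename 'dbo.EventLog', 'EventLog_Old';

    -- Recreate an empty table with the original definition.
    CREATE TABLE dbo.EventLog
    (
        Log_ID   BIGINT IDENTITY(1, 1) NOT NULL CONSTRAINT PK_EventLog PRIMARY KEY,
        LoggedAt DATETIME2 NOT NULL,
        LogText  NVARCHAR(4000) NULL
    );

    -- Continue the identity sequence one past the old table's last value.
    DBCC CHECKIDENT ('dbo.EventLog', RESEED, @new_seed);

    ROLLBACK;   -- switch to COMMIT once it runs clean
    -- The renamed table can then be archived or dropped at leisure: DROP TABLE dbo.EventLog_Old;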

  • This is a very relevant article and far too often code is written in such a way that Log space is consumed as a result of attempting to delete too much data in one action.  Knowing the "sweet spot" in how many rows of data can be easily deleted quickly is crucial.  This will differ based on the construction of the table: number of columns, data types, amount of data per row, etc...

    Here is some simple code I use for handling such deletes.

    DECLARE @CT INT; -- number of records still to be deleted

    SET @CT = (SELECT COUNT(ID) FROM [Table] WHERE CreateDate < GETDATE() - 60);

    WHILE @CT > 0
    BEGIN
        DELETE TOP (4000) FROM [Table] WHERE CreateDate < GETDATE() - 60;

        SET @CT = @CT - 4000;
    END;

    In this scenario, 4000 is the optimal number of rows to delete in a single pass.  This simple loop will quickly and easily cycle through the data to be deleted without blowing up the LOG file.  Each cycle is fully committed, so in the case of a failure only the current batch of 4000 records would need to be rolled back.

    Yes, I fully understand there are other ways to code this, and that is what makes SQL such a great tool.  The goal is to produce the end result in an efficient manner with minimal impact.  I hope this simple tip helps you become better at your job.

  • Sorry, but this script isn't really good in my opinion.

    • Why are you counting the number of rows to delete? That is an unnecessarily big read.
    • You could write WHILE 1 = 1 and exit the endless loop with IF @@ROWCOUNT < 4000 BREAK (see the sketch after this list).
    • When there is no index on CreateDate it will do a lot of table / clustered index scans to find the rows that need to be deleted.
    • 4,000 rows is a really tiny batch size - and no, a log file of 100 MB is not big nowadays (~25 kB for each of the 4k deleted rows; most rows should be much smaller). Unless your database is sitting on an undersized server with too little disk space, slow disks and/or too many databases on it, just increase the log to a reasonable size (or let it grow by itself by setting only the max size) and use a few 100k or even 1 million rows as the batch size. If there are just a few thousand rows to be deleted, it will run just one loop. If there is really a lot of stuff to delete and your log file grows x GB over its usual size, you can shrink it again afterwards.
    • If there is no index on CreateDate you may want to get the MAX(id) first (in this case you could count too, if you need it for logging) and delete based on the id.
    • On current hardware (and with a fitting index), deleting a few 100k rows from a table should take just a few seconds, and since you are deleting old data it should lock only the old pages, so it should still be possible to write to the log table while the delete is running.
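
    A minimal sketch of that @@ROWCOUNT-driven loop, keeping the [Table] / CreateDate placeholders from the script above (the batch size is just an example):

    DECLARE @batch_size INT = 500000;   -- tune to your log size and hardware

    WHILE 1 = 1
    BEGIN
        DELETE TOP (@batch_size)
        FROM [Table]
        WHERE CreateDate < DATEADD(DAY, -60, GETDATE());

        -- A short final batch means nothing is left to delete.
        IF @@ROWCOUNT < @batch_size
            BREAK;
    END;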

    God is real, unless declared integer.

  • Thank you for your feedback, Thomas, but to clarify: my script was based on the example used in the original article.  At that time, the original author showed 4,000 to be the optimal number of records to delete in his example.  I was simply offering a valid solution based on his example.  As I stated, there are numerous ways to achieve the same result; some will just have less impact (resource-wise, performance-wise, etc.) than others.


  • Thomas Franz wrote:

    when you already have partitions, you could simply do a

    TRUNCATE TABLE dbo.big_stuff WITH (PARTITIONS (1 TO $partition.pf_big_stuff_function(date_key)));

    I leverage partitions to truncate data on large tables as well, and I love your implementation that dynamically gets the set of partition IDs to be truncated using the partition function.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
