How to Delete Large Amounts of Data

  • hmbacon wrote:

    I believe that the fast-forward cursor in SQL Server is a great tool to use if, and only if, multiple actions need to be taken on each record returned. 


    Opening Pandora's box on this one.  I can't remember the last time I used one.  Almost all code that needs one can be resolved with a set-based solution, but there are outliers.

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • I indicated that the need for a cursor is rare, but you need to be willing to use one.  If you need to update 6 tables based on the data in a 7th table, SQL Server does not have a multi-table update statement.  How would you do this with one set-based statement?

  • I would not fixate on sticking to one set-based statement as much as I would on transactions. Each loop iteration in a cursor can be thought of as a transaction, whereas I could update multiple tables with parallel transactions from one source (no exclusive lock needed).

    Such a use case would be updating multiple test environments to be in sync.

    ----------------------------------------------------

  • @hmbacon: you could either do 6 UPDATEs, each joining the base table, or - if the SELECT against the base table is slow or complex - write the IDs that need to be updated into a #tmp table and join the #tmp (see the sketch below).
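
    A minimal sketch of that #tmp approach, with hypothetical object names (dbo.BaseTable, dbo.Target1, dbo.Target2 and their columns are placeholders):

    -- Capture the keys (and any values needed) from the base table once.
    SELECT bt.id, bt.new_value
    INTO #to_update
    FROM dbo.BaseTable AS bt
    WHERE bt.needs_update = 1;

    -- Then one UPDATE per target table, each joining the small #tmp.
    UPDATE t1
    SET    t1.some_col = u.new_value
    FROM   dbo.Target1 AS t1
    JOIN   #to_update  AS u ON u.id = t1.base_id;

    UPDATE t2
    SET    t2.some_col = u.new_value
    FROM   dbo.Target2 AS t2
    JOIN   #to_update  AS u ON u.id = t2.base_id;

    -- ...repeat for the remaining target tables, then clean up.
    DROP TABLE #to_update;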

    @mike01:

    When you have to work with 100 million rows you could do it with one set-based statement that runs 3 hours (bad when it fails for whatever reason), or break it up into smaller chunks of maybe 500k rows. Of course it would cripple your system if you wrote a cursor that performs single-row actions, or batches of only 10 rows in this case, but the "smoother" running of a well-sized batch loop vs. one gigantic monolithic statement often makes it a viable solution.

    God is real, unless declared integer.

  • MMartin1 wrote:

    Each loop in a cursor can be thought of as a transaction

    "Can be thought of"? Are you saying that if a loop in a cursor contains multiple actions, and any one of those actions fails, all of the other actions are rolled back? The following code suggests not:

    DROP TABLE IF EXISTS #TranTest;

    CREATE TABLE #TranTest
    (
        n INT NOT NULL
    );

    -- Seed three rows.
    INSERT #TranTest (n)
    VALUES
        (1)
    ,   (2)
    ,   (3);

    DECLARE @n INT;

    DECLARE test_cursor CURSOR STATIC FORWARD_ONLY READ_ONLY LOCAL FOR
        SELECT tt.n
        FROM #TranTest tt;

    OPEN test_cursor;

    BEGIN TRY
        FETCH NEXT FROM test_cursor INTO @n;

        WHILE (@@FETCH_STATUS = 0)
        BEGIN
            --BEGIN TRAN;

            INSERT #TranTest (n)
            VALUES (@n + 10);

            -- Force an error on the second row.
            IF @n = 2
                PRINT 1 / 0;

            --COMMIT;

            FETCH NEXT FROM test_cursor INTO @n;
        END;

        CLOSE test_cursor;
        DEALLOCATE test_cursor;
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0
            ROLLBACK;

        SELECT *
        FROM #TranTest tt;

        CLOSE test_cursor;
        DEALLOCATE test_cursor;
    END CATCH;

    If you uncomment the BEGIN TRAN and COMMIT lines and run this, the value '12' does not appear in the final result set.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Thank you for that piece of information. However, my comment centered specifically on the post about inserting into multiple tables from one source, and how that can be done in parallel (each insert its own transaction). This is similar, though not identical, to using a cursor to achieve the same outcome. Hence the abstraction "can be thought of" in this context.

    ----------------------------------------------------

  • I still like this solution when I need to delete a lot of data:

    https://michaeljswart.com/2014/09/take-care-when-scripting-batches/
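
    Broadly (this is my paraphrase rather than the article's exact code), the refinement there is to drive each batch off the clustered key instead of letting DELETE TOP (N) rescan the table on every pass; the table and column names below are placeholders:

    DECLARE @batch BIGINT = 100000;     -- rows per key-range step
    DECLARE @low BIGINT, @high BIGINT;

    -- Establish the key range that contains the rows to purge.
    SELECT @low  = MIN(Id) - 1,
           @high = MAX(Id)
    FROM dbo.BigTable
    WHERE CreateDate < DATEADD(DAY, -60, GETDATE());

    WHILE @low < @high
    BEGIN
        -- Each pass touches only one slice of the clustered key.
        DELETE dbo.BigTable
        WHERE Id > @low
          AND Id <= @low + @batch
          AND CreateDate < DATEADD(DAY, -60, GETDATE());

        SET @low = @low + @batch;
    END;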

  • If I knew I needed a regular purge job based on record creation datetime then I would be looking at whether that datetime column should be a candidate for the clustered index.

    One of the things I miss having moved to other DB platforms is that SQL Server allows a clustered index to be something other than the primary key.

    If the clustered index is on any form of sequential column then there is the possibility of WHERE column BETWEEN.  The downside to that is if the sequential column values rotate.

    At a previous employer we used partition switching to do massive deletes.

    As Jeff mentioned earlier, it's worth working out whether it would be more efficient to copy the records you need, switch tables and truncate the original.

    Back in the days before partition switching we used an approach that had physical tables aligned to date/time segments.  There was a view over the top of the tables that did a UNION ALL so the month-end job created a new table, renamed the previous tables and did an sp_refreshview.  Sounds horribly clunky now but it was a reliable process.
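
    For what it's worth, a rough sketch of that view-over-segments pattern (dbo.Sales and the segment table names are hypothetical; this shows the mechanics, not the original code):

    -- One physical table per month segment; the view always references the same names.
    CREATE VIEW dbo.Sales
    AS
    SELECT * FROM dbo.Sales_M0      -- current month
    UNION ALL
    SELECT * FROM dbo.Sales_M1      -- last month
    UNION ALL
    SELECT * FROM dbo.Sales_M2;     -- two months ago
    GO

    -- Month-end job: drop or archive the oldest segment, shuffle the others down a slot,
    -- create a fresh empty current-month table, then refresh the view's metadata.
    DROP TABLE dbo.Sales_M2;
    EXEC sp_rename 'dbo.Sales_M1', 'Sales_M2';
    EXEC sp_rename 'dbo.Sales_M0', 'Sales_M1';
    -- CREATE TABLE dbo.Sales_M0 (... same schema as the other segments ...);
    EXEC sys.sp_refreshview @viewname = N'dbo.Sales';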

  • David.Poole wrote:

    At a previous employer we used partition switching to do massive deletes.

    when you already have partitions, you could simply do a

    TRUNCATE TABLE dbo.big_stuff WITH (PARTITIONS (1 TO $partition.pf_big_stuff_function(date_key)));

    Or, instead of using 1 TO, you could provide a single partition number, which you either get via the $partition syntax or, for example, from a cursor that joins sys.partition_functions and sys.partition_range_values (boundary_id maps onto the partition number).

    If there are some rows in that range that must not be deleted, e.g. for legal reasons, you could insert them into a #tmp table and write them back after the TRUNCATE (see the sketch below).
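
    A minimal sketch of that save-and-restore idea, reusing the names from the snippet above (partition 3, the legal_hold column and date_key are placeholders):

    -- Park the rows that have to survive.
    SELECT *
    INTO #keep
    FROM dbo.big_stuff
    WHERE $PARTITION.pf_big_stuff_function(date_key) = 3   -- the partition about to be emptied
      AND legal_hold = 1;                                  -- placeholder "must keep" condition

    -- Empty just that partition (a metadata operation, so it is fast and minimally logged).
    TRUNCATE TABLE dbo.big_stuff WITH (PARTITIONS (3));

    -- Put the protected rows back (assumes no IDENTITY column complications).
    INSERT dbo.big_stuff
    SELECT *
    FROM #keep;

    DROP TABLE #keep;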

    God is real, unless declared integer.

  • Your reply reminded me of a problem I ran into long ago. Although it was Oracle, my thoughts were similar to your questions. It was a big table that became unmanageable over time. I remember the whole IT department getting a collective case of PTSD when they needed to delete a large amount of data from it, and the only time window in the year for the deletion turned out to be almost too short. It would have been a disaster if the bank had to suspend cash operations at opening time.

    How could it happen that a bank used a wrongly designed database system? Did anyone ask the proper questions at the start of the database design? Didn’t anyone expect multimillion-row tables at a large bank?

    When I read this article, I thought the same way you did. OK, here are x million rows; we must delete 30 million of them. I understand this is an imaginary situation; they don’t need to care about why it happened or the other circumstances. But you can’t answer “How to Delete Large Amounts of Data” without those answers. I have heard the criticism a few times: you spent too long analysing the problem. Maybe they were right, but I’m sure I wouldn’t start with a simple DELETE, not even with 5,000 or 10,000 rows at a time.

  • We had the same problem today.

    I wrote a script that does the following:

    • Starts a transaction
    • Retrieves the last ID number of the table
    • Uses sp_rename to rename the PK, indexes and constraints on the table
    • Uses sp_rename to rename the table itself
    • Executes the CREATE TABLE script to create a new, empty log table
    • Reseeds the Log_ID to one more than the value retrieved before renaming
    • Ends the transaction (ROLLBACK for testing, COMMIT when it runs clean)

    The script cleaned up 36 million records in a fraction of a second.
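
    A rough T-SQL sketch of that rename-and-swap pattern, using a hypothetical dbo.EventLog table with an identity column Log_ID (all object names and the CREATE TABLE body are placeholders, not the actual script):

    BEGIN TRAN;

    -- Remember where the identity left off before anything is renamed.
    DECLARE @last_id BIGINT = IDENT_CURRENT('dbo.EventLog');
    DECLARE @new_seed BIGINT = @last_id + 1;

    -- Move the existing objects out of the way.
    EXEC sp_rename 'dbo.PK_EventLog', 'PK_EventLog_Old', 'OBJECT';   -- repeat for other constraints; use 'INDEX' for non-constraint indexes
    EXEC sp_rename 'dbo.EventLog', 'EventLog_Old';

    -- Recreate an empty table with the original definition.
    CREATE TABLE dbo.EventLog
    (
        Log_ID   BIGINT IDENTITY(1, 1) NOT NULL CONSTRAINT PK_EventLog PRIMARY KEY,
        LoggedAt DATETIME2 NOT NULL,
        LogText  NVARCHAR(4000) NULL
    );

    -- Continue the identity sequence one past the old table's last value.
    DBCC CHECKIDENT ('dbo.EventLog', RESEED, @new_seed);

    ROLLBACK;   -- switch to COMMIT once it runs clean
    -- The renamed table can then be archived or dropped at leisure: DROP TABLE dbo.EventLog_Old;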

  • This is a very relevant article and far too often code is written in such a way that Log space is consumed as a result of attempting to delete too much data in one action.  Knowing the "sweet spot" in how many rows of data can be easily deleted quickly is crucial.  This will differ based on the construction of the table: number of columns, data types, amount of data per row, etc...

    Here is some simple code I use for handling such deletes.

    DECLARE @CT INT; -- number of records still to be deleted

    SET @CT = (SELECT COUNT(ID) FROM [Table] WHERE CreateDate < GETDATE() - 60);

    WHILE @CT > 0
    BEGIN
        DELETE TOP (4000) FROM [Table] WHERE CreateDate < GETDATE() - 60;

        SET @CT = @CT - 4000;
    END;

    In this scenario, 4000 is the optimal number of rows to delete in a single pass.  This simple loop will quickly and easily cycle through the data to be deleted without blowing up the LOG file.  Each cycle is fully committed, so in the case of a failure only the current batch of 4000 records would need to be rolled back.

    Yes, I fully understand there are other ways to code this, and that is what makes SQL such a great tool.  The goal is to produce the end result in an efficient manner with minimal impact.  I hope this simple tip helps you become better at your job.

  • Sorry, but this script isn't really good in my opinion.

    • Why are you counting the number of rows to delete? That is an unnecessarily big read.
    • You could write WHILE 1 = 1 and exit the endless loop with IF @@ROWCOUNT < 4000 BREAK (see the sketch after this list).
    • When there is no index on CreateDate it will do a lot of table / clustered index scans to find the rows that need to be deleted.
    • 4,000 rows is a really tiny batch size - and no, a log file of 100 MB is not big nowadays (~25 kB for each of the 4k deleted rows; most rows should be much smaller). Unless your database is sitting on an undersized server with too little disk space, slow disks and/or too many databases on it, just increase the log to a reasonable size (or let it grow by itself by setting only the max size) and use a few 100k or even 1 million rows as the batch size. If there are just a few thousand rows to be deleted, it will run just one loop. If there is really a lot of stuff to delete and your log file grows x GB over its usual size, you can shrink it again afterwards.
    • If there is no index on CreateDate you may want to get the MAX(id) first (in this case you could count too, if you need it for logging) and delete based on the id.
    • On current hardware (and with a fitting index), deleting a few 100k rows from a table should take just a few seconds, and since you are deleting old data it should lock only the old pages, so it should still be possible to write to the log table while the delete is running.
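
    A minimal sketch of that @@ROWCOUNT-driven loop, keeping the [Table] / CreateDate placeholders from the script above (the batch size is just an example):

    DECLARE @batch_size INT = 500000;   -- tune to your log size and hardware

    WHILE 1 = 1
    BEGIN
        DELETE TOP (@batch_size)
        FROM [Table]
        WHERE CreateDate < DATEADD(DAY, -60, GETDATE());

        -- A short final batch means nothing is left to delete.
        IF @@ROWCOUNT < @batch_size
            BREAK;
    END;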

    God is real, unless declared integer.

  • Thank you for your feedback, Thomas, but to clarify: my script was based on the example used in the original article.  At that time, the original author showed 4,000 to be the optimal number of records to delete in his example.  I was simply offering a valid solution based on his example.  As I stated, there are numerous ways to achieve the same result; some will just have less impact (resource-wise, performance-wise, etc.) than others.


  • Thomas Franz wrote:

    when you already have partitions, you could simply do a

    TRUNCATE TABLE dbo.big_stuff WITH (PARTITIONS (1 TO $partition.pf_big_stuff_function(date_key)));

    I leverage partitions to truncate data on large tables as well, and I love your implementation that dynamically gets the set of partition IDs to be truncated using the partition function.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
