why we should delete old data and what could be the chunk size for deleting 2.5 million records

  • Hi,

    I am using SQL Server 2008 R2. My database is 1 TB in size with 13% free space left.

    Every day 2.5 million records are inserted into the table, and 5,000 records are inserted every second. Currently we have 200 days' worth of data, and we would like to purge data older than 30 days. Could you please let me know the below:

    1. Why should we delete the older data (for example, data from 2014) first?

    2. What could be the chunk size for deleting 2.5 million records?

    Could you please kindly advise ASAP?

  • somadattamallick (4/18/2016)


    1. Why should we delete the older data (for example, data from 2014) first?

    I do not see any compelling reason to do so from a technical point of view. From a logical point of view, I guess that the intermediate results make more sense: everyone understands having data from, e.g., October 13th, 2015 until now at some point during the purge process; having data from February 12th, 2014 to August 8th, 2014 plus data from March 20, 2016 until now is less intuitive.

    There might be some performance differences, but I would be surprised if it's more than a tiny fraction of a percent.

    2. What could be the chunk size for deleting 2.5 million records?

    No simple answer to that. Test on a test system as similar as possible to production, if possible. Otherwise, monitor the process closely and be ready to step in and adjust.

    Make sure to include CHECKPOINT statements in the process, and make sure to not do everything in a single transaction. (If you have to delete data from related tables, then do make that a transaction - but do not make the transaction bigger than a single chunk).

    Run the purge during low usage hours, and increase the frequency of log backups during the process as you will generate a lot of extra log records.

    And monitor. Very closely. Make sure you can abort the process if needed.

    (Oh, and probably the most obvious: have a second eye look at your code to make sure you are not accidentally reversing the logic and deleting only the most recent data and keeping the old stuff. And have a backup ready in case you still have it wrong).
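
    As a rough illustration of that advice, a chunked purge along those lines might look something like the sketch below; the table name dbo.MyBigTable, the date column InsertDate, and the batch size are placeholders to adjust for your own system:

        DECLARE @BatchSize int = 50000;         -- tune after testing
        DECLARE @Cutoff datetime = DATEADD(DAY, -30, GETDATE());
        DECLARE @Rows int = 1;

        WHILE @Rows > 0
        BEGIN
            BEGIN TRANSACTION;                  -- one transaction per chunk, never bigger

            DELETE TOP (@BatchSize)
            FROM dbo.MyBigTable
            WHERE InsertDate < @Cutoff;

            SET @Rows = @@ROWCOUNT;

            COMMIT TRANSACTION;

            CHECKPOINT;                         -- flush dirty pages between chunks

            -- In full recovery, also run log backups more frequently while this loop runs.
        END;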


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • In addition to what Hugo has stated, you may consider looking into partitioning your table by date (month) if you can. This requires Enterprise Edition, but it would give you the ability to remove your 2.5 million records pretty easily.
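
    For illustration only (all names here are made up; this assumes the table has already been partitioned by month and that dbo.MyBigTable_Staging is an empty table with an identical structure on the same filegroup as the oldest partition), removing a month of data then becomes a near-instant metadata operation:

        ALTER TABLE dbo.MyBigTable
            SWITCH PARTITION 1 TO dbo.MyBigTable_Staging;   -- move the oldest month out as a metadata-only change

        TRUNCATE TABLE dbo.MyBigTable_Staging;              -- then discard those rows almost instantly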

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • John Rowan (4/20/2016)


    In addition to what Hugo has stated, you may consider looking into partitioning your table by date (month) if you can. This requires Enterprise Edition, but it would give you the ability to remove your 2.5 million records pretty easily.

    That is absolutely a great suggestion to make the purge job a lot easier in the future, so this is a great addition.

    It will not help for the current task, though.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Yes, I should have been more clear. This will not help w/ the existing delete; it was a suggestion to make the operation much simpler in the future.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • John Rowan (4/20/2016)


    In addition to what Hugo has stated, you may consider looking into partitioning your table by date (month) if you can. This requires Enterprise Edition, but it would give you the ability to remove your 2.5 million records pretty easily.

    Not quite true. Properly built partitioned views (available in Express, Standard, and Enterprise editions) are just as effective (and more so, in some cases) as partitioned tables. They just require a different thought process and a small amount of slightly more complicated code.

    As to how they can be as effective or even more so:

    1. Older "partitions" (tables) can have different indexing to support frequent lookups while newer "partitions" (tables) can be indexed differently to support more rapid inserts and updates.

    2. Done properly, you can duplicate the ability to switch large volumes of data in and out of the partitioned view just as surely as using SWITCH but without the headaches of needing to duplicate the target before switching out (for example).

    3. Done properly, both partitioned tables and partitioned views support online piecemeal restores.

    4. Unlike partitioned tables, which eventually require all file groups to be restored before you can take a backup during a DR restore effort, properly done partitioned views (1 database per partition even if it holds only one table) will allow you to do "true" partial restores. This is especially important if you copy prod data to a dev box. You don't necessarily want to copy a terabyte or two... you just want to copy the last 2 or 3 months' worth. Incredibly difficult (if even possible... I've not found a way yet) to do with partitioned tables... ultra easy to do with partitioned views.

    5. Partitioned views are one way to overcome database size limitations in the Express Edition (although I've not personally tried it). (https://ianmosleydatastrategy.com/2013/05/15/sql-express-larger-than-10gb/)

    6. Partitioned Views can be "distributed" across multiple databases (and I've confirmed that by having done it in the past) whereas Partitioned Tables cannot (according to what I've read in BOL... haven't tried because I'm getting ready to scrap my partitioned tables in favor of partitioned views). As with File Groups, that also means that you can put your older, less frequently accessed data on less expensive slower disks and keep your current stuff on faster more expensive disks.

    The only "problems" with partitioned views that seem to be a serious concern for some folks are 1) you are limited in the number of partitions you can have compared to partitioned tables, but there are ways around even that nuance if you put your mind to it, and 2) identity and computed columns need a little help because inserts through the view must cover all columns. Of course, there are workarounds for that problem as well (an INSTEAD OF trigger can work a treat here, and the trigger can be placed on the same partitioned view that is used for selects).
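
    For readers who have not built one before, a minimal sketch of a local partitioned view might look like the following; every name is invented for illustration, and real designs (including the ones described above) will be more involved:

        CREATE TABLE dbo.Orders_2016_03
        (
            OrderID   int      NOT NULL,
            OrderDate datetime NOT NULL
                CONSTRAINT CK_Orders_2016_03 CHECK (OrderDate >= '20160301' AND OrderDate < '20160401'),
            CONSTRAINT PK_Orders_2016_03 PRIMARY KEY (OrderID, OrderDate)
        );

        CREATE TABLE dbo.Orders_2016_04
        (
            OrderID   int      NOT NULL,
            OrderDate datetime NOT NULL
                CONSTRAINT CK_Orders_2016_04 CHECK (OrderDate >= '20160401' AND OrderDate < '20160501'),
            CONSTRAINT PK_Orders_2016_04 PRIMARY KEY (OrderID, OrderDate)
        );
        GO

        -- The CHECK constraints are what let the optimizer skip "partitions" (tables) it does not need.
        CREATE VIEW dbo.Orders
        AS
        SELECT OrderID, OrderDate FROM dbo.Orders_2016_03
        UNION ALL
        SELECT OrderID, OrderDate FROM dbo.Orders_2016_04;
        GO

    Purging a whole month is then just a change to the view plus a DROP TABLE (or TRUNCATE TABLE) on the table that falls out of retention.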

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Good stuff, Jeff. I never worked with a DB large enough to really consider partitioning until after 2005 when table partitioning came out, and I've worked primarily in Enterprise Edition since then. How easily we get into our little boxes and forget what's outside of them.

    I've been working in Standard Edition for the last year and I've recently had to spin up a table for some BI/reporting that has 50mm rows. I had to use indexed views to get some of our queries to perform well, but I never really considered old-school partitioned views. I may have to revisit this as an option as our solution has some scaling to do yet.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • Sometimes it is quicker and easier to copy out the data you want to keep, empty the table using truncate, and then put the data you want to keep back in. You can't use truncate on the table if other tables reference it using foreign keys. You would also need to arrange some system downtime otherwise you could end up losing data. Again, this is something to test on your test system before you go anywhere near production.
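
    A sketch of that approach with assumed table and column names (and assuming the table has no identity column; if it does, you would need an explicit column list and SET IDENTITY_INSERT):

        -- Copy the rows worth keeping to a holding table.
        SELECT *
        INTO   dbo.MyBigTable_Keep
        FROM   dbo.MyBigTable
        WHERE  InsertDate >= DATEADD(DAY, -30, GETDATE());

        -- Empty the big table almost instantly (not possible if foreign keys reference it).
        TRUNCATE TABLE dbo.MyBigTable;

        -- Put the kept rows back, then drop the holding table.
        INSERT INTO dbo.MyBigTable
        SELECT *
        FROM   dbo.MyBigTable_Keep;

        DROP TABLE dbo.MyBigTable_Keep;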

  • If the table is clustered first on that datetime, the purge is easy and it should have very minimal effect on current queries.

    If it's clustered differently, you have a more difficult task. If it's clustered on identity, you can use workarounds to still get a good delete process; it's just not quite as clean (and it's very likely the table isn't giving best performance overall, but that's a separate issue).

    Edit: That is, the critical things for these deletes are how the table is clustered and, if there are a lot of them, how many nonclustered indexes it has. For only 2.5M rows a day, for a total of 30 days, you very likely don't need to partition this particular table.
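
    One possible workaround when the clustered index is on an identity column (just one option, not necessarily the one being referred to here, and the names are assumptions) is to translate the date cutoff into an ID boundary once, then drive the batched deletes by the clustered key:

        DECLARE @MaxIDToDelete int;

        -- Assumes the ID and InsertDate columns both increase over time.
        SELECT @MaxIDToDelete = MAX(ID)
        FROM   dbo.MyBigTable
        WHERE  InsertDate < DATEADD(DAY, -30, GETDATE());

        -- Run this in a loop (with a delay between batches) until @@ROWCOUNT returns 0.
        DELETE TOP (10000)
        FROM   dbo.MyBigTable
        WHERE  ID <= @MaxIDToDelete;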

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I delete in batches of 10,000 or so. Perhaps test different sizes for performance.

    If you only want it to run during off-hours, you could check the time of day and exit after 8 AM, then schedule it to start again the next day.

    Experiment with batch size & delay time to minimize performance problems for other users.

    For example:

    Next_Delete:
        /* small batch to reduce blocking, and can be stopped
           & restarted any time */
        DELETE TOP (10000)
        FROM MyTable
        WHERE OrderDate < GETDATE() - 120;       -- assumes an index on OrderDate; adjust the cutoff to your retention period

        IF @@ROWCOUNT = 0 GOTO Finished_Update;  -- NO MORE

        WAITFOR DELAY '00:00:02';                -- 2-second break to reduce blocking

        GOTO Next_Delete;

    Finished_Update:
        PRINT 'Purge finished.';                 -- a label must be followed by a statement

  • John Rowan (4/21/2016)


    Good stuff, Jeff. I never worked with a DB large enough to really consider partitioning until after 2005 when table partitioning came out, and I've worked primarily in Enterprise Edition since then. How easily we get into our little boxes and forget what's outside of them.

    I've been working in Standard Edition for the last year and I've recently had to spin up a table for some BI/reporting that has 50mm rows. I had to use indexed views to get some of our queries to perform well, but I never really considered old-school partitioned views. I may have to revisit this as an option as our solution has some scaling to do yet.

    Thanks for the feedback, John. Just to be sure, though, partitioning is NOT a performance enhancement. Yes, sometimes it works out that way and sometimes really bad code can actually take advantage of "partition elimination" (regardless of which type of partitioning is in play) but partitioning actually slows down most queries a bit because of things like there being multiple b-Trees to go through. A properly written query on a properly indexed monolithic table will almost always beat the same query on an equally properly indexed partitioned table.

    The BIG reasons to partition are to...

    1. Enable piecemeal restores (online and otherwise).

    2. Reduce index maintenance (especially for "WORM" tables where older partitions can be sequestered to Read_Only file groups).

    3. Reduce backups for the same reasons that index maintenance can be reduced.

    4. Put older, less frequently accessed data (usually the biggest tables in a database) on less expensive hardware while still treating the table/view as a single unit.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
