Deleting a large amount of data

  • Here's the issue: we have a database that is 119 GB. What we want to do is delete old data that is no longer in use in order to help performance. The three largest tables are the eventlog (7 million rows), Respondents (6 million rows), and the largest, ResponseProperties (17 million rows); of course there is also an assortment of smaller tables related to them through foreign key constraints.

    To do a straight DELETE would take too long and take up too many resources on our production server. We have a window of maybe 4 hours in which to accomplish this. I need ideas that have a low risk factor but can be done quickly. No, TRUNCATE TABLE won't work, because we need to retain the new data. Any help would be great.

     

    Thanks

  • You could divide the delete into a set of smaller batches.

    pseudocode:

    DECLARE @rows INT = 1;
    WHILE @rows > 0
    BEGIN
        -- delete one small batch of rows that match the delete condition
        DELETE FROM test
        WHERE id IN (SELECT TOP 100 id FROM test WHERE deletecondition = 1);
        SET @rows = @@ROWCOUNT;
        -- pause between batches so other transactions can get through
        WAITFOR DELAY '00:00:05';
    END
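
    On SQL Server 2005 and later, the DELETE inside that loop can also be written without the subquery; as a sketch against the same hypothetical test table:

    -- remove one batch of up to 100 matching rows
    DELETE TOP (100) FROM test WHERE deletecondition = 1;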

  • That seems to be a lengthy way to do it. What would you consider the right size for the batches?

  • I have successfully archived 120 GB out of a database in the past using a very similar method to the above while the server was in production, and no one noticed any performance problems. In my case, though, it was both inserting into another database and then deleting, maintaining FK relationships while doing it. I had 5 tables that had to have the same keys archived as one batch.

    I tried to time it so that each batch of archives took about 15 seconds, then gave it 5 seconds to let any user transactions complete. This all depends on how busy your system is, of course; those numbers might have to be reversed. The beauty of doing them in small batches with declared transactions is that you can stop it in the middle and be sure your FK relationships are intact.

  • So this was ongoing for a few days then, I take it? Can you give me more detail on the exact implementation you used to do this? This sounds like it may be a good option for us.

  • I had two versions of the code: one that ran during the day with the delay, and another that did larger batches without the delay, for running at night after any batch processing was done.

    Not sure how much more I can really explain than I have, but I will try.

    I copied the PKs of the top 5000 records of the main table into a temp table, starting with the oldest records.

    Looped through batches of 100 of the 5000, with the WAITFOR at the end of each batch.

    Inside the loop it would:

    Begin a transaction, then insert into the archive table with a select based on the set of 100 for that batch; it would do that for the 5 tables involved. Then it would delete the same records (how this is implemented, and in which order, obviously depends on the relationships between your tables).

    An error in any of these inserts or deletes would roll back that entire batch of 100 and then go on to the next batch. This worked for us, but it could be expanded on depending on the error; for example, if you run out of disk space there would be no reason to continue.
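
    A rough sketch of what that loop could look like in T-SQL follows. Every name in it (dbo.MainTable, dbo.ChildTable, their _Archive counterparts, Id, MainId, CreatedDate, and the two-year cutoff) is an assumption for illustration rather than the actual schema, and only two of the five tables are shown:

    -- Retention cutoff (assumed value); the real range was driven by business rules
    DECLARE @cutoff DATETIME = DATEADD(YEAR, -2, GETDATE());

    -- Copy the PKs of the oldest 5000 eligible rows into a temp table
    SELECT TOP 5000 Id
    INTO #ToArchive
    FROM dbo.MainTable
    WHERE CreatedDate < @cutoff
    ORDER BY CreatedDate;

    DECLARE @batch TABLE (Id INT PRIMARY KEY);

    WHILE EXISTS (SELECT 1 FROM #ToArchive)
    BEGIN
        -- Take the next 100 keys off the work list
        DELETE TOP (100) FROM #ToArchive
        OUTPUT DELETED.Id INTO @batch;

        BEGIN TRY
            BEGIN TRANSACTION;

            -- Archive the child rows and the parent rows for this batch
            INSERT INTO dbo.ChildTable_Archive
            SELECT c.* FROM dbo.ChildTable c JOIN @batch b ON c.MainId = b.Id;

            INSERT INTO dbo.MainTable_Archive
            SELECT m.* FROM dbo.MainTable m JOIN @batch b ON m.Id = b.Id;

            -- Delete in the order the FKs require: children first, then the parent
            DELETE c FROM dbo.ChildTable c JOIN @batch b ON c.MainId = b.Id;
            DELETE m FROM dbo.MainTable m JOIN @batch b ON m.Id = b.Id;

            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            -- Any error rolls back the whole batch of 100; the loop then moves on
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        END CATCH;

        DELETE FROM @batch;

        -- Daytime version: pause so user transactions can complete
        WAITFOR DELAY '00:00:05';
    END

    DROP TABLE #ToArchive;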

    Because I needed to be able to restore from the archive tables, I also kept track of some other information, like the date ranges for each batch archived, with an FK from the date-range table to the archived tables so that I could restore individual batches.
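
    A sketch of one way that tracking could be laid out; the names, and the direction of the FK, are assumptions rather than the original design:

    -- Hypothetical table recording the date range covered by each archived batch
    CREATE TABLE dbo.ArchiveBatch (
        BatchId    INT IDENTITY(1,1) PRIMARY KEY,
        RangeStart DATETIME NOT NULL,
        RangeEnd   DATETIME NOT NULL,
        ArchivedAt DATETIME NOT NULL DEFAULT GETDATE()
    );

    -- The archive tables carry the batch key so a single batch can be restored
    ALTER TABLE dbo.MainTable_Archive
        ADD BatchId INT
            CONSTRAINT FK_MainTable_Archive_ArchiveBatch
            REFERENCES dbo.ArchiveBatch (BatchId);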

    The procedure with the delay I ran manually during the day, since no archiving had been done on this server since day one and it was in dire need of being cleaned up. After the initial archiving had caught up, it was simply run as one batch of 5000 every night (that amounted to about 1.5 GB archived in total among the 5 tables), with a rolling date range set by the business needs. Because of the rules the date range only changed on the first of the month, so after the first 5-6 days of the month it was caught up, but the job still ran every night so that, if there had been busier months in the past, it would always catch up.

    When there were no records to archive the job took about 5 seconds to run; with a full batch of 5000 it took about 3 hours.

    Hope this gives you some ideas for how to do this. It helps a lot if you have a test machine to fine-tune the procedures on, but having relatively small transactions makes it possible to stop the entire process without data loss and without a long rollback. The person who had that job before me had tried to wrap the entire thing in one transaction, which did not work well... Also, running it as one transaction per primary key took too long (on that system; it had a horrible data structure).

     
