Need best way to move historical data out

  • Hi,

    I am getting ready to start a project where I am charged with moving old data out of production and into a newly created historical DB. We have about 8 internal audit tables that are big and full of old data. These tables are barely used and are taking up far too much space and maintenance time.

    I would like to create a process (SSIS?) that looks at the date field in each of the 8 tables and copies anything older than two years into my newly created history DB, then deletes those older records from the source DB.

    I don't know if SSIS is the best method to use. If it is, which containers should I use to move the data over, and how would I then delete it from the source?

    Can I do the mass deletes on my audit source tables without impacting performance/indexes/fragmentation?
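
    Roughly, what I have in mind per table is something like this (the database, table, and column names here are just placeholders):

    -- Placeholder names only (HistoryDB, dbo.AuditLog, AuditDate).
    -- Fix the cutoff once so the copy and the delete use the same boundary.
    DECLARE @Cutoff datetime = DATEADD(YEAR, -2, GETDATE());

    -- Copy rows older than two years into the history database...
    INSERT INTO HistoryDB.dbo.AuditLog (AuditID, AuditDate, AuditData)
    SELECT AuditID, AuditDate, AuditData
    FROM   dbo.AuditLog
    WHERE  AuditDate < @Cutoff;

    -- ...then remove the same rows from the source table.
    DELETE FROM dbo.AuditLog
    WHERE  AuditDate < @Cutoff;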

  • Hi,

    I came across a similar scenario, though a different case (moving data older than 30 days out of the source table).

    A mass deletion using the DELETE command will take a very long time. If the date column you are filtering the move on has an index, the operation will perform much better.
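
    For example, something like this (dbo.AuditLog and AuditDate are just example names) lets the engine seek straight to the old rows instead of scanning the whole table:

    -- Example names only; index the column used in the archive filter.
    CREATE NONCLUSTERED INDEX IX_AuditLog_AuditDate
        ON dbo.AuditLog (AuditDate);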

  • SSIS is as good a tool as any. Other methods are available - linked servers, OPENQUERY and so on. You pay your money, you take your choice.

    You'll use a Data Flow task to copy the data across, then an Execute SQL task to delete the data from the source. You'll want to make sure the data has been copied across before it gets deleted. You can run the whole thing as a single transaction in your package (not a good idea since locks will be held for longer than you might be comfortable with), or you can verify that the data has been copied across properly before starting your delete.

    Do all this overnight, or whenever is the quietest time, because it may affect performance. You can copy and/or delete in batches if you're worried about processing too much data in one go. And, as has already been mentioned, make sure the indexes on your tables support this process.
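
    For example, one simple safeguard is to restrict the delete to rows you can match back to the history database (the table and column names here are only illustrative):

    -- Illustrative names only: delete from the source only those rows
    -- that already exist in the history database.
    DELETE p
    FROM   dbo.AuditLog AS p
    WHERE  p.AuditDate < DATEADD(YEAR, -2, GETDATE())
      AND  EXISTS (SELECT 1
                   FROM   HistoryDB.dbo.AuditLog AS h
                   WHERE  h.AuditID = p.AuditID);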

    John

  • Our routines which delete large amounts of data do the following:

    For all records to be deleted, get the Clustered Index keys into a #TEMP table, ordered by the clustered index and containing an IDENTITY column.

    (We find the selection query is often complex - e.g. date range + OK-to-delete flag - and just deleting batches with that in the WHERE clause means the query time, before the delete, on each iteration becomes a limiting factor on the operation, perhaps made worse by the index fragmenting during the delete?)

    Delete in batches JOINing the #TEMP table to the actual table using a range of IDENTITY values (incremented each loop iteration by the delete batch size)

    Use WAITFOR to delay a couple of seconds to allow other processes to run

    LOOP

    On critical processes we also measure the elapsed time for the DELETE and if it exceeds a MAX threshold we reduce the batch-size by half. If the elapsed time is quick then we increase the batch size by 20%

    For databases in the Full Recovery Model we set Log Backups to run every minute for the duration of the delete, so that the delete operation doesn't cause the log file to grow.
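
    A stripped-down sketch of that loop (the table name, key column, cutoff and batch size are made up for illustration; the adaptive batch sizing is left out for brevity):

    -- Illustrative names/values only.
    DECLARE @BatchSize int = 10000,
            @RowStart  int = 1,
            @MaxRow    int;

    -- 1. Capture the clustered index keys of the rows to delete,
    --    ordered by the clustered key, with an IDENTITY column.
    SELECT IDENTITY(int, 1, 1) AS RowNo,
           AuditID
    INTO   #ToDelete
    FROM   dbo.AuditLog
    WHERE  AuditDate < DATEADD(YEAR, -2, GETDATE())
    ORDER  BY AuditID;

    SELECT @MaxRow = MAX(RowNo) FROM #ToDelete;

    -- 2. Delete in batches, joining a range of IDENTITY values
    --    back to the base table.
    WHILE @RowStart <= @MaxRow
    BEGIN
        DELETE a
        FROM   dbo.AuditLog AS a
               JOIN #ToDelete AS d
                 ON d.AuditID = a.AuditID
        WHERE  d.RowNo BETWEEN @RowStart AND @RowStart + @BatchSize - 1;

        SET @RowStart = @RowStart + @BatchSize;

        -- 3. Pause briefly so other processes get a look-in.
        WAITFOR DELAY '00:00:02';
    END

    DROP TABLE #ToDelete;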

  • krypto69 (9/29/2015)


    These tables are barely used and are taking up way too much space and time for maintenance.

    Are ANY rows that are older than 2 years EVER updated? Also, how many rows per table are we currently talking about and how many rows per month does each table grow?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here are my thoughts on how you might do this.

    Use SSIS as your bridge between servers since that's what it does.

    Create stored procedures to do the data selection. Since you're going through several tables you ~might~ be able to use a Foreach Loop container, but I'm not sure. You'll have to test passing variables into your stored procedure and using dynamic SQL.

    Create another stored procedure to do the data delete.
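
    As a rough sketch of the kind of procedure I mean (the procedure, parameter and database names are only examples, and it assumes a same-named, same-shaped table already exists in the history database):

    -- Example only; not tested.
    CREATE PROCEDURE dbo.usp_CopyAuditToHistory
        @TableName  sysname,
        @DateColumn sysname,
        @CutoffDate datetime
    AS
    BEGIN
        DECLARE @sql nvarchar(max);

        SET @sql = N'INSERT INTO HistoryDB.dbo.' + QUOTENAME(@TableName) +
                   N' SELECT * FROM dbo.' + QUOTENAME(@TableName) +
                   N' WHERE ' + QUOTENAME(@DateColumn) + N' < @Cutoff;';

        EXEC sp_executesql @sql, N'@Cutoff datetime', @Cutoff = @CutoffDate;
    END
    -- A companion procedure would run the matching DELETE once the copy
    -- has been verified.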

    If it works - great! You have a reusable package that will parse out your data into historical tables. You can run it on a schedule to keep the active tables clean.

    At least this can be a starting point for you to do some design work. Also consider the structure of your historical tables based on the amount of data. Partitioning might also make sense since the data seems to be date-based and doesn't change much.
