How do you archive data?

  • Hello,

    I am looking for a way to improve our data archival process. Right now we have the following setup:

ServerA (Production):
- prod_db (Full Recovery)
- staging_db (Simple Recovery)

ServerB (Archive):
- archive_db (Simple Recovery)
- staging_db (Simple Recovery)

1) Copy production data from ServerA.prod_db to ServerA.staging_db.

2) Transfer data from ServerA.staging_db to ServerB.staging_db.

3) Copy data from ServerB.staging_db to ServerB.archive_db.

4) Purge production data (using cursors to avoid blocking the tables).

This strategy has been used for years, and given the improvements in SQL Server 2005 & 2008, I'm under the impression that there is some newer feature (probably SSIS) that would allow for a better process - however, I'm having trouble finding it. I'm curious to hear how other people with archive servers handle their archiving. Whatever strategy we use must meet the following requirements:

    1) Minimal locking to production

    2) Minimal performance impact to production

    Ideally we'd also like minimal logging, but it's not required.

I am pushing a partitioning strategy, but so far I'm having trouble convincing the rest of the team that the extra maintenance is worth it. :unsure:
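For what it's worth, the sliding-window setup I'm picturing looks roughly like this - a sketch only, with made-up names, since I haven't built it against our real schema:

-- Hypothetical sliding-window partitioning for the archive candidates.
CREATE PARTITION FUNCTION pfArchive (datetime)
    AS RANGE RIGHT FOR VALUES ('20080101', '20080201', '20080301');

CREATE PARTITION SCHEME psArchive
    AS PARTITION pfArchive ALL TO ([PRIMARY]);

-- The production table would be created ON psArchive(SomeDateColumn).
-- Switching the oldest partition out to an empty table of identical
-- structure (on the same filegroup) is a metadata-only operation,
-- so there's essentially no locking or logging against production:
ALTER TABLE dbo.Orders SWITCH PARTITION 1 TO dbo.Orders_Switch;

-- Copy dbo.Orders_Switch to the archive server (BCP/SSIS),
-- then retire the emptied boundary:
ALTER PARTITION FUNCTION pfArchive() MERGE RANGE ('20080101');

The maintenance the team objects to is keeping the boundaries rolling forward, which would have to be scripted.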

I just copy directly from prod to archive... no staging. I do it in small enough chunks that I don't need a cursor to prevent blocking.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

I'd go prod -> staging (ServerB) -> archive.

I do this so that if there are issues with data quality or a need to cleanse, you can handle them there. SSIS can do this - temporarily store off data, move error rows, etc. I like a separate staging database as a "just in case" and temp workspace, even with SSIS, but SSIS handles things pretty well, and you could add a few tables to the archive database to hold error and issue data.

    If you chunk data, and I'm sure Jeff could give you some good routines that wouldn't block, you won't have issues on prod.
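Rough idea of what I mean by error tables - the names here are hypothetical, and SSIS's error output would normally populate them for you:

-- Hypothetical staging, error, and archive tables in archive_db:
CREATE TABLE dbo.Orders_Stage   (OrderID int, OrderDate datetime, Amount money);
CREATE TABLE dbo.Orders_Errors  (OrderID int, OrderDate datetime, Amount money,
                                 ErrorReason varchar(200),
                                 LoadedAt datetime DEFAULT GETDATE());
CREATE TABLE dbo.Orders_Archive (OrderID int, OrderDate datetime, Amount money);

-- Shunt rows that fail a sanity check into the error table...
INSERT INTO dbo.Orders_Errors (OrderID, OrderDate, Amount, ErrorReason)
SELECT OrderID, OrderDate, Amount, 'OrderDate is NULL or in the future'
FROM dbo.Orders_Stage
WHERE OrderDate IS NULL OR OrderDate > GETDATE();

-- ...then move only the clean rows on to the archive table:
INSERT INTO dbo.Orders_Archive (OrderID, OrderDate, Amount)
SELECT OrderID, OrderDate, Amount
FROM dbo.Orders_Stage
WHERE OrderDate IS NOT NULL AND OrderDate <= GETDATE();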

Why don't you take backups of prod_db and restore them to the archive server? Or do you mean that you actually compare and copy only the new data from ServerA.prod_db to ServerA.staging_db? Have you considered database mirroring, or transaction log backups shipped to the archive server?
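What I mean is along these lines (the paths and logical file names are made up):

BACKUP DATABASE prod_db
    TO DISK = 'X:\Backups\prod_db_20081211.bak'
    WITH INIT;

-- Then, on ServerB:
RESTORE DATABASE prod_db_archive
    FROM DISK = 'X:\Backups\prod_db_20081211.bak'
    WITH MOVE 'prod_db'     TO 'D:\Data\prod_db_archive.mdf',
         MOVE 'prod_db_log' TO 'E:\Logs\prod_db_archive.ldf';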

At my company we don't maintain a live archive copy of a production database if no applications need to use it. We keep backups for some period of time, and if we need any data from them, we just restore the appropriate backup and look into it. What are the advantages of keeping the live archive copy you mentioned?

    Thanks!

  • shirmanov (12/11/2008)


Why don't you take backups of prod_db and restore them to the archive server? Or do you mean that you actually compare and copy only the new data from ServerA.prod_db to ServerA.staging_db? Have you considered database mirroring, or transaction log backups shipped to the archive server?

At my company we don't maintain a live archive copy of a production database if no applications need to use it. We keep backups for some period of time, and if we need any data from them, we just restore the appropriate backup and look into it. What are the advantages of keeping the live archive copy you mentioned?

Thanks!

Thanks for your suggestion. However, this is part of the business requirements - we must retain data for a certain period of time. We have a completely separate strategy for disaster recovery.

Steve/Jeff, thank you for your replies.

    Jeff, I would be interested in learning more about the "chunk" routines.

Also, do you think this would be a good time to try out snapshot isolation? The data we are archiving is insert-only - it is never updated, and the only deletes that occur are from this archive process.
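What I'm picturing is something like this (dbo.SourceTable is a stand-in name, and I haven't measured the tempdb version-store overhead):

-- Turn on snapshot isolation for the production database (2005+).
ALTER DATABASE prod_db SET ALLOW_SNAPSHOT_ISOLATION ON;

-- In the archive session, read a consistent snapshot without
-- taking shared locks on the production tables:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

BEGIN TRANSACTION;
SELECT * FROM dbo.SourceTable WHERE SomeDateColumn < GETDATE() - 120;
COMMIT;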

Hi

Could you tell me the general criteria for the rows you archive? Is it a date criterion, a flag, or something else?

And do you use rowguids or timestamps to mark your inserts, updates, and deletes, or something like that?
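By marking with timestamps I mean something like this (the table and watermark names are made up):

-- A timestamp (a.k.a. rowversion) column is stamped automatically on
-- every insert/update, so the archive job can find rows it hasn't seen:
ALTER TABLE dbo.Orders ADD RowVer timestamp;

-- dbo.ArchiveWatermark is an assumed one-row bookkeeping table that
-- remembers the highest value archived last time:
DECLARE @LastArchived binary(8);
SELECT @LastArchived = MaxRowVer FROM dbo.ArchiveWatermark;

SELECT * FROM dbo.Orders WHERE RowVer > @LastArchived;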

  • Gabe (12/11/2008)


    Jeff, I would be interested in learning more about the "chunk" routines.

    No problem... Skull Killer is right though... it would be handy if you could describe what the archive criteria are. Best way to do that might be to post the CREATE statement for one of the tables you wish to archive data from. Please be sure to include all indexes and the Primary Key constraint. Then identify the columns that would form the criteria for the archival process.

    The "chunk" routines are geared to archive chunks of data that are all logically real close to each other... the big bonus comes into play when the chunks of rows are not only logically close, but are physically close (according to the clustered index), as well. That's why I need to know what the indexes are on the table as well as what the table looks like.

    --Jeff Moden



Hi there

Independent of the archive criteria, you can use a statement like this to delete rows and transfer the deleted rows into another table:

DELETE TOP (10)
FROM teste
OUTPUT Deleted.* INTO teste2

(See Books Online.)

I assume you have at least SQL Server 2005. This way, in a single transaction you can delete from the source table(s) and output the deleted rows into another table (the stage table). You can then transfer the data with SSIS (my recommendation) to wherever you want. I also recommend that you do the deletes with a row-count limit and the ROWLOCK table hint, to minimize locking of the production tables. It will cost you some performance, but it serves the goal of minimizing production locking. When you transfer with SSIS, I recommend NOLOCK hints on the selects and table locks in all other statements (at least on the stage tables). This reduces resource usage, since a DELETE, UPDATE, or INSERT usually starts with row locks and escalates to page locks when the volume of data is big. Here is some sample code:

DECLARE @batchsize INTEGER,
        @rowcount INTEGER

SELECT @batchsize = 100000,
       @rowcount = @batchsize

WHILE @rowcount = @batchsize
BEGIN
    DELETE TOP (@batchsize)
    FROM teste
    OUTPUT Deleted.* INTO teste2

    SET @rowcount = @batchsize
END
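With the ROWLOCK hint I mentioned, the DELETE inside the loop would read like this (a sketch - the engine may still escalate locks if a batch is too big):

DELETE TOP (@batchsize)
FROM teste WITH (ROWLOCK)
OUTPUT Deleted.* INTO teste2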

It will work across databases on the same server, and probably between servers, but I have not tested that.

The answer to my initial question probably doesn't matter with this solution, since you can put whatever conditions you want in the DELETE statement. If there is a problem with the output table (connection problems, ...), no data will be deleted, since the whole statement will fail.

    Hope it helps

I will post the requested scripts as soon as I get a free moment.

    Skull Killer, I think your sample code will create an infinite loop - but I get the point.

Yes, it will. Inside the WHILE loop it should be

SET @rowcount = @@ROWCOUNT

A single word makes the difference.

  • This is what the code should look like:

DECLARE @batchsize INTEGER,
        @rowcount INTEGER;

SET @batchsize = 100000;
SET @rowcount = @batchsize;

WHILE @rowcount = @batchsize
BEGIN
    DELETE TOP (@batchsize)
    FROM teste
    OUTPUT Deleted.* INTO teste2;

    SET @rowcount = @@ROWCOUNT;
END

  • Boy, that OUTPUT clause is nice. This is much cleaner than anything I've written in the past.

  • Steve Jones - Editor (12/12/2008)


    Boy, that OUTPUT clause is nice. This is much cleaner than anything I've written in the past.

Man, I gotta agree with ya there. Forms its own transaction and everything. Beats the heck out of the way we had to do it in SQL Server 2000.

Hey, just as a bit of a reminder for everyone, part of the reason to do the deletes in batches is to give the system some breathing room for other processes. Deletes like this will slam at least one CPU into the wall, and when that happens, all other processes have to wait. Even though the deletes are done in batches, there's no time in between the batches; if you're gonna do it that way, you might as well just make it one big ol' delete.

My recommendation is to turn it into an archive "crawler" based on how many rows you expect to delete in a run and your "frozen CPU" SLA... and then have a "breathing" delay between batches in the loop... like this...

--===== Declare local variables
DECLARE @BatchSize INT, --How many rows to delete at once
        @RowCount  INT  --How many rows got deleted

--===== Set the desired batch size and prime the row counter
     -- for the WHILE loop
SELECT @BatchSize = 10000,
       @RowCount  = @BatchSize

--===== The WHILE condition keeps the last batch from
     -- being a "zero" batch and still allows for the
     -- final partial batch
WHILE @RowCount = @BatchSize
BEGIN
    --===== Let the system "breathe" so other processes can run
    WAITFOR DELAY '00:00:05'

    DELETE TOP (@BatchSize)
    FROM dbo.SourceTable
    OUTPUT Deleted.* INTO dbo.SourceTableArchive
    WHERE SomeDateColumn < GETDATE()-120 --Anything older than 120 days gets archived

    --===== Remember the rowcount for the WHILE loop
    SELECT @RowCount = @@ROWCOUNT
END

... without the delay, there's no benefit to the loop... just lock the table and do the whole thing at once.

    --Jeff Moden



  • Good idea on the delay.

Even without it, I've seen better performance, with some queries still slipping into the system as each transaction completes - but you're right, it does still slam the system pretty hard.

I'd say you have to experiment with what batch size and delay work well for you. I might make those parameters or variables in the script so you can adjust them as needed - something like the sketch below.
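A sketch only - the proc and table names are my assumptions:

-- Jeff's crawler wrapped in a procedure so the batch size and delay
-- can be tuned without editing the script:
CREATE PROCEDURE dbo.ArchiveCrawler
    @BatchSize  INT     = 10000,      --Rows per batch
    @Delay      CHAR(8) = '00:00:05', --hh:mm:ss pause between batches
    @CutoffDays INT     = 120         --Archive anything older than this
AS
BEGIN
    DECLARE @RowCount INT;
    SET @RowCount = @BatchSize;

    WHILE @RowCount = @BatchSize
    BEGIN
        WAITFOR DELAY @Delay;

        DELETE TOP (@BatchSize)
        FROM dbo.SourceTable
        OUTPUT Deleted.* INTO dbo.SourceTableArchive
        WHERE SomeDateColumn < GETDATE() - @CutoffDays;

        SELECT @RowCount = @@ROWCOUNT;
    END
END

Then you'd just schedule EXEC dbo.ArchiveCrawler @BatchSize = 20000, @Delay = '00:00:10'; and adjust from there.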
