December 9, 2008 at 12:32 pm
Hello,
I am looking for a way to improve our data archival process. Right now we have the following setup:
ServerA (Production):
prod_db (Full Recovery)
staging_db (Simple Recovery)
ServerB (Archive):
archive_db (Simple Recovery)
staging_db (Simple Recovery)
1) Copy production data from ServerA.prod_db to ServerA.staging_db.
2) Transfer data from ServerA.staging_db to ServerB.staging_db
3) Copy data from ServerB.staging_db to ServerB.archive_db
4) Purge production data (using cursors to avoid blocking the table)
This strategy has been in use for years, and given the improvements in SQL Server 2005 & 2008, I'm under the impression that there is some newer feature (probably SSIS) that would allow for a better process - however, I'm having trouble finding it. I'm curious to hear how other people who have archive servers handle their archiving. Whatever strategy we use must meet the following requirements:
1) Minimal locking to production
2) Minimal performance impact to production
Ideally we'd also like minimal logging, but it's not required.
I am pushing a partitioning strategy, but so far I'm having trouble convincing the rest of the team that the extra maintenance is worth it. :unsure:
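For reference, here's roughly what the partition-switch approach I'm pushing would look like on 2005/2008. All of the function, scheme, table, and column names below are made up for illustration; the point is that switching a whole partition out is a metadata-only operation, so the purge step avoids row-by-row deletes entirely:
-- Hypothetical names for illustration only
CREATE PARTITION FUNCTION pf_ArchiveMonth (datetime)
AS RANGE RIGHT FOR VALUES ('20081001', '20081101', '20081201');
CREATE PARTITION SCHEME ps_ArchiveMonth
AS PARTITION pf_ArchiveMonth ALL TO ([PRIMARY]);
-- dbo.ProdTable would be created ON ps_ArchiveMonth(SomeDateColumn).
-- dbo.ProdTable_Switch must have the same structure and indexes and
-- live on the same filegroup as the partition being switched out.
ALTER TABLE dbo.ProdTable
SWITCH PARTITION 1 TO dbo.ProdTable_Switch; -- metadata-only, near-instant
-- Copy dbo.ProdTable_Switch to the archive server at leisure, then
-- remove the now-empty boundary from the function.
ALTER PARTITION FUNCTION pf_ArchiveMonth()
MERGE RANGE ('20081001');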
December 9, 2008 at 8:05 pm
I just copy directly from prod to archive... no staging. I do it in small enough chunks that I don't need a cursor to prevent blocking.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 9, 2008 at 9:34 pm
I'd go prod -> staging (ServerB) -> archive.
I do this so that if there are data quality issues or a need to cleanse, you can handle them there. SSIS can do this, and it can temporarily store off data, move error rows, etc. I like a separate staging database as a "just in case" and a temp workspace, even with SSIS, but SSIS handles things pretty well, and you could add a few tables to the archive database to hold error and issue data.
If you chunk data, and I'm sure Jeff could give you some good routines that wouldn't block, you won't have issues on prod.
December 11, 2008 at 8:06 am
Why don't you use backups of prod_db and restore them to the archive server? Or do you mean that you actually compare and copy only the new data from ServerA.prod_db to ServerA.staging_db? Have you ever considered database mirroring, or transaction log backups shipped to the archive server?
At my company we don't maintain a live archive copy of the production database if no applications need to use it. We keep backups for some period of time, and if we need any data from that period we just restore the appropriate backup and look into it. What are the advantages of keeping the live archive copy as you mentioned?
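Just to illustrate, the backup/restore route I'm describing is only a two-step copy. The paths and logical file names below are placeholders:
-- On ServerA: COPY_ONLY (2005+) so the production backup chain isn't disturbed
BACKUP DATABASE prod_db
TO DISK = N'\\ServerB\Backups\prod_db.bak'
WITH INIT, COPY_ONLY;
-- On ServerB: restore under a different name for browsing
RESTORE DATABASE prod_db_copy
FROM DISK = N'\\ServerB\Backups\prod_db.bak'
WITH MOVE 'prod_db' TO N'D:\Data\prod_db_copy.mdf',      -- logical names assumed
     MOVE 'prod_db_log' TO N'D:\Data\prod_db_copy.ldf',
     RECOVERY;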
Thanks!
December 11, 2008 at 8:18 am
shirmanov (12/11/2008)
Why don't you use backups of prod_db and restore them to the archive server? Or do you mean that you actually compare and copy only the new data from ServerA.prod_db to ServerA.staging_db? Have you ever considered database mirroring, or transaction log backups shipped to the archive server? At my company we don't maintain a live archive copy of the production database if no applications need to use it. We keep backups for some period of time, and if we need any data from that period we just restore the appropriate backup and look into it. What are the advantages of keeping the live archive copy as you mentioned?
Thanks!
Thanks for your suggestion. However this is part of the business requirements - we must retain data for a certain period of time. We have a completely separate strategy for disaster recovery.
December 11, 2008 at 8:25 am
Steve/Jeff, thank you for your replies.
Jeff, I would be interested in learning more about the "chunk" routines.
Also, do you think this would be a good time to try out snapshot isolation? The data we are archiving is purely inserts - it is never updated, and the only deletes that occur come from this archive process.
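From what I've read, turning it on would just be a couple of database-level settings; prod_db is ours, and the table in the read example is a placeholder:
-- Allow sessions to explicitly request SNAPSHOT isolation
ALTER DATABASE prod_db SET ALLOW_SNAPSHOT_ISOLATION ON;
-- Optionally make ordinary READ COMMITTED use row versioning as well
-- (this one needs the database to itself to take effect)
ALTER DATABASE prod_db SET READ_COMMITTED_SNAPSHOT ON;
-- The archive reads could then opt in like this (placeholder table/column):
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT * FROM dbo.SomeTable WHERE SomeDateColumn < GETDATE() - 120;
COMMIT TRANSACTION;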
December 11, 2008 at 1:29 pm
Hi
Could you tell me about the general criteria for the rows to archive? Is it a date criterion, a flag, or something else?
And do you use rowguids or timestamps to mark your inserts, updates, and deletes, or something like that?
December 11, 2008 at 7:52 pm
Gabe (12/11/2008)
Jeff, I would be interested in learning more about the "chunk" routines.
No problem... Skull Killer is right though... it would be handy if you could describe what the archive criteria are. Best way to do that might be to post the CREATE statement for one of the tables you wish to archive data from. Please be sure to include all indexes and the Primary Key constraint. Then identify the columns that would form the criteria for the archival process.
The "chunk" routines are geared to archive chunks of data that are all logically real close to each other... the big bonus comes into play when the chunks of rows are not only logically close, but are physically close (according to the clustered index), as well. That's why I need to know what the indexes are on the table as well as what the table looks like.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2008 at 2:57 am
Hi there
Regardless of the archive criteria, you can use a statement like this to delete rows and transfer the deleted rows into another table in one step.
DELETE TOP (10)
FROM teste
OUTPUT Deleted.* into teste2
I assume you have at least SQL Server 2005. This way, in one single transaction you can delete from the source table(s) and output the deleted rows into another table (the stage table). You can then transfer the data with SSIS (my recommendation) to wherever you want. I also recommend that you do the deletes with a row-count limit and with the ROWLOCK table hint, to minimize the locks on the production tables (a hinted variant is sketched at the end of this post). It will cost you some performance, but it keeps production locking to a minimum. When you transfer with SSIS, I recommend using (NOLOCK) hints on the selects and table locks in all other statements (at least on the stage tables). That reduces resource usage, since a delete, update, or insert usually starts with row locks and escalates to page locks if the volume of data is big. Here is some sample code.
DECLARE @batchsize INTEGER,
@rowcount INTEGER
SELECT @batchsize = 100000,
@rowcount = @batchsize
WHILE @rowcount = @batchsize
BEGIN
DELETE TOP (@batchsize)
FROM teste
OUTPUT Deleted.* into teste2
SET @rowcount=@batchsize
END
It will work across different databases on the same server, and probably between different servers, but I have not tested that.
The answer to my initial question probably doesn't matter with this solution, since you can put whatever conditions you want in the DELETE statement. And if there is a problem with the output table (connection problems, etc.), no data will be deleted, since the whole statement will fail.
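To show the ROWLOCK recommendation in the delete itself (the table, column, and criteria below are placeholders):
DECLARE @batchsize INTEGER;
SET @batchsize = 100000;
DELETE TOP (@batchsize)
FROM teste WITH (ROWLOCK)             -- ask for row locks on the production table
OUTPUT Deleted.* INTO teste2
WHERE created_date < GETDATE() - 120; -- created_date is a made-up criteria column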
Hope it helps
December 12, 2008 at 8:19 am
I will post requested scripts as soon as I get a free moment.
Skull Killer, I think your sample code will create an infinite loop - but I get the point.
December 12, 2008 at 9:45 am
Yes it will. Inside the WHILE it should be
SET @rowcount = @@ROWCOUNT
A single word makes the difference.
December 12, 2008 at 11:57 am
This is what the code should look like:
DECLARE @batchsize INTEGER,
@rowcount INTEGER;
SET @batchsize = 100000;
SET @rowcount = @batchsize;
WHILE @rowcount = @batchsize
BEGIN
DELETE TOP (@batchsize)
FROM teste
OUTPUT Deleted.* into teste2;
SET @rowcount = @@rowcount;
END
December 12, 2008 at 12:14 pm
Boy, that OUTPUT clause is nice. This is much cleaner than anything I've written in the past.
December 13, 2008 at 5:10 pm
Steve Jones - Editor (12/12/2008)
Boy, that OUTPUT clause is nice. This is much cleaner than anything I've written in the past.
Man, I gotta agree with ya there. Forms its own transaction and everything. Beats the heck out of the way we had to do it in SQL Server 2000.
Hey, just as a bit of a reminder for everyone, part of the reason to do the deletes in batches is to give the system some breathing room for other processes. Deletes like this will slam at least one CPU into the wall, and when that happens, all other processes have to wait. Even though the deletes are done in batches, there's no time in between the batches, and if you're going to do it that way, you might as well just make it one big ol' delete.
My recommendation is to turn it into an archive "crawler" based on how many rows you expect to delete in a run and your "frozen CPU" SLA... and then have a "breathing" delay between batches formed by the loop... like this...
--===== Declare local variables
DECLARE @BatchSize INT, --How many rows to delete at once
@RowCount INT --How many rows got deleted
--===== Set the desired batch size and prime the row counter
-- for the WHILE loop
SELECT @BatchSize = 10000,
@RowCount = @BatchSize
--===== The WHILE equation keeps the last batch from
-- being a "Zero" batch and still allows for the
-- final partial batch
WHILE @RowCount = @BatchSize
BEGIN
--===== Let the system "breathe" so other processes can run
WAITFOR DELAY '00:00:05'
DELETE TOP (@BatchSize)
FROM dbo.SourceTable
OUTPUT Deleted.* INTO dbo.SourceTableArchive
WHERE SomeDateColumn < GETDATE()-120 --Anything older than 120 days gets archived
--===== Remember the rowcount for the WHILE loop
SELECT @RowCount = @@ROWCOUNT
END
... without the delay, there's no benefit for the loop... just lock the table and do the whole thing at once.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2008 at 3:06 pm
Good idea on the delay.
Even without it, I've seen better performance, and some queries still slip into the system as each transaction completes, but you're right, it does still slam the system pretty hard.
I'd say you have to experiment with what batch size and delay work well for you. I might make those parameters or variables in the script so you can adjust them as needed.
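For example, a rough sketch of wrapping Jeff's crawler in a procedure with those knobs exposed as parameters (the procedure, table, and column names are placeholders):
CREATE PROCEDURE dbo.ArchiveCrawl
    @BatchSize INT = 10000,      -- rows per delete
    @Delay CHAR(8) = '00:00:05', -- breathing room between batches
    @KeepDays INT = 120          -- retention window in days
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @RowCount INT;
    SET @RowCount = @BatchSize;
    WHILE @RowCount = @BatchSize
    BEGIN
        WAITFOR DELAY @Delay;    -- WAITFOR DELAY accepts a variable
        DELETE TOP (@BatchSize)
        FROM dbo.SourceTable
        OUTPUT Deleted.* INTO dbo.SourceTableArchive
        WHERE SomeDateColumn < GETDATE() - @KeepDays;
        SET @RowCount = @@ROWCOUNT;
    END
END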