December 9, 2008 at 12:32 pm
Hello,
I am looking for a way to improve our data archival process. Right now we have the following setup:
ServerA (Production):
prod_db (Full Recovery)
staging_db (Simple Recovery)
ServerB (Archive):
archive_db (Simple Recovery)
staging_db (Simple Recovery)
1) Copy production data from ServerA.prod_db to ServerA.staging_db.
2) Transfer data from ServerA.staging_db to ServerB.staging_db
3) Copy data from ServerB.staging_db to ServerB.archive_db
4) Purge production data (using cursors to avoid blocking the table)
This strategy has been in use for years, and given the improvements in SQL Server 2005 & 2008, I'm under the impression that there is some newer feature (probably SSIS) that would allow for a better process - however, I'm having trouble finding it. I'm curious to hear how other people who have archive servers handle their archiving. Whatever strategy we use must meet the following requirements:
1) Minimal locking to production
2) Minimal performance impact to production
Ideally we'd also like minimal logging, but it's not required.
I am pushing a partitioning strategy, but so far I'm having trouble convincing the rest of the team that the extra maintenance is worth it. :unsure:
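For reference, here's roughly what the partition-switch approach I'm pushing would look like on 2005/2008. All of the function, scheme, table, and column names below are made up for illustration; the point is that switching a whole partition out is a metadata-only operation, so the purge step avoids row-by-row deletes entirely:
-- Hypothetical names for illustration only
CREATE PARTITION FUNCTION pf_ArchiveMonth (datetime)
AS RANGE RIGHT FOR VALUES ('20081001', '20081101', '20081201');
CREATE PARTITION SCHEME ps_ArchiveMonth
AS PARTITION pf_ArchiveMonth ALL TO ([PRIMARY]);
-- dbo.ProdTable would be created ON ps_ArchiveMonth(SomeDateColumn).
-- dbo.ProdTable_Switch must have the same structure and indexes and
-- live on the same filegroup as the partition being switched out.
ALTER TABLE dbo.ProdTable
SWITCH PARTITION 1 TO dbo.ProdTable_Switch; -- metadata-only, near-instant
-- Copy dbo.ProdTable_Switch to the archive server at leisure, then
-- remove the now-empty boundary from the function.
ALTER PARTITION FUNCTION pf_ArchiveMonth()
MERGE RANGE ('20081001');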
December 9, 2008 at 8:05 pm
I just copy directly from prod to archive... no staging. I do it in small enough chunks that I don't need a cursor to prevent blocking.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 9, 2008 at 9:34 pm
I'd go prod -> staging (ServerB) -> archive.
I do this so that if there are data quality issues or a need to cleanse, you can handle them there. SSIS can do this, and it can temporarily store off data, move error rows, etc. I like a separate staging database as a "just in case" and a temp workspace, even with SSIS, but SSIS handles things pretty well, and you could add a few tables to the archive database to hold error and issue data.
If you chunk data, and I'm sure Jeff could give you some good routines that wouldn't block, you won't have issues on prod.
December 11, 2008 at 8:06 am
Why don't you use backups of prod_db and restore them to the archive server? Or do you mean that you actually compare and copy only the new data from ServerA.prod_db to ServerA.staging_db? Have you ever considered database mirroring, or transaction log backups shipped to the archive server?
At my company we don't maintain a live archive copy of the production database if no applications need to use it. We keep backups for some period of time, and if we need any data from that period we just restore the appropriate backup and look into it. What are the advantages of keeping the live archive copy as you mentioned?
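Just to illustrate, the backup/restore route I'm describing is only a two-step copy. The paths and logical file names below are placeholders:
-- On ServerA: COPY_ONLY (2005+) so the production backup chain isn't disturbed
BACKUP DATABASE prod_db
TO DISK = N'\\ServerB\Backups\prod_db.bak'
WITH INIT, COPY_ONLY;
-- On ServerB: restore under a different name for browsing
RESTORE DATABASE prod_db_copy
FROM DISK = N'\\ServerB\Backups\prod_db.bak'
WITH MOVE 'prod_db' TO N'D:\Data\prod_db_copy.mdf',      -- logical names assumed
     MOVE 'prod_db_log' TO N'D:\Data\prod_db_copy.ldf',
     RECOVERY;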
Thanks!
December 11, 2008 at 8:18 am
shirmanov (12/11/2008)
Why don't you use backups of prod_db and restore them to the archive server? Or do you mean that you actually compare and copy only the new data from ServerA.prod_db to ServerA.staging_db? Have you ever considered database mirroring, or transaction log backups shipped to the archive server? At my company we don't maintain a live archive copy of the production database if no applications need to use it. We keep backups for some period of time, and if we need any data from that period we just restore the appropriate backup and look into it. What are the advantages of keeping the live archive copy as you mentioned?
Thanks!
Thanks for your suggestion. However this is part of the business requirements - we must retain data for a certain period of time. We have a completely separate strategy for disaster recovery.
December 11, 2008 at 8:25 am
Steve/Jeff, thank you for your replies.
Jeff, I would be interested in learning more about the "chunk" routines.
Also, do you think this would be a good time to try out snapshot isolation? The data we are archiving is purely inserts - it is never updated, and the only deletes that occur come from this archive process.
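From what I've read, turning it on would just be a couple of database-level settings; prod_db is ours, and the table in the read example is a placeholder:
-- Allow sessions to explicitly request SNAPSHOT isolation
ALTER DATABASE prod_db SET ALLOW_SNAPSHOT_ISOLATION ON;
-- Optionally make ordinary READ COMMITTED use row versioning as well
-- (this one needs the database to itself to take effect)
ALTER DATABASE prod_db SET READ_COMMITTED_SNAPSHOT ON;
-- The archive reads could then opt in like this (placeholder table/column):
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT * FROM dbo.SomeTable WHERE SomeDateColumn < GETDATE() - 120;
COMMIT TRANSACTION;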
December 11, 2008 at 1:29 pm
Hi
Could you tell me about the general criteria for the rows to archive? Is it a date criterion, a flag, or something else?
And do you use rowguids or timestamps to mark your inserts, updates, and deletes, or something like that?
December 11, 2008 at 7:52 pm
Gabe (12/11/2008)
Jeff, I would be interested in learning more about the "chunk" routines.
No problem... Skull Killer is right though... it would be handy if you could describe what the archive criteria are. Best way to do that might be to post the CREATE statement for one of the tables you wish to archive data from. Please be sure to include all indexes and the Primary Key constraint. Then identify the columns that would form the criteria for the archival process.
The "chunk" routines are geared to archive chunks of data that are all logically real close to each other... the big bonus comes into play when the chunks of rows are not only logically close, but are physically close (according to the clustered index), as well. That's why I need to know what the indexes are on the table as well as what the table looks like.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2008 at 2:57 am
Hi there
Regardless of the archive criteria, you can use a statement like this to delete rows and transfer the deleted rows into another table in one step.
DELETE TOP (10)
FROM teste
OUTPUT Deleted.* into teste2
I assume you have at least SQL Server 2005. This way, in one single transaction you can delete from the source table(s) and output the deleted rows into another table (the stage table). You can then transfer the data with SSIS (my recommendation) to wherever you want. I also recommend that you do the deletes with a row-count limit and with the ROWLOCK table hint, to minimize the locks on the production tables (a hinted variant is sketched at the end of this post). It will cost you some performance, but it keeps production locking to a minimum. When you transfer with SSIS, I recommend using (NOLOCK) hints on the selects and table locks in all other statements (at least on the stage tables). That reduces resource usage, since a delete, update, or insert usually starts with row locks and escalates to page locks if the volume of data is big. Here is some sample code.
DECLARE @batchsize INTEGER,
@rowcount INTEGER
SELECT @batchsize = 100000,
@rowcount = @batchsize
WHILE @rowcount = @batchsize
BEGIN
DELETE TOP (@batchsize)
FROM teste
OUTPUT Deleted.* into teste2
SET @rowcount=@batchsize
END
It will work across different databases on the same server, and probably between different servers, but I have not tested that.
The answer to my initial question probably doesn't matter with this solution, since you can put whatever conditions you want in the DELETE statement. And if there is a problem with the output table (connection problems, etc.), no data will be deleted, since the whole statement will fail.
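To show the ROWLOCK recommendation in the delete itself (the table, column, and criteria below are placeholders):
DECLARE @batchsize INTEGER;
SET @batchsize = 100000;
DELETE TOP (@batchsize)
FROM teste WITH (ROWLOCK)             -- ask for row locks on the production table
OUTPUT Deleted.* INTO teste2
WHERE created_date < GETDATE() - 120; -- created_date is a made-up criteria column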
Hope it helps
December 12, 2008 at 8:19 am
I will post requested scripts as soon as I get a free moment.
Skull Killer, I think your sample code will create an infinite loop - but I get the point.
December 12, 2008 at 9:45 am
Yes it will. Inside the WHILE it should be
SET @rowcount = @@ROWCOUNT
A single word makes the difference.
December 12, 2008 at 11:57 am
This is what the code should look like:
DECLARE @batchsize INTEGER,
@rowcount INTEGER;
SET @batchsize = 100000;
SET @rowcount = @batchsize;
WHILE @rowcount = @batchsize
BEGIN
DELETE TOP (@batchsize)
FROM teste
OUTPUT Deleted.* into teste2;
SET @rowcount = @@rowcount;
END
December 12, 2008 at 12:14 pm
Boy, that OUTPUT clause is nice. This is much cleaner than anything I've written in the past.
December 13, 2008 at 5:10 pm
Steve Jones - Editor (12/12/2008)
Boy, that OUTPUT clause is nice. This is much cleaner than anything I've written in the past.
Man, I gotta agree with ya there. Forms its own transaction and everything. Beats the heck out of the way we had to do it in SQL Server 2000.
Hey, just as a bit of a reminder for everyone, part of the reason to do the deletes in batches is to give the system some breathing room for other processes. Deletes like this will slam at least one CPU into the wall, and when that happens, all other processes have to wait. Even though the deletes are done in batches, there's no time in between the batches, and if you're going to do it that way, you might as well just make it one big ol' delete.
My recommendation is to turn it into an archive "crawler" based on how many rows you expect to delete in a run and your "frozen CPU" SLA... and then have a "breathing" delay between batches formed by the loop... like this...
--===== Declare local variables
DECLARE @BatchSize INT, --How many rows to delete at once
@RowCount INT --How many rows got deleted
--===== Set the desired batch size and prime the row counter
-- for the WHILE loop
SELECT @BatchSize = 10000,
@RowCount = @BatchSize
--===== The WHILE equation keeps the last batch from
-- being a "Zero" batch and still allows for the
-- final partial batch
WHILE @RowCount = @BatchSize
BEGIN
--===== Let the system "breathe" so other processes can run
WAITFOR DELAY '00:00:05'
DELETE TOP (@BatchSize)
FROM dbo.SourceTable
OUTPUT Deleted.* INTO dbo.SourceTableArchive
WHERE SomeDateColumn < GETDATE()-120 --Anything older than 120 days gets archived
--===== Remember the rowcount for the WHILE loop
SELECT @RowCount = @@ROWCOUNT
END
... without the delay, there's no benefit for the loop... just lock the table and do the whole thing at once.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2008 at 3:06 pm
Good idea on the delay.
Even without it, I've seen better performance, and some queries still slip into the system as each transaction completes, but you're right, it does still slam the system pretty hard.
I'd say you have to experiment with what batch size and delay work well for you. I might make those parameters or variables in the script so you can adjust them as needed.
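For example, a rough sketch of wrapping Jeff's crawler in a procedure with those knobs exposed as parameters (the procedure, table, and column names are placeholders):
CREATE PROCEDURE dbo.ArchiveCrawl
    @BatchSize INT = 10000,      -- rows per delete
    @Delay CHAR(8) = '00:00:05', -- breathing room between batches
    @KeepDays INT = 120          -- retention window in days
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @RowCount INT;
    SET @RowCount = @BatchSize;
    WHILE @RowCount = @BatchSize
    BEGIN
        WAITFOR DELAY @Delay;    -- WAITFOR DELAY accepts a variable
        DELETE TOP (@BatchSize)
        FROM dbo.SourceTable
        OUTPUT Deleted.* INTO dbo.SourceTableArchive
        WHERE SomeDateColumn < GETDATE() - @KeepDays;
        SET @RowCount = @@ROWCOUNT;
    END
END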