November 26, 2007 at 8:21 am
I routinely perform transformations and model scoring in SQL 2005 on tables of extract data that does not change, roughly 150 million records by several hundred columns.
When things go bad (network errors, bad data, what not), SQL 2005 decides to roll everything back to its original state. This process can take hours and is not needed. I am looking for either of the following:
- Is there some way to turn this rollback behavior off? I can clean up manually afterwards.
- Is there some way to commit, say, every 5000 records so that no more than the last "batch" is ever rolled back?
thanks a bunch
November 26, 2007 at 8:46 am
Yep, you can batch your transaction if business rules allow it. You still could end up with stuff to clean up, but I wrote about this before:
http://www.sqlservercentral.com/articles/Miscellaneous/batching/151/
You cannot turn off rollbacks. Committing transactions periodically is the way to limit how much work can ever be rolled back.
November 26, 2007 at 9:03 am
Thanks Steve!
I looked into batch-type jobs (see below), but that approach takes me from ~20 hrs of processing time to 3-4 days. Increased processing time will increase the risk of failure. How will your approach change that?
use <db_name>;   -- database name lost in the original post
go
declare @vMin int, @vMax int, @vTotal int;
set @vTotal = 142607203;
set @vMin = 1;
set @vMax = 5000;
-- copy the source in 5,000-row slices keyed on xid
while @vMin <= @vTotal   -- @vMin, so the final partial batch is not skipped
begin
    insert into dbo.trans_Part
    select C1 ... c128
    from <source_table> with (nolock, index(iXID))   -- source table name lost in the original post
    where xid between @vMin and @vMax;
    set @vMin = @vMax + 1;
    set @vMax = @vMax + 5000;
end
November 26, 2007 at 9:59 am
Heh... BCP the data out to a file... use Bulk Insert to load it... amazing how fast it can be...
Should never have any "data errors"... data should already be gleaned to not cause errors. To do otherwise is "programming by exception" and will always create a problem.
--Jeff Moden
Change is inevitable... Change for the better is not.
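A rough sketch of the bcp-out / BULK INSERT approach described above. The file path, server name, and table names are placeholders; BATCHSIZE makes the load commit every 50,000 rows, so a failure only rolls back the current batch, and TABLOCK (together with a SIMPLE or BULK_LOGGED recovery model) can allow the load to be minimally logged.
-- 1) Export the extract in native format from the command line:
--    bcp <db_name>.dbo.<source_table> out D:\extract\trans.dat -n -T -S <server>
-- 2) Load it with BULK INSERT, committing every 50,000 rows:
BULK INSERT dbo.trans_Part
FROM 'D:\extract\trans.dat'
WITH (
    DATAFILETYPE = 'native',
    BATCHSIZE    = 50000,
    TABLOCK
);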
November 26, 2007 at 10:10 am
Thanks for the suggestion, but at 150 million rows by a couple hundred columns it is less practical.
Would a CLR + .NET solution allow some type of batching, or setting of "batch size + commit"? Anyone?
November 26, 2007 at 11:10 am
You could try moving your processes into SSIS. If it is as simple as your example (copying from one table to another), you could try the SQL Server Destination and the OLE DB Destination objects. They allow you to configure batch and commit sizes and work pretty fast.
Otherwise, you could take your example of processing batches broken up into pieces, but instead of running the stored procedures in a loop, send them into a service broker queue. If you do this, you can configure your broker queue to process groups of records in parallel and still retain control over how many parallel processes can run. Then, by adjusting the batch size and the number of batches running in parallel, you could probably get back to a number far closer to your original performance without having a single, large batch.
As a simple test, run it as a loop in five query windows. In each window, have it only run every 5th batch (use mod or something to get it to skip; see the sketch below) and see if your performance improves. Keep in mind that this parallel processing may hit your CPU or disk subsystems pretty hard.
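A rough sketch of that "every 5th batch" test, reusing the loop posted earlier in the thread; @vWindow, dbo.trans_Source, and the column list are placeholders (run one copy per query window, with @vWindow set to 0 through 4):
declare @vMin int, @vMax int, @vTotal int, @vBatch int, @vWindow int;
set @vTotal  = 142607203;
set @vWindow = 0;            -- 0 in the first query window, 1 in the second, ... 4 in the fifth
set @vBatch  = 0;
set @vMin    = 1;
set @vMax    = 5000;
while @vMin <= @vTotal
begin
    if @vBatch % 5 = @vWindow   -- each window takes every 5th batch
    begin
        insert into dbo.trans_Part (C1, C2 /* ...remaining columns... */)
        select C1, C2 /* ...remaining columns... */
        from dbo.trans_Source with (index(iXID))   -- placeholder source-table name
        where xid between @vMin and @vMax;
    end
    set @vBatch = @vBatch + 1;
    set @vMin   = @vMax + 1;
    set @vMax   = @vMax + 5000;
end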
November 26, 2007 at 2:32 pm
Not sure, but maybe this is what you want?
use <db_name>;   -- database name lost in the original post
go
declare @vMin int, @vMax int, @vTotal int;
set @vTotal = 142607203;
set @vMin = 1;
set @vMax = 5000;
while @vMin <= @vTotal   -- @vMin, so the final partial batch is not skipped
begin
    BEGIN TRANSACTION
    insert into dbo.trans_Part
    select C1 ... c128
    from <source_table> with (nolock, index(iXID))   -- source table name lost in the original post
    where xid between @vMin and @vMax;
    -- at most one 5,000-row slice is ever rolled back
    IF @@ERROR <> 0
        ROLLBACK TRANSACTION
    ELSE
        COMMIT TRANSACTION
    set @vMin = @vMax + 1;
    set @vMax = @vMax + 5000;
end
I would definitely follow Jeff's bcp advice if you can...
* Noel
November 26, 2007 at 3:39 pm
Pieter Huiberts (11/26/2007)
Thanks for the suggestion, but at 150 million rows by a couple hundred columns it is less practical. Would a CLR + .NET solution allow some type of batching, or setting of "batch size + commit"? Anyone?
Wait a minute... let's back the performance pony up a bit. Are you saying that your staging table is an exact copy of everything that the target table will eventually have in it??? :blink: I mean it has ALL of the rows whether they are new or updated?
If that's true... then why don't you just do a "rename"... takes, oh, about 65 MILLI-SECONDS to do that... 😉
And, if you think a CLR is going to be more practical or efficient, go ahead... when you're done testing it for performance, we'll get back to doing it the right way 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
November 26, 2007 at 3:49 pm
The number of rows does not change at this stage; I need to score a complete universe. What does change is that I pull a subset of the columns (50-70%) and modify their datatype and/or categorize them in the process. Afterwards, I take the transformed results and use them in model scoring. Those queries take a long time, and when they break, things roll back. I typically use CREATE TABLE followed by INSERT INTO ... SELECT. This at least permits me to see progress, compared to a SELECT INTO statement.
So, no, I am not duplicating data. The scoring script essentially weighs values via CASE expressions and sums those results into one value. Picture a hundred CASE expressions with plus signs connecting them into a single outcome (roughly sketched below).
All I want is some trick to avoid the lengthy rollback. I don't need it. I can clean up the malformed results myself.
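A hypothetical illustration of that scoring pattern, with made-up column names, weights, and target table, following the CREATE TABLE + INSERT INTO ... SELECT approach mentioned above:
-- many weighted CASE expressions summed into a single score
insert into dbo.trans_Scored (xid, model_score)
select  xid,
          (case when C1 > 100       then 0.25  else 0 end)
        + (case when C2 = 'A'       then 1.50  else 0 end)
        + (case when C3 is not null then -0.75 else 0 end)
        -- ...roughly a hundred more weighted CASE expressions...
from    dbo.trans_Part;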
November 27, 2007 at 2:20 am
The only trick available to avoid lengthy rollbacks is to not have lengthy transactions.
So what you're looking for is some way of slicing the total work up into many smaller transactions, and making those slices as lean and efficient as possible.
You might want to check out what can be gained with SELECT INTO instead of CREATE TABLE followed by INSERT (see the sketch below). The gain, if any, is in the logging, since SELECT INTO can be minimally logged, i.e. only the allocated extents are logged instead of the entire rows, as INSERT will log. If volumes are large, this may be one area that can be tuned.
/Kenneth
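A minimal sketch of the SELECT INTO alternative, assuming the database is in the SIMPLE or BULK_LOGGED recovery model (a prerequisite for minimal logging in SQL 2005); table and column names are placeholders:
-- creates and loads the target in one minimally logged operation,
-- instead of CREATE TABLE followed by a fully logged INSERT ... SELECT
select  C1, C2 /* ...remaining columns... */
into    dbo.trans_Part_New      -- placeholder target name
from    dbo.trans_Source        -- placeholder source name
where   xid between 1 and 5000;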
November 27, 2007 at 8:07 pm
What does change is that I pull a subset of the columns (50-70%) and modify their datatype and/or...
That would be part of the reason why things take so long... why is it necessary to change the datatype? It also opens a source of rollbacks if the data happens to be incompatible with the changed datatype.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 28, 2007 at 4:57 pm
noeld (11/26/2007)
Not sure, but maybe this is what you want?
use <db_name>;   -- database name lost in the original post
go
declare @vMin int, @vMax int, @vTotal int;
set @vTotal = 142607203;
set @vMin = 1;
set @vMax = 5000;
while @vMin <= @vTotal   -- @vMin, so the final partial batch is not skipped
begin
    BEGIN TRANSACTION
    insert into dbo.trans_Part
    select C1 ... c128
    from <source_table> with (nolock, index(iXID))   -- source table name lost in the original post
    where xid between @vMin and @vMax;
    -- at most one 5,000-row slice is ever rolled back
    IF @@ERROR <> 0
        ROLLBACK TRANSACTION
    ELSE
        COMMIT TRANSACTION
    set @vMin = @vMax + 1;
    set @vMax = @vMax + 5000;
end
I would definitely follow Jeff's bcp advice if you can...
This works best if the index you're using is the clustered index... otherwise you will be doing multiple table scans to find all of the data (which will be VERY inefficient). Also, 5000 is usually too small a chunk for most processes (it's going to slow the process down somewhat). I'd start around 50,000 and perhaps work up from there to find the ideal size for your process.
I understand that data cleanup can sometimes be a problem, but what exactly is causing the rollback? Is there no way to isolate what the issue is?
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?