July 29, 2003 at 7:07 am
Hi,
For our data warehouse we bulk insert files into 'temp tables' that are a 1:1 representation of the OLTP system. The next step is delta processing between those tables and the ODS. This is done in three steps (within one stored procedure): 1) updates, 2) inserts, 3) deletes (which are actually updates).
Step 1 is currently coded as a cursor. I would like to rewrite it as a statement without cursors. Any clues? Thanks in advance.
Simplified example of the current code (step 1, the updates):
(there are, for instance, 7 'key columns' and 150 'data columns'). Is it possible to simplify the declaration / variable usage in the cursor by using some kind of record structure?
declare @key1 char(10), @key2 char(10), @key3 char(10), @data1 char(10), @data2 char(10), @data3 char(10)
--these types are only an example of course 😉
declare @processing_date datetime --assumed to be set elsewhere, e.g. a procedure parameter
declare @commitcount integer
set @commitcount = 0
declare [delta-inserts] cursor fast_forward for
select t.key1, t.key2, t.key3, t.data1, t.data2, t.data3
from temptable t, odstable o
where
t.key1 = o.key1 and
t.key2 = o.key2 and
t.key3 = o.key3 and (
(isnull(t.data1,0) <> isnull(o.data1,0)) or
(isnull(t.data2,0) <> isnull(o.data2,0)) or
(isnull(t.data3,0) <> isnull(o.data3,0)) )
open [delta-inserts]
fetch next from [delta-inserts] into @key1, @key2, @key3, @data1, @data2, @data3
begin transaction insertings
while @@fetch_status=0
begin
update odstable
set data1 = @data1,
data2 = @data2,
data3 = @data3,
modify_date = @processing_date,
modify_flag = 'U'
where
key1 = @key1 and
key2 = @key2 and
key3 = @key3
set @commitcount = @commitcount + 1
if @commitcount = 10000
begin
set @commitcount = 0
commit transaction insertings
begin transaction insertings --reopen so later updates and the final commit stay paired
end
fetch next from [delta-inserts] into @key1, @key2, @key3, @data1, @data2, @data3
end
close [delta-inserts]
deallocate [delta-inserts]
commit transaction insertings
---
Thanks a lot!
July 29, 2003 at 8:16 am
Gee, I love killing cursors.
Suggestion: use the modern JOIN syntax for readability.
UPDATE o
SET
o.data1 = t.data1
, o.data2 = t.data2
, o.data3 = t.data3
, o.modify_date = @processing_date
, o.modify_flag = 'U'
FROM odstable o
INNER JOIN temptable t
ON t.key1 = o.key1
AND t.key2 = o.key2
AND t.key3 = o.key3
WHERE
(
ISNULL(o.data1, 0) <> ISNULL(t.data1, 0)
OR ISNULL(o.data2, 0) <> ISNULL(t.data2, 0)
OR ISNULL(o.data3, 0) <> ISNULL(t.data3, 0)
)
Edited by - jpipes on 07/29/2003 08:24:03 AM
July 30, 2003 at 1:23 am
The WHERE clause can be rewritten as:
...
WHERE CHECKSUM(o.data1, o.data2, o.data3) <> CHECKSUM(t.data1, t.data2, t.data3)
The CHECKSUM creates a CRC-style hash value.
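Applied to jpipes' statement above, the whole update collapses to a single comparison (a sketch; because CHECKSUM is a hash, two different rows can occasionally collide, so a rare change could slip through undetected):
UPDATE o
SET
o.data1 = t.data1
, o.data2 = t.data2
, o.data3 = t.data3
, o.modify_date = @processing_date
, o.modify_flag = 'U'
FROM odstable o
INNER JOIN temptable t
ON t.key1 = o.key1
AND t.key2 = o.key2
AND t.key3 = o.key3
-- one hash comparison replaces the long OR-list of ISNULL checks
WHERE CHECKSUM(o.data1, o.data2, o.data3) <> CHECKSUM(t.data1, t.data2, t.data3)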
July 30, 2003 at 3:04 am
I've been doing exactly the same over the last couple of days - trying to replace a cursor-based update process with a set-based one.
So far I am not impressed - the set-based process is no quicker than the cursor. In fact it will probably end up slower, as the cursor-based processing does a number of other things that I haven't yet put into the set-based version. Any ideas?
Input file: 42 columns. 8 columns require date validation, 3 columns require currency conversion. 280,000 rows (but can be over 3 million). One column is calculated from another column.
Master table: 9 million rows (Same process used on tables of 25 million rows).
Cursor method:
1. Load file into temp table
2. Declare Cursor
3. Loop for each row doing:
- date validation for 8 columns
- currency conversion for 3 columns
- count PK in master table - if exists update else insert
4. Fetch next
5. Tidy up
Set-based process:
1. Load file into temp table
2. Create indexes (found that it improves performance)
3. UPDATE master
SET master.col_values = temp.col_values
WHERE master.PrimaryKey = temp.PrimaryKey
4. INSERT INTO master
SELECT temp.col_values FROM temp
WHERE NOT EXISTS ...
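In full T-SQL, steps 3 and 4 look roughly like this (simplified, with placeholder column names):
-- step 3: update rows that already exist in the master table
UPDATE m
SET m.col1 = t.col1,
m.col2 = t.col2
FROM master m
INNER JOIN #temp t ON t.PrimaryKey = m.PrimaryKey
-- step 4: insert rows not yet in the master table
INSERT INTO master (PrimaryKey, col1, col2)
SELECT t.PrimaryKey, t.col1, t.col2
FROM #temp t
WHERE NOT EXISTS (SELECT 1 FROM master m WHERE m.PrimaryKey = t.PrimaryKey)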
For the date validations and the currency conversions in the set-based method, I am using user-defined functions.
From what I have read on this site, I thought that set-based processing was supposed to be much better (i.e. quicker) than cursor-based processing. If this is true, what am I doing wrong?
Suggestions welcome.
Jeremy
July 30, 2003 at 6:52 am
quote:
The WHERE clause can be rewritten as:...
WHERE CHECKSUM(o.data1, o.data2, o.data3) <> CHECKSUM(t.data1, t.data2, t.data3)
The CHECKSUM creates a CRC-style hash value.
Nice. Very nice.
July 30, 2003 at 7:00 am
Jeremy,
There may be some things in your SQL that we could look at that might speed things up. But aside from that, speed is not the only problem with CURSORS; they can also have serious memory overhead that a SET-based solution usually doesn't. That said, there ARE times when a cursor will run faster, and there ARE times when a set-based solution will run faster. A lot of it is about predicting the future as well (I know it sounds corny). If you know ahead of time that the procedure you just implemented with a cursor will be run every minute on a production system with thousands of users, you're heading for trouble. But if the process is a one-time run, or something run off-hours or on small data sets, CURSORS generally won't present a problem. That said, I have been bitten numerous times by cursors implemented in procedures that weren't intended to be run all the time, but "it just turned out that they do"... If you want to post some code, we'll take a look.
July 30, 2003 at 7:53 am
Jay,
This is the code for the set based update statement:
declare @row int, @increment int, @max_row int, @start datetime, @update_count int
set @start = getdate()
set @row = 1
set @increment = 5000
set @update_count = 0
set rowcount 0
select @max_row = max(rec_id) from tempdb..clm02_new --loop up to the highest rec_id in the load table
while @row <= @max_row begin
update [claims_2002] with (tablockx)
set mdl_yr = left(col001,4), vin_cd = col002, sls_cntr = col003, veh_line = col004, plant_cd = col005, maint_dt = dbo.fn_AwsDateCheck(col006), prodn_dt = dbo.fn_AwsDateCheck(col007), wrty_sta = dbo.fn_AwsDateCheck(col008), orig_wrt = dbo.fn_AwsDateCheck(col009), wcc_cd = col011, tot_cost = dbo.fn_AwsClaimLocalCurrencyCost(col026,col012)
,lbr_cost = dbo.fn_AwsClaimLocalCurrencyCost(col026,col013), mtrl_cos = dbo.fn_AwsClaimLocalCurrencyCost(col026,col014), prt_num0 = col015, prt_num_ = col016, prt_num1 = col017, cpsc_6_c = col018, clm_cnt_ = col019, veh_cnt_ = col020
,sls_regi = col021, load_dt = dbo.fn_AwsDateCheck(col022), fcc_auth = col023, rpr_dt = dbo.fn_AwsDateCheck(col024), cust_con = col025, rpr_cntr = col026, milge = col027, mkt_deri = col028, adj_amt = col029, adj_rsn_ = col030
,cond_cd = col031, fcc_bin_ = col032, fcc_milg = col033, fcc_regi = col034, fcc_rpt_ = col035, fcc_term = col036, lbr_hrs = col037, prt_mrku = col038, prt_typ_ = col039, rpr_dlr_ = col040
,rpr_dlr0 = col041, doc_num = col042, funds_ty = dbo.fn_AwsClaimCurrency(col026)
from tempdb..clm02_new
where [claims_2002].clm_key = tempdb..clm02_new.col010
and tempdb..clm02_new.rec_id between @row and (@row + @increment)
set @update_count = @update_count + @@rowcount
set @row = @row + @increment + 1
end
print @update_count
print datediff(second,@start,getdate())
print @update_count/nullif(datediff(second,@start,getdate()),0) --rows per second; nullif guards against divide-by-zero
Functions:
/* Converts US$ cost into local currency cost for AWS claims */
CREATE FUNCTION [dbo].[fn_AwsClaimLocalCurrencyCost] (@country_repaired varchar(25), @usd_cost decimal (25,2) )
RETURNS decimal (25,2) AS
BEGIN
return ( select @usd_cost * funds_fa
from claims_fundrep_new
where rpr_cntr = @country_repaired
)
END
/* Checks the validity of date fields from AWS downloads */
CREATE FUNCTION [dbo].[fn_AwsDateCheck] (@date varchar(25))
RETURNS varchar(25) AS
BEGIN
declare @output_date varchar(25)
set @output_date = null
-- validate as text before converting; with a smalldatetime parameter the
-- conversion (and any conversion failure) would happen before the check could run
if isdate(@date) = 1
if convert(smalldatetime, @date) > '1950-01-01' set @output_date = @date
return @output_date
END
/* Returns local currency for AWS claims */
CREATE FUNCTION [dbo].[fn_AwsClaimCurrency] (@country_repaired varchar(25) )
RETURNS varchar(25) AS
BEGIN
declare @output_Field varchar(25)
if @country_repaired = 'USA' set @output_field = 'USD'
else select @output_Field = funds_ty
from claims_fundrep_new
where rpr_cntr = @country_repaired
return (@output_field)
END
I'm still working on this so cosmetically it may not be at its best (such as the column names in the temporary tables).
This is a 'weekly' update job although it can run at any time.
Any ideas for improvements?
Jeremy
July 30, 2003 at 8:32 am
Jeremy,
If performance really is too slow, drop the user-defined functions and code everything directly in the SQL statement.
If you want, you can test-drive by updating only the function-derived fields in the temp table. This will give you an idea of where the biggest delay is.
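For example, sketched with your own table and column names (untested): the date UDF becomes a CASE expression and the currency UDFs become a join:
UPDATE c
SET maint_dt = CASE WHEN isdate(t.col006) = 1 AND t.col006 > '1950-01-01'
THEN t.col006 END, -- was dbo.fn_AwsDateCheck; assumes col006 arrives as a sortable yyyy-mm-dd string
tot_cost = t.col012 * f.funds_fa, -- was dbo.fn_AwsClaimLocalCurrencyCost
funds_ty = CASE WHEN t.col026 = 'USA' THEN 'USD' ELSE f.funds_ty END -- was dbo.fn_AwsClaimCurrency
FROM claims_2002 c
INNER JOIN tempdb..clm02_new t ON c.clm_key = t.col010
LEFT JOIN claims_fundrep_new f ON f.rpr_cntr = t.col026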
July 30, 2003 at 10:16 am
I do the same thing in a data warehouse environment, but with a few tweaks.
I load my data files into a "working table" that has the same fields as the data file. I then check for changes, new records, and deleted records in this manner: first I get the changed accounts by loading a temp table with the primary key of the records in the production table that have changed, along with the changed values. I use the method that jpipes pointed out (isnull(tbl1.fld1,0) != isnull(tbl2.fld1,0)), but I like the checksum idea of SeekQuel.
Then I do an "update from", joining the temp table to the production table on the primary key.
Same process for new and deleted records, except the query to find them is a little different: I use a "not in" structure in the where clause. Then I insert the new records and update the deleted records, setting a "flag" on the deleted ones to indicate they are no longer in the source business system.
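For example (a sketch with made-up table and column names):
-- new records: in the working table but not yet in production
INSERT INTO production (pk, col1)
SELECT w.pk, w.col1
FROM working w
WHERE w.pk NOT IN (SELECT pk FROM production)
-- deleted records: in production but gone from the source file;
-- flag them instead of physically deleting
UPDATE production
SET source_deleted_flag = 'Y'
WHERE pk NOT IN (SELECT pk FROM working)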
This runs well for us - about 20 minutes for a 30 million record table checked against a data file of about 1 million records. The server has 2 GB RAM and dual 750 MHz Xeons. We have had similar processes using a cursor and they took hours.
I'm not sure why you need the loop in the code you posted - smaller batches with commits? You could do your currency conversions and date validations when you load the temp table. I would be concerned about the UDFs in the update - I'm not sure how they affect performance.
Hope this helps - I would be glad to post an example.
Thanks,
epurg
July 30, 2003 at 12:03 pm
@jpipes: Thanks a lot! This is exactly what we were looking for. A first test:
two tables (temp and ODS) each containing 1.2 million rows. After changing 47,000 rows in the temp table, I ran the update statement (server: a Pentium III 500 MHz, 384 MB - an old one to test on). It completed in 27 seconds! Comparisons against the cursor version will also be made.
@SeekQuel: Very nice (the checksum). Does this neat little "trick" offer an important performance gain? (Anyway, it is shorter and more readable, as we have tables with 280 columns.)
@jeremy-2 Kemp: In our situation the set-based solution seems very fast. Remember: the set-based statement only works on the rows that result from the joins -
update 1..5% of the data
insert 1..5% of the data
instead of sequentially digging through the whole 100%.
July 30, 2003 at 12:32 pm
hbkdba, Great news! Give us an update when you do a performance comparison on the cursor...
Jeremy,
I have a feeling those UDFs might be the culprit, as NPeeters mentioned. Any chance we could see an execution plan?
July 30, 2003 at 1:06 pm
Is the following statement a valid example of the checksum statement? I've read about the * on MSDN (which does not mention the ability to use table qualifiers such as t. and o.) but am unable to test because I have a couple of days off right now.
UPDATE o
SET
o.data1 = t.data1
, o.data2 = t.data2
, o.data3 = t.data3
, o.modify_date = @processing_date
, o.modify_flag = 'U'
FROM odstable o
INNER JOIN temptable t
ON t.key1 = o.key1
AND t.key2 = o.key2
AND t.key3 = o.key3
WHERE
(
checksum(o.*) <> checksum(t.*)
)
Thanks!
Benchmark tests (cursor vs. set-based) are planned for Thursday next week (after a short vacation 🙂
July 30, 2003 at 1:43 pm
Just to chime in on the UDF issue... I think they are amazingly cool, but should be used in select statements with great caution.
We were working on an application here and the other developer, who comes from a VB/CGI/web-development world, coded a lot of the logic into UDFs, then put the UDFs in the select and where clauses... You have to keep in mind that if the UDF does a select off a large table to return its result, and you then put it into a select or especially a where clause, it has to execute once for every row that the statement is going to return. With caching it might not generate TOO much IO, but if your result set is large and the UDF adds a tiny bit of time to each row, you can drag an application to its knees waiting for it to return.
In our case, the developer's first attempt took 10 to 15 minutes or more to return a few hundred rows from a table of around one million (this was for a web application). I later recoded it without the UDFs and got the entire thing to run in under three seconds. To be fair to UDFs, that wasn't the only problem the stored procedure had. This is an extreme case, but I think it makes the point.
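To illustrate with made-up names (a sketch, not our actual code):
-- per-row version: the UDF's internal SELECT runs once for every returned row
SELECT c.claim_id, dbo.fn_LookupRate(c.country) * c.usd_cost AS local_cost
FROM claims c
-- set-based rewrite: the lookup becomes a join and is evaluated as a set
SELECT c.claim_id, r.rate * c.usd_cost AS local_cost
FROM claims c
INNER JOIN rates r ON r.country = c.country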
Matthew Galbraith
ps: I do use UDFs, but usually only where I can't do it any other way, or as a shortcut when writing one-off ad hoc queries; I eschew them almost completely in where clauses, though I do sometimes use them as 'smart' views in joins, and every once in a while I will put a simple one in the select clause.
July 31, 2003 at 2:20 am
Some timings:
No UDFs (Basic) - 1036 rows/sec
Basic + Currency Conversion UDFs - 770 rows/sec
Basic + Date UDFs - 550 rows/sec
Basic + Currency UDFs + Date UDFs - 480 rows/sec
Basic + Date validation as case statement - 710 rows/sec
Basic + Date validation as case + table join for currency conversion - 880 rows/sec
Yes, the UDFs were killing performance, as several of you suggested. I am surprised by how much they affected performance, especially the date UDFs, given that they are very simple statements.
I've also checked my cursor and confess to not using UDFs there for date validation - I was using a much simpler if statement, so the comparison between cursor processing and set processing is perhaps not fair.
As speed is important, I will recode my update processing to use set processing.
I'm interested in how people use UDFs but I will start another thread for that.
Thanks for all your help.
Jeremy
July 31, 2003 at 4:21 am
quote:
Is the following statement a valid example of the checksum statement? I've read about the * on MSDN (which does not mention the ability to use table qualifiers such as t. and o.) but am unable to test because I have a couple of days off right now.
UPDATE o
SET
o.data1 = t.data1
, o.data2 = t.data2
, o.data3 = t.data3
, o.modify_date = @processing_date
, o.modify_flag = 'U'
FROM odstable o
INNER JOIN temptable t
ON t.key1 = o.key1
AND t.key2 = o.key2
AND t.key3 = o.key3
WHERE
(
checksum(o.*) <> checksum(t.*)
)
That would be a no - * will not work in this case.
And I do have to say, SeekQuel, that is a unique approach. I'll have to use it myself sometime.
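If typing out 280 columns is the objection, you can generate the column list from the catalog and paste it into the CHECKSUM call. A sketch for SQL Server 2000 (the variable-concatenation trick relies on undocumented behaviour, and varchar(8000) caps the output):
-- builds 'col1,col2,...' for a table, ready to paste into CHECKSUM(...)
DECLARE @cols varchar(8000)
SELECT @cols = ISNULL(@cols + ',', '') + name
FROM syscolumns
WHERE id = OBJECT_ID('temptable')
ORDER BY colid
PRINT @cols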