Checksum function???

  • Hi

    Can anyone guide me on using a checksum to compare millions of records between source and target?

    I have to update a dimension if something changes, and I can't use a lookup as it takes a lot of time.

    If anyone has ever used the CHECKSUM() function, can you guide me on how to use it?

    Any help is appreciated.

    thanks

  • any help?

  • Here is my two cents: don't use it. It is not guaranteed to identify all changes to your data, because it is possible for different combinations of field values to produce the same checksum. In my opinion, your best way to identify changes is a column-by-column comparison.

  • You can use a checksum to narrow down the search. It can make a good leading edge for what would otherwise be a multi-column index.

    Checksum isn't exact, but it can reduce the number of rows that need to be compared, which can reduce I/O on large data sets.

    What you do is work out what columns you are most likely to need to compare (might be all of them, might be a subset), and generate checksums for them. Persist and index those. Then, join the source table and target table on the checksum columns, as well as other criteria.
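
    For example, a minimal sketch of that pattern (table and column names here are invented for illustration):

    -- Persist a checksum over the tracked columns and index it
    ALTER TABLE dbo.TargetDim
        ADD RowChecksum AS CHECKSUM(Col1, Col2, Col3);
    CREATE INDEX IX_TargetDim_RowChecksum ON dbo.TargetDim (RowChecksum);

    -- Join on the key and compare checksums to find candidate changes;
    -- checksum matches should still be verified column by column
    SELECT s.BusinessKey
    FROM dbo.SourceStage AS s
    INNER JOIN dbo.TargetDim AS t
        ON t.BusinessKey = s.BusinessKey
    WHERE CHECKSUM(s.Col1, s.Col2, s.Col3) <> t.RowChecksum;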

    I've done speed tests with this in very large tables (one was half a petabyte for one table with hundreds of millions of rows), and it can result in much faster returns and much less I/O, when used on multi-column seeks/compares.

    That's the general idea. Are you looking for specifics? If so, please clarify what specifics you need help on.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Gus, did you use CHECKSUM or BINARY_CHECKSUM? If I were going this direction, I would look at the latter. Of course, I don't have the volume of data you are talking about either.

  • I'm pretty sure it was checksum, not binary, since I didn't care about differences in case, which is the primary difference between the two.

    select checksum('A'),checksum('a'), binary_checksum('A'), binary_checksum('a');

    Results:

    142 142 65 97

    With what I was doing, case differences didn't matter, only actual content differences. That's the decision that determines which you should use. Of course, it's also critical that you use the same one on both sides of the equation.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • How wide is the dimension, and are we talking type 1 or type 2 SCD columns?

  • Thanks a lot GSquared.

    From what I understand, I will have to add a checksum column for each column that may change, in both places:

    source as well as target.

    But my source is on a different server (DB2), so I will have to stage it first, then add checksums for the 10 or so changing columns, and similarly add 10 checksum columns in my target and compare on them. Is this how checksum works?

    Actually my source (DB2) has 7 million records, so I can't do a row-by-row comparison (it takes more than an hour).

    Please let me know where to add the columns: in the source staging table or in the target?

    Thanks

  • Hi there,

    Not sure if this will help you, but on a project I'm working on we decided to go with an MD5 hash computed via a script task in our data flow (there are a bunch of ways to do this; that's the one we went with because we could easily switch to another scheme). The hash is saved in the dimension. When we get rows from the operational system, we compare the hash and, based on that, decide whether an update, insert, or no action is needed.
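
    If you'd rather stay in T-SQL than use a script task, the same idea can be sketched with HASHBYTES (available in SQL 2005+; column names here are invented):

    -- Delimit the columns so ('ab','c') and ('a','bc') don't hash alike
    SELECT BusinessKey,
           HASHBYTES('MD5',
               ISNULL(Col1, '') + '|' +
               ISNULL(Col2, '') + '|' +
               ISNULL(CONVERT(varchar(30), Col3), '')) AS RowHash
    FROM dbo.SourceStage;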

    The thing is that with checksums, hashes, etc. you can have collisions; you just have to assess how high the risk is and whether you can afford one. One could argue that it's the same with GUIDs: while a collision is highly improbable, it's not impossible. If you can't afford that, then you're better off with either a column-by-column comparison or an extra column in your source data that tells you when a row was last changed.

    Hope this helps,

    Greg

  • SQL Learner (8/28/2009)


    Thanks a lot GSquared.

    From what I understand, I will have to add a checksum column for each column that may change ...

    source as well as target...

    You need only a single checksum per row, based on all the columns whose values you are tracking for changes, e.g.

    select checksum('A','B','Z')

    34391

    That's how you get the speed benefit: you need only compare two columns (the PK and the checksum) when looking for changes, rather than all of the columns.
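
    As a rough sketch (invented names), the Type 1 update driven by the checksum might then look like this, assuming the target has a computed RowChecksum column over the same columns:

    UPDATE t
    SET t.Col1 = s.Col1,
        t.Col2 = s.Col2,
        t.Col3 = s.Col3
    FROM dbo.TargetDim AS t
    INNER JOIN dbo.SourceStage AS s
        ON s.BusinessKey = t.BusinessKey
    WHERE CHECKSUM(s.Col1, s.Col2, s.Col3) <> t.RowChecksum;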

    Note also Gus' comments about Checksum and Binary_Checksum

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • I did this recently to determine if a row needed to be updated. I put a checksum column in the table using the values that could possibly be updated. The stored procedure had a checksum variable based on the input variables. The input variables for the proc are also the DB columns that could possibly be updated.

    By comparing the DB checksum calculation with the stored procedure checksum variable I knew whether the row needed updating or not.
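
    A bare-bones sketch of that pattern (procedure, table, and column names are made up; assumes the table has a computed column RowChecksum AS CHECKSUM(Name, City)):

    CREATE PROCEDURE dbo.UpdateCustomerIfChanged
        @CustomerID int,
        @Name       varchar(100),
        @City       varchar(100)
    AS
    BEGIN
        UPDATE dbo.Customer
        SET Name = @Name,
            City = @City
        WHERE CustomerID = @CustomerID
          AND RowChecksum <> CHECKSUM(@Name, @City);  -- only touch the row if something changed
    END;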

  • Mark Pratt (9/1/2009)


    I did this recently to determine if a row needed to be updated. I put a checksum column in the table using the values that could possibly be updated. The stored procedure had a checksum variable based on the input variables. The input variables for the proc are also the DB columns that could possibly be updated.

    By comparing the DB checksum calculation with the stored procedure checksum variable I knew whether the row needed updating or not.

    Makes sense.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Mark Pratt (9/1/2009)


    I did this recently to determine if a row needed to be updated. I put a checksum column in the table using the values that could possibly be updated. The stored procedure had a checksum variable based on the input variables. The input variables for the proc are also the DB columns that could possibly be updated.

    By comparing the DB checksum calculation with the stored procedure checksum variable I knew whether the row needed updating or not.

    Hi,

    Yep, this is exactly the point I was trying to make. Note that this is very easily doable using a script component in a data flow, and from what we've tried so far, it goes pretty fast.

    Regards,

    Greg

  • If you are using SQL 2005, and the columns you are interested in are varchar, nvarchar, or varbinary, you can try making a computed column out of the HashBytes function. There are a few limits, one being that the total input size cannot be greater than 8K (I think this was fixed in either SP3 or in 2008). I also believe it is deterministic, so you can put an index on it. Use the SHA1 algorithm, as the chance of collision is about 1 in 2^80.
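
    For instance, something along these lines (names invented; the CONVERT keeps the column at SHA1's 20 bytes so it indexes comfortably):

    ALTER TABLE dbo.SourceStage
        ADD NameHash AS CONVERT(varbinary(20), HASHBYTES('SHA1', CustomerName));
    CREATE INDEX IX_SourceStage_NameHash ON dbo.SourceStage (NameHash);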

    Josef Richberg
    2009 Exceptional DBA
    http://www.josef-richberg.squarespace.com
    http://twitter.com/sqlrunner

  • Is there any sort of date column in the source table that could show you which rows have been updated? If you could do something to avoid comparing all 7 million rows, that would help.
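
    For example, with a hypothetical LastModified column, the extract could be limited to rows changed since the last load:

    SELECT *
    FROM dbo.SourceStage
    WHERE LastModified >= @LastLoadTime;  -- @LastLoadTime saved from the previous run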

    Also, have you tried the SCD transformation? Between that and proper indexing on your dimension, you should still get decent performance out of the compare.

    How much free rein do you have on the source system? Would it be possible to make changes there to aid in making this easier?

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
