Advice on BINARY_CHECKSUM or Alternatives

Question

Advice on BINARY_CHECKSUM or Alternatives

lek103

SSC-Addicted

Points: 432
More actions
February 6, 2012 at 2:18 pm

#251033

Hi:
I inherited a process that compares about 90 fields from an import staging table to the corresponding fields in our production database. We determine which records have updates and plug them into a change audit table. The process currently has a very, very long WHERE statement that looks like this:
WHERE
Staging.Column1 <> Prod.Column1
OR Staging.Column2 <> Prod.Column2
OR Staging.Column3 <> Prod.Column3
...and so on for all 90 fields. I had what I thought was the brilliant idea of using a BINARY_CHECKSUM comparison to clean up the process a bit, but I didn't take into account the fact that all of the fields in the staging table are varchar, which is not the case in the production table. The differing data types end up resulting in differing BINARY_CHECKSUM values even if the actual values of the fields are the same.
My question is this: Do you think I should use a CONVERT on every field in my BINARY_CHECKSUM to ensure that the data types are the same? Or would it be better to just keep the giant string of ORs and forget BINARY_CHECKSUM? Or is there another alternative I haven't tried that would work better?
Thanks in advance!

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Lowell SSC Guru Points: 323518 More actions · Answer 1

can you create a view on top of the staging table tat would type the data into the same matching types as the production data?

then you could binary checksum the view to compare agaisnt production, right?

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

lek103 SSC-Addicted Points: 432 More actions · Answer 2

Oh, good call! A view didn't even occur to me for some reason. Thanks! 🙂

Jason-299789 SSC-Insane Points: 21601 More actions · Answer 3

Binary checksums are ok but they have thier limitations, asan Example if a Decimal number changes by a factor of 10m, eg 100.00 to 10.00 you will get the same checksum.

Proof

Declare @T1 as Decimal(18,6)

Declare @T2 as Decimal(18,6)

Set @T1=100.00

Set @T2=@T1/10

Print @T1

Print Binary_Checksum(@T1)

Print @T2

Print Binary_Checksum(@T2)

To show that it isnt the same if the Datatype is an Int

Declare @T3 as int

Declare @T4 as int

Set @T3=100

Set @T4=@T3/10

Print @T3

Print Binary_Checksum(@T3)

Print @T4

Print Binary_Checksum(@T4)

You might also want to look at HASHBYTES but it has additional overheads.

_________________________________________________________________________
SSC Guide to Posting and Best Practices

lek103 SSC-Addicted Points: 432 More actions · Answer 4

Thanks for the info! I did do some research into the limitations of BINARY_CHECKSUM, and I ended up landing on it because the data I'm dealing with won't be anything that requires a great degree of mathematical precision or calculation. It's really just demographic data with a few money fields.

I'm actually thinking that instead of building the view on the staging table, I might build it on the production table so I can make everything varchar and compare that way. I know comparing strings is less efficient than comparing integers, but this approach would have two advantages: One, I wouldn't have to go back and change the view if a production field's datatype happened to change (which is something over which I have no control); and two, it would eliminate any issues like the example you provided. What do you think? This is my first time doing something like this, so I want to make sure my reasoning is sound.

Jason-299789 SSC-Insane Points: 21601 More actions · Answer 5

Have you thought about creating a calculated column using the BINARY_CHECKSUM on the table with to save on creating an additional view.

_________________________________________________________________________
SSC Guide to Posting and Best Practices

matzer SSCrazy Eights Points: 9727 More actions · Answer 6

As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows. Some code to play with:

DECLARE @Stag TABLE (Column1 VARCHAR(100), Column2 VARCHAR(100), Column3 VARCHAR(100))

DECLARE @Prod TABLE (Column1 INT, Column2 DATETIME, Column3 VARCHAR(100))

INSERT INTO @Stag

VALUES

('1', '2012-12-01', 'Some Value'),

('2', '2011-11-06', 'Some Text'),

('3', '2012-01-01', 'More text too'),

('4', '2012-07-31', 'Any value'),

('5', '2012-08-30', 'Any Value')

INSERT INTO @Prod

VALUES

(1, '2012-12-01', 'Some Value'),

(2, '2011-11-05', 'Some Text'),

(3, '2012-01-01', 'More text'),

(4, '2012-07-31', 'Any value'),

(6, '2012-07-31', 'Text')

SELECT

CASE

WHEN [@Stag].Column1 IS NULL THEN 'Delete'

WHEN [@Prod].Column1 IS NULL THEN 'Insert'

WHEN NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) THEN 'Update'

END AS Op,

*

FROM

@Stag

FULL OUTER JOIN

@Prod ON [@Stag].Column1 = [@Prod].Column1

Edit: also note that BINARY_CHECKSUM is not always realible. From BOL:

BINARY_CHECKSUM(*) will return a different value for most, but not all, changes to the row, and can be used to detect most row modifications.

drew.allen SSC Guru Points: 76988 More actions · Answer 7

Peter Brinkhaus (2/7/2012)
As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.

Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Paul White SSC Guru Points: 150467 More actions · Answer 8

drew.allen (2/8/2012)
Peter Brinkhaus (2/7/2012)
As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.
Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.

Different semantic and results.

The NOT EXISTS ( ... INTERSECT ...) pattern is designed to entirely replace the long lists of expressions like (column1 <> column2) OR (column1 IS NULL and column2 IS NOT NULL) OR (column1 IS NOT NULL AND column2 IS NULL). See http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

drew.allen SSC Guru Points: 76988 More actions · Answer 9

SQL Kiwi (2/8/2012)
drew.allen (2/8/2012)
Peter Brinkhaus (2/7/2012)
As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.
Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.
Different semantic and results.

I don't see it. As far as I can tell these two queries produce exactly the same results and the execution plans are almost identical.

SELECT

CASE

WHEN [@Stag].Column1 IS NULL THEN 'Delete'

WHEN [@Prod].Column1 IS NULL THEN 'Insert'

WHEN NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) THEN 'Update'

END AS Op,

*

FROM

@Stag

FULL OUTER JOIN

@Prod ON [@Stag].Column1 = [@Prod].Column1

SELECT

CASE

WHEN [@Stag].Column1 IS NULL THEN 'Delete'

WHEN [@Prod].Column1 IS NULL THEN 'Insert'

WHEN EXISTS (SELECT [@Stag].* EXCEPT SELECT [@Prod].*) THEN 'Update'

END AS Op,

*

FROM

@Stag

FULL OUTER JOIN

@Prod ON [@Stag].Column1 = [@Prod].Column1

Admittedly, I haven't studied the execution plan in detail to see exactly how different they are, but a cursory overview indicates that EXCEPT has one fewer filters on it.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Paul White SSC Guru Points: 150467 More actions · Answer 10

drew.allen (2/8/2012)
I don't see it. As far as I can tell these two queries produce exactly the same results and the execution plans are almost identical.

I read your original brief statement as suggesting the whole thing could be replaced by EXCEPT. Now you have added the detail, your meaning is clear.

Admittedly, I haven't studied the execution plan in detail to see exactly how different they are, but a cursory overview indicates that EXCEPT has one fewer filters on it.

GBN from dba.se asked the same question recently on my blog. The answer I gave was:

In my experience the INTERSECT form tends to give produce 'better' plans, because the optimizer more readily simplifies the logic into a simple anti-semi-join.

Using the test script, I do get a slightly different plan for the EXCEPT form (an extra anti-semi-join, constant scan, and a start-up filter). This may be version-dependent (I used R2 build 2789).

Finally, I personally find the INTERSECT form to be more intuitive, but I can quite see that others might prefer to think of the problem the other way around.

Those comments refer to the example on my blog, not here, of course. The apparently more complex plan for INTERSECT in Peter's example is a consequence of the mismatched types. If the data types for @Stag and @Prod match, the usual INTERSECT plan appears.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi