February 6, 2012 at 2:18 pm
Hi:
I inherited a process that compares about 90 fields from an import staging table to the corresponding fields in our production database. We determine which records have updates and plug them into a change audit table. The process currently has a very, very long WHERE statement that looks like this:
WHERE
Staging.Column1 <> Prod.Column1
OR Staging.Column2 <> Prod.Column2
OR Staging.Column3 <> Prod.Column3
...and so on for all 90 fields. I had what I thought was the brilliant idea of using a BINARY_CHECKSUM comparison to clean up the process a bit, but I didn't take into account the fact that all of the fields in the staging table are varchar, which is not the case in the production table. The differing data types end up resulting in differing BINARY_CHECKSUM values even if the actual values of the fields are the same.
My question is this: Do you think I should use a CONVERT on every field in my BINARY_CHECKSUM to ensure that the data types are the same? Or would it be better to just keep the giant string of ORs and forget BINARY_CHECKSUM? Or is there another alternative I haven't tried that would work better?
Thanks in advance!
February 6, 2012 at 2:35 pm
can you create a view on top of the staging table tat would type the data into the same matching types as the production data?
then you could binary checksum the view to compare agaisnt production, right?
Lowell
February 7, 2012 at 6:48 am
Oh, good call! A view didn't even occur to me for some reason. Thanks! 🙂
February 7, 2012 at 7:01 am
Binary checksums are ok but they have thier limitations, asan Example if a Decimal number changes by a factor of 10m, eg 100.00 to 10.00 you will get the same checksum.
Proof
Declare @T1 as Decimal(18,6)
Declare @T2 as Decimal(18,6)
Set @T1=100.00
Set @T2=@T1/10
Print @T1
Print Binary_Checksum(@T1)
Print @T2
Print Binary_Checksum(@T2)
To show that it isnt the same if the Datatype is an Int
Declare @T3 as int
Declare @T4 as int
Set @T3=100
Set @T4=@T3/10
Print @T3
Print Binary_Checksum(@T3)
Print @T4
Print Binary_Checksum(@T4)
You might also want to look at HASHBYTES but it has additional overheads.
_________________________________________________________________________
SSC Guide to Posting and Best Practices
February 7, 2012 at 7:42 am
Thanks for the info! I did do some research into the limitations of BINARY_CHECKSUM, and I ended up landing on it because the data I'm dealing with won't be anything that requires a great degree of mathematical precision or calculation. It's really just demographic data with a few money fields.
I'm actually thinking that instead of building the view on the staging table, I might build it on the production table so I can make everything varchar and compare that way. I know comparing strings is less efficient than comparing integers, but this approach would have two advantages: One, I wouldn't have to go back and change the view if a production field's datatype happened to change (which is something over which I have no control); and two, it would eliminate any issues like the example you provided. What do you think? This is my first time doing something like this, so I want to make sure my reasoning is sound.
February 7, 2012 at 7:45 am
Have you thought about creating a calculated column using the BINARY_CHECKSUM on the table with to save on creating an additional view.
_________________________________________________________________________
SSC Guide to Posting and Best Practices
February 7, 2012 at 9:14 am
As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows. Some code to play with:
DECLARE @Stag TABLE (Column1 VARCHAR(100), Column2 VARCHAR(100), Column3 VARCHAR(100))
DECLARE @Prod TABLE (Column1 INT, Column2 DATETIME, Column3 VARCHAR(100))
INSERT INTO @Stag
VALUES
('1', '2012-12-01', 'Some Value'),
('2', '2011-11-06', 'Some Text'),
('3', '2012-01-01', 'More text too'),
('4', '2012-07-31', 'Any value'),
('5', '2012-08-30', 'Any Value')
INSERT INTO @Prod
VALUES
(1, '2012-12-01', 'Some Value'),
(2, '2011-11-05', 'Some Text'),
(3, '2012-01-01', 'More text'),
(4, '2012-07-31', 'Any value'),
(6, '2012-07-31', 'Text')
SELECT
CASE
WHEN [@Stag].Column1 IS NULL THEN 'Delete'
WHEN [@Prod].Column1 IS NULL THEN 'Insert'
WHEN NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) THEN 'Update'
END AS Op,
*
FROM
@Stag
FULL OUTER JOIN
@Prod ON [@Stag].Column1 = [@Prod].Column1
Edit: also note that BINARY_CHECKSUM is not always realible. From BOL:
BINARY_CHECKSUM(*) will return a different value for most, but not all, changes to the row, and can be used to detect most row modifications.
February 8, 2012 at 7:22 am
Peter Brinkhaus (2/7/2012)
As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.
Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 8, 2012 at 8:33 am
drew.allen (2/8/2012)
Peter Brinkhaus (2/7/2012)
As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.
Different semantic and results.
The NOT EXISTS ( ... INTERSECT ...) pattern is designed to entirely replace the long lists of expressions like (column1 <> column2) OR (column1 IS NULL and column2 IS NOT NULL) OR (column1 IS NOT NULL AND column2 IS NULL). See http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
February 8, 2012 at 9:32 am
SQL Kiwi (2/8/2012)
drew.allen (2/8/2012)
Peter Brinkhaus (2/7/2012)
As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.
Different semantic and results.
I don't see it. As far as I can tell these two queries produce exactly the same results and the execution plans are almost identical.
SELECT
CASE
WHEN [@Stag].Column1 IS NULL THEN 'Delete'
WHEN [@Prod].Column1 IS NULL THEN 'Insert'
WHEN NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) THEN 'Update'
END AS Op,
*
FROM
@Stag
FULL OUTER JOIN
@Prod ON [@Stag].Column1 = [@Prod].Column1
SELECT
CASE
WHEN [@Stag].Column1 IS NULL THEN 'Delete'
WHEN [@Prod].Column1 IS NULL THEN 'Insert'
WHEN EXISTS (SELECT [@Stag].* EXCEPT SELECT [@Prod].*) THEN 'Update'
END AS Op,
*
FROM
@Stag
FULL OUTER JOIN
@Prod ON [@Stag].Column1 = [@Prod].Column1
Admittedly, I haven't studied the execution plan in detail to see exactly how different they are, but a cursory overview indicates that EXCEPT has one fewer filters on it.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
February 8, 2012 at 9:37 am
drew.allen (2/8/2012)
I don't see it. As far as I can tell these two queries produce exactly the same results and the execution plans are almost identical.
I read your original brief statement as suggesting the whole thing could be replaced by EXCEPT. Now you have added the detail, your meaning is clear.
Admittedly, I haven't studied the execution plan in detail to see exactly how different they are, but a cursory overview indicates that EXCEPT has one fewer filters on it.
GBN from dba.se asked the same question recently on my blog. The answer I gave was:
In my experience the INTERSECT form tends to give produce 'better' plans, because the optimizer more readily simplifies the logic into a simple anti-semi-join.
Using the test script, I do get a slightly different plan for the EXCEPT form (an extra anti-semi-join, constant scan, and a start-up filter). This may be version-dependent (I used R2 build 2789).
Finally, I personally find the INTERSECT form to be more intuitive, but I can quite see that others might prefer to think of the problem the other way around.
Those comments refer to the example on my blog, not here, of course. The apparently more complex plan for INTERSECT in Peter's example is a consequence of the mismatched types. If the data types for @Stag and @Prod match, the usual INTERSECT plan appears.
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply