September 16, 2015 at 6:12 am
samot-dwarf (9/16/2015)
Even if he used the correct order, he must not use CHECKSUM, because it is not case sensitive (if you are using the default collation for your DB):
SELECT CHECKSUM('ABC') chk_upper, CHECKSUM('abc') chk_lower,
       BINARY_CHECKSUM('ABC') bin_chk_upper, BINARY_CHECKSUM('abc') bin_chk_lower,
       HASHBYTES('MD5', 'ABC') hash_upper, HASHBYTES('MD5', 'abc') hash_lower
will return
chk_upper:     1132495864
chk_lower:     1132495864
bin_chk_upper: 17763
bin_chk_lower: 26435
hash_upper:    0x902FBDD2B1DF0C4F70B4A5D23525E932
hash_lower:    0x900150983CD24FB0D6963F7D28E17F72
So if he uses CHECKSUM, the users will not be able to fix casing errors (e.g. Usa -> USA)
The updates would still happen, but they would be less efficient. Here's what happens:
1. Get the data.
2. Let the user edit the data (or not).
3. Check the checksum.
4. If the checksum is different, data has changed, so update the database.
5. If the checksum is the same, data may or may not have changed. Check the data against the database to be sure (this is the round-trip).
6. If the data has changed, update the database.
So as long as the second check is case sensitive, casing errors can be fixed.
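A minimal sketch of what that second, case-sensitive check could look like, assuming a hypothetical dbo.Customers table with CustomerID and Country columns (all object names and values here are invented for illustration):

-- The checksum said "unchanged", so re-check against the database with a
-- case-sensitive comparison before deciding to skip the update.
DECLARE @CustomerID int = 42;                 -- hypothetical key value
DECLARE @EditedCountry varchar(50) = 'USA';   -- value coming back from the UI

UPDATE dbo.Customers
SET Country = @EditedCountry
WHERE CustomerID = @CustomerID
  -- COLLATE forces a case-sensitive comparison even on a case-insensitive database,
  -- so 'Usa' -> 'USA' still counts as a change
  AND Country COLLATE Latin1_General_CS_AS <> @EditedCountry;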
September 16, 2015 at 9:29 am
FYI - some interesting Checksum behavior.
declare @str varchar(8000) = '';
declare @i int = 0;
while @i < 256
begin
    set @i = @i + 1;
    set @str = @str + 'a'; -- also try 'A'
    -- PRINT takes a single expression, so concatenate the string and its checksum
    print @str + ' : ' + cast(checksum(@str) as varchar(20));
end;
You will see that the checksums repeat every 16 characters, i.e. the checksum for one 'a', for 17 'a's, for 33 'a's, etc. is the same.
Change 'a' to 'A' and you get the same values - ergo, a change that only affected capitalization will not be detected.
Using NVARCHAR simply causes the patterns to repeat every 32 characters instead of 16.
Bottom line: CHECKSUM should only be used for what it is designed to do, which is to generate integer hash keys. Use HASHBYTES to detect version changes in data if you are counting on that detection working.
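To make that concrete, here is a small hedged example (variable names are invented) contrasting the two functions on a capitalization-only edit; on a case-insensitive default collation the CHECKSUM values come back equal while the HASHBYTES values differ:

DECLARE @before varchar(100) = 'alpha';
DECLARE @after  varchar(100) = 'Alpha';   -- capitalization-only change

SELECT CHECKSUM(@before)         AS chk_before,   -- equal to chk_after on a
       CHECKSUM(@after)          AS chk_after,    -- case-insensitive collation
       HASHBYTES('MD5', @before) AS hash_before,  -- differs from hash_after
       HASHBYTES('MD5', @after)  AS hash_after;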
September 16, 2015 at 11:55 am
sknox (9/16/2015)
samot-dwarf (9/16/2015)
Even if he used the correct order, he must not use CHECKSUM, because it is not case sensitive (if you are using the default collation for your DB):
SELECT CHECKSUM('ABC') chk_upper, CHECKSUM('abc') chk_lower,
       BINARY_CHECKSUM('ABC') bin_chk_upper, BINARY_CHECKSUM('abc') bin_chk_lower,
       HASHBYTES('MD5', 'ABC') hash_upper, HASHBYTES('MD5', 'abc') hash_lower
will return
chk_upper:     1132495864
chk_lower:     1132495864
bin_chk_upper: 17763
bin_chk_lower: 26435
hash_upper:    0x902FBDD2B1DF0C4F70B4A5D23525E932
hash_lower:    0x900150983CD24FB0D6963F7D28E17F72
So if he uses CHECKSUM, the users will not be able to fix casing errors (e.g. Usa -> USA)
The updates would still happen, but they would be less efficient. Here's what happens:
1. Get the data.
2. Let the user edit the data (or not).
3. Check the checksum.
4. If the checksum is different, data has changed, so update the database.
5. If the checksum is the same, data may or may not have changed. Check the data against the database to be sure (this is the round-trip).
Interesting. That's not how I read "avoid round-trip". To me it means not having to check the database to see whether things have changed when you are not sure. Reading it your way, however, I agree that it is a cheap way to avoid some round trips (i.e. when the checksum has changed).
6. If the data has changed, update the database.
So as long as the second check is case sensitive, casing errors can be fixed.
Gerald Britton, Pluralsight courses
September 16, 2015 at 3:55 pm
The way this works in most applications
- I get data from the db, including a checksum
- I present the data to a user, who may or may not change something
- I calculate a checksum against the data, compare it to the one from the db
- If the checksum is the same, I don't go to the db to check. Right or wrong, I don't.
- If the checksum is different, then I update the table, with a new round trip. This may or may not be needed, but I update things.
This allows me to reduce round trips, not eliminate them 100% or require them 100%. The idea behind many of our frameworks is not to be perfect, but to reduce our work.
This is not about avoiding work by not checking. It's also not about trying to ensure that I never make a round trip I could avoid. The entire premise of hashing is that I can't be 100% sure that I don't have a false positive or false negative. I can get close, but it is possible that I will have an issue. In a real application, I'd use this with any volume of data to avoid round trips. However, I'd have a way for the user to "force" an update, regardless of any hashing algorithm.
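For what it's worth, a rough T-SQL sketch of that read / compare / write pattern, assuming a hypothetical dbo.Customers table (table, column, and variable names are made up; in practice the comparison would normally live in the application layer):

-- 1. Read the row plus a checksum over the editable columns
SELECT CustomerID, Name, Country,
       CHECKSUM(Name, Country) AS RowCheck   -- column order and collation matter
FROM dbo.Customers
WHERE CustomerID = 42;

-- 2. After the edit, recompute the checksum over the values the app sends back
DECLARE @NewName varchar(100) = 'Acme Ltd',
        @NewCountry varchar(50) = 'USA',
        @OriginalRowCheck int = 123456789;   -- the RowCheck read in step 1

-- 3. Only round-trip the UPDATE when the checksums differ
IF CHECKSUM(@NewName, @NewCountry) <> @OriginalRowCheck
    UPDATE dbo.Customers
    SET Name = @NewName, Country = @NewCountry
    WHERE CustomerID = 42;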
I agree this question is sketchy, because it shows a poor programming practice with CHECKSUM. This brings to mind the problems of trying to write a good question. The intention was to show issues with hashing, not bugs in programming.
September 16, 2015 at 5:54 pm
Steve Jones - SSC Editor (9/16/2015)
The way this works in most applications
- I get data from the db, including a checksum
- I present the data to a user, who may or may not change something
- I calculate a checksum against the data, compare it to the one from the db
- If the checksum is the same, I don't go to the db to check. Right or wrong, I don't.
- If the checksum is different, then I update the table, with a new round trip. This may or may not be needed, but I update things.
This allows me to reduce round trips, not eliminate them 100% or require them 100%. The idea behind many of our frameworks is not to be perfect, but to reduce our work.
This is not about avoiding work by not checking. It's also not about trying to ensure that I never make a round trip I could avoid. The entire premise of hashing is that I can't be 100% sure that I don't have a false positive or false negative. I can get close, but it is possible that I will have an issue. In a real application, I'd use this with any volume of data to avoid round trips. However, I'd have a way for the user to "force" an update, regardless of any hashing algorithm.
I agree this question is sketchy, because it shows a poor programming practice with CHECKSUM. This brings to mind the problems of trying to write a good question. The intention was to show issues with hashing, not bugs in programming.
- If the checksum is the same, I don't go to the db to check. Right or wrong, I don't.
Really? You would discard a potential update to the data?
In every industry I have worked in, that attitude would get you fired.
If the user's changed the data, they had a reason and it had better be saved to the database.
A different checksum is proof that the data has changed, so update away, as no other information is needed.
Even when the checksum is computed properly, an equal checksum is not proof of anything and requires more information. Not checking at that point is pure negligence.
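To underline that point with an example already shown earlier in the thread: on a case-insensitive default collation the two values below produce the same CHECKSUM even though the data clearly differs, so an equal checksum on its own proves nothing:

SELECT CHECKSUM('Usa') AS chk_mixed,   -- same value as chk_upper on a
       CHECKSUM('USA') AS chk_upper;   -- case-insensitive collation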
September 16, 2015 at 8:26 pm
Steve Jones - SSC Editor (9/16/2015)
- I get data from the db, including a checksum
- I present the data to a user, who may or may not change something
- I calculate a checksum against the data, compare it to the one from the db
- If the checksum is the same, I don't go to the db to check. Right or wrong, I don't.
- If the checksum is different, then I update the table, with a new round trip. This may or may not be needed, but I update things.
I'd prefer, instead of the checksum, to have the User's APP hold a "before" copy of the data (columns) and compare that, rather than the checksum. I suppose that might not be a good idea if the data being handed to the user to edit is "massive".
Preventing updates to unchanged data client-side is an interesting idea though; in our case, if the user calls up a record to edit and then presses SAVE, we save it ... however, the trigger would then reject it as unchanged (because a real change also generates an audit record, and we don't want the bloat). But that policy has brought moans from the users:
"Why can't I save an unchanged record?"
"Why would you want to?"
"Because it sets the Modified Date which then triggers a transfer to RemoteSystemX which is always dropping loads of rows in the overnight transfer, and if I could just get the Modified Date tickled I could send it the row(s) again. Right now I have to add 'X' to the description, SAVE, and then re-edit to remove the 'X'"
😎
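A hedged sketch of that kind of "skip unchanged rows" trigger, assuming a hypothetical dbo.Customers table; a production version would compare every column (and handle NULLs) rather than lean on BINARY_CHECKSUM, which is cheap but not collision-proof:

CREATE TRIGGER trg_Customers_SkipNoChange
ON dbo.Customers
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Apply the update only for rows where something actually changed,
    -- so unchanged saves neither touch ModifiedDate nor generate audit bloat.
    UPDATE c
    SET c.Name = i.Name,
        c.Country = i.Country,
        c.ModifiedDate = SYSDATETIME()
    FROM dbo.Customers AS c
    JOIN inserted AS i ON i.CustomerID = c.CustomerID
    JOIN deleted  AS d ON d.CustomerID = i.CustomerID
    WHERE BINARY_CHECKSUM(i.Name, i.Country) <> BINARY_CHECKSUM(d.Name, d.Country);
END;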
September 16, 2015 at 10:46 pm
A more common scenario would be something like this, given the facts that:
a) Inserts and updates are relatively inexpensive
b) Data comparison is relatively expensive
c) Good enough algorithms are relatively inexpensive
Method for high collision hash algorithms (e.g. CHECKSUM)
1. Load / insert data into a database
2. Create checksum/hash
3. Receive delta data
4. Create checksum/hash for the delta
5. Load / insert / update deltas with a new hash value
6. Compare the data for deltas with an identical hash value
Method for low collision hash algorithms (e.g. MD5, SHA1, Tiger)
1. Load / insert data into a database
2. Create checksum/hash
3. Receive delta data
4. Create checksum/hash for the delta
5. Load / insert / update deltas with a new hash value
6. Ignore deltas with an identical hash value
The theoretical collision rate for algorithms such as MD5 and SHA1 is somewhere in the range of 1 in 2^60, or roughly 1 in 1.15x10^18. Compared to other risks, such as using NOLOCK or READ UNCOMMITTED, it is truly low enough to be ignored.
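A hedged sketch of the low-collision variant (steps 5 and 6), assuming a hypothetical staging table of incoming deltas and a target table that stores the hash with each row; all names are invented for illustration:

-- Step 4: hash the incoming delta rows (SHA1 via HASHBYTES, per the post)
UPDATE s
SET RowHash = HASHBYTES('SHA1', CONCAT(s.Name, '|', s.Country))
FROM dbo.StagingCustomers AS s;

-- Steps 5/6: insert new rows, update rows whose hash differs,
-- and silently ignore deltas whose hash is identical
MERGE dbo.Customers AS t
USING dbo.StagingCustomers AS s
    ON t.CustomerID = s.CustomerID
WHEN MATCHED AND t.RowHash <> s.RowHash THEN
    UPDATE SET t.Name = s.Name, t.Country = s.Country, t.RowHash = s.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, Country, RowHash)
    VALUES (s.CustomerID, s.Name, s.Country, s.RowHash);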
😎
A few years back I was working on a system that relied heavily on hashes for change detection, and found that SHA1 is one of the faster algorithms available through the HASHBYTES function in SQL Server (2012).
Good write-up by Thomas Kejser: Exploring Hash Functions In SQL Server
September 17, 2015 at 2:20 am
Steve Jones - SSC Editor (9/16/2015)
I agree this question is sketchy, because it shows a poor programming practice with CHECKSUM. This brings to mind the problems of trying to write a good question. The intention was to show issues with hashing, not bugs in programming.
I don't think the question is sketchy because CHECKSUM is used. I think you would have proven your point perfectly that using a hashing algorithm is not always (or always not) 100% waterproof at detecting changes to data.
For me, the sketchy part was using the incorrect order of the columns, which makes the CHECKSUM useless, and the use of the word "avoid", which I didn't interpret as "the application will not make a round trip".
Having said all this, I am quite sure that everyone who has read this post is aware of this hashing issue now. 😀
September 27, 2015 at 5:22 am
Eirikur Eiriksson (9/15/2015)
robin 66943 (9/15/2015)
This makes no sense to me. If the checksum is calculated differently, then when the checksums match, that will most likely imply the data is different, not the same, so a round trip will be needed. Or maybe I'm missing something, in which case can anyone explain this?
Don't think you are missing anything here! Without detailed knowledge of column order and column values in the checksum calculation, no assumption can be made on equality of the values, hence one has to do the round-trip anyway to check those, ergo the statement is false.
😎
Yes, I agree. It's a great pity that this is sitting there uncorrected.
I also agree with the comment that a developer that would do that is not a useful employee.
Tom
September 27, 2015 at 7:22 am
Eirikur Eiriksson (9/16/2015)
A more common scenario would be something like this, given the facts that:
a) Inserts and updates are relatively inexpensive
b) Data comparison is relatively expensive
c) Good enough algorithms are relatively inexpensive
Method for high collision hash algorithms (e.g. CHECKSUM)
1. Load / insert data into a database
2. Create checksum/hash
3. Receive delta data
4. Create checksum/hash for the delta
5. Load / insert / update deltas with a new hash value
6. Compare the data for deltas with an identical hash value
Method for low collision hash algorithms (e.g. MD5, SHA1, Tiger)
1. Load / insert data into a database
2. Create checksum/hash
3. Receive delta data
4. Create checksum/hash for the delta
5. Load / insert / update deltas with a new hash value
6. Ignore deltas with an identical hash value
The theoretical collision rate for algorithms such as MD5 and SHA1 is somewhere in the range of 1 in 2^60, or roughly 1 in 1.15x10^18. Compared to other risks, such as using NOLOCK or READ UNCOMMITTED, it is truly low enough to be ignored.
😎
A few years back I was working on a system that relied heavily on hashes for change detection, and found that SHA1 is one of the faster algorithms available through the HASHBYTES function in SQL Server (2012).
Good write-up by Thomas Kejser: Exploring Hash Functions In SQL Server
Things change quite quickly in the hashing world; NIST added SHA2 with 256, 384, and 512 bit output to the Secure Hash Standard in 2002 - quite a long time ago - and although this was aimed at security rather than at reducing random collisions, the increased hash length leads to vastly lower chances of random collision than SHA1 has. A 224 bit version was added to the standard a couple of years later. In the 512 bit variant of SHA2 the chance of two random strings having the same hash is 10^-154, as opposed to SHA1's 10^-48 (your 10^18 comes from SHA1's cryptographic strength, reduced from 80 bits to 61 bits by known breaking techniques, which is irrelevant in the current - noncryptographic - context). So maybe it would be better to use SHA2 with 512 bits if the probability of random collisions has to be very low.
NIST added SHA3 to the standard last month; but the idea of that was to have a very different style of hash so that maybe cryptographic attacks that are effective for SHA2 won't be effective for SHA3 and vice versa. It doesn't add any extra hash length, so doesn't affect the random collision rate.
I wonder when (or if) SQL or T-SQL will add SHA2 support. Given the failure to support the modern float standard and get rid of the representation problem I don't expect anything to happen any time soon.
Tom