SQLServerCentral Article

An in-depth look at change detection in SQL Server - Part 02

,

All enterprise class products have at least one requirement that requires the application to be able to identify changed records. These requirements may revolve around the following uses:

  • Auditing
  • Detecting checksum issues for custom replication conflict resolution
  • Tamper detection - detecting content changes/manipulation
  • Handling concurrency issues

The question therefore is: What are the various change detection and tamper protection mechanisms available within SQL Server?

Microsoft SQL Server provides about four (4) methods that may be used by teams for identification of modified records. While the requirement will dictate which of these functionalities need to be used, I have often seen teams getting confused between these options, which ultimately results in an incomplete realization of the requirement. The four functionalities are:

  • ROWVERSION
  • CHECKSUM
  • BINARY_CHECKSUM
  • HASHBYTES

We have looked at the first two change detection mechanisms in Part 1 of this two part series. In this part, we will be looking at the remaining two methods (BINARY_CHECKSUM and HASHBYTES), and finally come up with a comparative analysis/study of all the four.

BINARY_CHECKSUM

The BINARY_CHECKSUM system function is an improvement to CHECKSUM and performs a binary evaluation of the data. BINARY_CHECKSUM computes the checksum based on this binary representation of the data. Therefore, the BINARY_CHECKSUM promises to be a dramatic improvement over CHECKSUM.

We know that CHECKSUM can fail in the following scenarios:

  1. Detecting case changes on a case-insensitive changes
  2. CHECKSUM cannot compare data across collations because different collations result in different CHECKSUM values
  3. CHECKSUM may fail to detect a symmetric change

Let us therefore test if BINARY_CHECKSUM can overcome the limitations of CHECKSUM:

BINARY_CHECKSUM is case-sensitive

To prove that BINARY_CHECKSUM is sensitive to case changes, let’s perform the following test. The tests are similar to those we performed for CHECKSUM in that we will be creating a temporary table with data columns and compute the BINARY_CHECKSUM over these.

USE tempdb;
GO
SET NOCOUNT ON;
--Create a test table
CREATE TABLE #BinaryCheckSumDemo 
      (Id INT IDENTITY (1,1),
       EmployeeName NVARCHAR(50),
       CityName NVARCHAR(50),
       BinaryCheckSum AS BINARY_CHECKSUM(EmployeeName, CityName)
      );
GO
--Insert some test data
INSERT INTO #BinaryCheckSumDemo 
       (EmployeeName, CityName)
VALUES ('John','Boston'),
       ('Tim','Paris'),
       ('Robert','New York'),
       ('Jim','Orlando');
GO
SELECT * FROM #BinaryCheckSumDemo;
GO

The BINARY_CHECKSUM values for the rows currently available in the table are given below:

BINARY_CHECKSUM is case-sensitive
Id EmployeeName CityName BinaryCheckSum
1 John Boston 67427454
2 Tim Paris 5458739
3 Robert New York -691938033
4 Jim Orlando 1213118463

Now, let us update the value of CityName for one of the records such that the case of one of the values changes.

--Try changing the case of the value and
--see if the BINARY_CHECKSUM value changes
UPDATE #BinaryCheckSumDemo
SET CityName = UPPER(CityName)
WHERE EmployeeName = 'Tim';
GO
SELECT * FROM #BinaryCheckSumDemo;
GO
Identifying record changes using BINARY_CHECKSUM - Case sensitivity
Id EmployeeName CityName BinaryCheckSum
1 John Boston 67427454
2 Tim PARIS 5335315
3 Robert New York -691938033
4 Jim Orlando 1213118463

As can be seen above highlighted in green, the BINARY_CHECKSUM computation for the updated record has been updated. This is because the binary representation of "PARIS" and "paris" is different, resulting in the different BINARY_CHECKSUM value. 

IMPORTANT: BINARY_CHECKSUM is able to detect case changes in a case insensitive database.

BINARY_CHECKSUM is Collation In-sensitive

One of the main drawbacks in case of CHECKSUM is the fact that it is collation sensitive. If the same data is stored using two different collations, CHECKSUM would incorrectly flag them as being different. If the data is the same, the checksum values should also be the same. Therefore, let us recreate this scenario for BINARY_CHECKSUM and study the behaviour:

USE tempdb;
GO
--Create a test table to study collation sensitivity of CHECKSUM
CREATE TABLE #BinaryChecksumCollationSensivity 
      (SourceKey INT IDENTITY(1,1),
       CaseInsensitiveColumn VARCHAR(20) 
                             COLLATE SQL_Latin1_General_CP1_CI_AS,
       CaseSensitiveColumn VARCHAR(20)
                           COLLATE SQL_Latin1_General_CP1_CS_AS,
       BinaryChecksumInsensitive AS BINARY_CHECKSUM(CaseInsensitiveColumn),
       BinaryChecksumSensitive AS BINARY_CHECKSUM(CaseSensitiveColumn)
      );
GO
--Insert some test data, make sure that both values have the same data
INSERT INTO #BinaryChecksumCollationSensivity
       (CaseInsensitiveColumn, CaseSensitiveColumn)
VALUES ('Microsoft', 'Microsoft' ),
       ('Windows', 'Windows' ), 
       ('Sql Server', 'Sql Server'),
       (NULL, NULL ),
       ('', '' );
GO
--Fetch the data from the test table
SELECT * FROM #BinaryChecksumCollationSensivity;
GO

As can be seen from the results table shown below, the BINARY_CHECKSUM values for both collations (case sensitive and the case insensitive) is the same.

Identifying record changes using BINARY_CHECKSUM - Collation insensitivity
SourceKey

CaseInsensitive

Column

CaseSensitive

Column

BinaryChecksum

Insensitive

BinaryChecksum

Sensitive

1 Microsoft Microsoft -196585121 -196585121
2 Windows Windows 1375217667 1375217667
3 Sql Server Sql Server -447589019 -447589019
4 NULL NULL 2147483647 2147483647
5 0 0

IMPORTANT: BINARY_CHECKSUM is not dependent upon the collation case-sensitivity of the database.

BINARY_CHECKSUM may fail to detect symmetric changes

Just when BINARY_CHECKSUM was appearing to be an ideal replacement for CHECKSUM, we run into another road-block. For the same reasons as CHECKSUM, the BINARY_CHECKSUM may fail to detect symmetric changes.

To confirm, let us repeat the test we did for CHECKSUM:

USE tempdb;
GO
SET NOCOUNT ON;
--Symmetric values fail CHECKSUM test
SELECT BINARY_CHECKSUM('Nakul') AS 'Before Change',
       BINARY_CHECKSUM('NANANANANANANANANakul') AS 'After Change';
GO

The result we get is shown below and confirms that BINARY_CHECKSUM was unable to detect a symmectic change:

Identifying record changes using BINARY_CHECKSUM - fails to detect symmetric changes
BeforeChange AfterChange
4750396 4750396

IMPORTANT: BINARY_CHECKSUM fails to detect symmetric changes.

BINARY_CHECKSUM is sensitive to the expression evaluation order

If the application has a very limited possibility of symmetrical changes and the application can afford to fail in detecting these changes once in a blue moon, then BINARY_CHECKSUM is a vast improvement over CHECKSUM.

Like any good tamper protection algorithm, BINARY_CHECKSUM also maintains sensitivity around expression evaluation order because of the very nature of the mathematics around computing a CHECKSUM.

Here’s a quick test:

USE tempdb;
GO
SET NOCOUNT ON;
--Create a test table
CREATE TABLE #BinaryCheckSumDemo 
      (Id INT IDENTITY (1,1),
       EmployeeName NVARCHAR(50),
       CityName NVARCHAR(50),
       DataBinaryCheckSum1 AS BINARY_CHECKSUM(EmployeeName, CityName),
       DataBinaryCheckSum2 AS BINARY_CHECKSUM(CityName, EmployeeName)
      );
GO
--Insert some test data
INSERT INTO #BinaryCheckSumDemo
       (EmployeeName, CityName)
VALUES ('John','Boston'),
       ('Tim','London'),
       ('Robert','New York'),
       ('Jim','Orlando');
GO
SELECT * FROM #BinaryCheckSumDemo;
GO
Identifying record changes using BINARY_CHECKSUM - sensitive to expression evaluation order
Id EmployeeName CityName DataBinaryCheckSum1 DataBinaryCheckSum2
1 John Boston 67427454 1149296654
2 Tim London 78425422 1251506973
3 Robert New York -691938033 538363564
4 Jim Orlando 1213118463 -2069377527

As can be seen in the table above, the BINARY_CHECKSUM values do not match for any evaluation sequence making it a great candidate for an ideal change detection mechanism.

IMPORTANT: In summary, BINARY_CHECKSUM improves a lot upon CHECKSUM in that it is a much advanced change-detection and tamper protection mechanism. However, there is room for improvement.

An improvement over both the CHECKSUM algorithms is achieved by using the HASHBYTES function.

HASHBYTES

The HASHBYTES system function returns a hash value based on the supplied input. This hash value does not follow a checksum algorithm, and therefore is immune from the problems faced by CHECKSUM and BINARY_CHECKSUM. The hash value generated by HASHBYTES can use one of the many supported encryption algorithms - MD2, MD4, MD5, SHA, SHA1, or SHA2.

The HASHBYTES function carries over all the desirable characteristics from BINARY_CHECKSUM, which are:

  1. The HASHBYTES function is able to detect data changes
  2. HASHBYTES can distinguish between the “false positives” and genuine updates
  3. HASHBYTES is case-sensitive
  4. HASHBYTES is collation insensitive
  5. HASHBYTES is sensitive to expression evaluation order

The only point where BINARY_CHECKSUM falls short is:

  • Detecting changes due to symmetric changes

In order to establish that HASHBYTES is indeed supreme, all we need to test for is the symmetric update test.

HASHBYTES can detect symmetric changes

To check whether or not HASHBYTES can detect symmetric changes, let us repeat the same test as we did for BINARY_CHECKSUM, this time however, we will use the HASHBYTES function with the MD5 encryption.

USE tempdb;
GO
SET NOCOUNT ON;
--Symmetric values fail CHECKSUM test
SELECT HASHBYTES('MD5','Nakul') AS 'Before Change',
       HASHBYTES('MD5','NANANANANANANANANakul') AS 'After Change';
GO

As seen in the results below, the HASHBYTES value for both data values are different. This is because the HASHBYTES is an encrypted form of the data - if the data is different, the encrypted result has to be different.

Using HASHBYTES to detect symmetric changes
BeforeChange AfterChange
0x67944B79EB69F8F7F16FA9240ABE80FC 0xADD6C6B9FC9D7BF0EDDA9DCA2DE70022

IMPORTANT: The HASHBYTES therefore is a “complete” change detection & tamper protection solution.

Storage considerations

In return for offering a complete and "tamper-proof" solution, HASHBYTES comes with a cost in terms of increased storage requirements.

Let us run the following query that creates a table with all known types of change detection mechanisms and attempts to identify the associated data-types:

USE tempdb;
GO
SET NOCOUNT ON;
--Create a test table
CREATE TABLE CheckSumDemo 
       (Id INT IDENTITY (1,1),
        EmployeeName NVARCHAR(50),
        CityName NVARCHAR(50),
        DataVersion ROWVERSION,
        DataCheckSum AS CHECKSUM(EmployeeName, CityName),
        DataBinaryCheckSum AS BINARY_CHECKSUM(EmployeeName, CityName),
        DataHashBytes AS HASHBYTES('MD5', (EmployeeName + CityName))
       );
GO
--Check for associated data-types
SELECT OBJECT_NAME(sc.object_id),
       sc.name AS ColumnName,
       st.name AS Datatype,
       sc.max_length AS ColumnLength,
       sc.collation_name AS Collation,
       sc.is_computed AS IsColumnComputed
FROM sys.columns AS sc 
LEFT OUTER JOIN sys.types AS st ON sc.user_type_id = st.user_type_id
WHERE sc.object_id = OBJECT_ID('CheckSumDemo')
ORDER BY sc.column_id;
GO

The result is shown below. As can be seen from the results table, CHECKSUMs occupy the least space, whereas HASHBYTES occupies the most.

Storage considerations for ROWVERSION, CHECKSUM, BINARY_CHECKSUM and HASHBYTES
TableName ColumnName DataType

Column

Length

Collation

IsColumn

Computed

CheckSumDemo Id int 4 NULL 0
CheckSumDemo EmployeeName nvarchar 100 SQL_Latin1_General_CP1_CI_AS 0
CheckSumDemo CityName nvarchar 100 SQL_Latin1_General_CP1_CI_AS 0
CheckSumDemo DataVersion timestamp 8 NULL 0
CheckSumDemo DataCheckSum int 4 NULL 1
CheckSumDemo DataBinaryCheckSum int 4 NULL 1
CheckSumDemo DataHashBytes varbinary 8000 NULL 1

The result obtained is quite interesting, and we can deduce the following facts:

  1. The ROWVERSION is not a computed column – the values are actually persisted within the database
  2. From a storage standpoint, both CHECKSUM and BINARY_CHECKSUM are the same
  3. HASHBYTES consume a comparatively large disk space – 8000 bytes worth of it!

Summary

Let us take a moment to compare the 4 known mechanisms for change detection and tamper protection within Microsoft SQL Server.

A comparison of the various change detection and tamper protection mechanisms available in Microsoft SQL Server
ROWVERSION CHECKSUM BINARY_CHECKSUM HASHBYTES
Concurrency Detection Yes Yes Yes Yes
Change Detection No Basic Yes Yes
Tamper Protection No No Basic Yes
Storage/Data-type Timestamp INT INT VARBINARY
Size Requirements 8 bytes 4 bytes 4 bytes 8000 bytes
Best Option Yes
"Value for Money" Yes   Yes

PLEASE NOTE: Timestamp is deprecated. The fact that ROWVERSION continues to be seen as a “timestamp” in the sys.types listing has been reported under Connect cases:

This comparison shows us that if the goal is simply to implement concurrency protection, ROWVERSION is the best and only choice. However, for change detection and basic tamper protection, BINARY_CHECKSUM should be used.

While HASHBYTES offers the best of all worlds, it comes at a cost – storage. Therefore, HASHBYTES should only be used if data accuracy is of utmost importance.

If you want to read more, Part 1 is here.

Food for thought

  • The change detection functions can be used to detect and delete duplicate data
  • If the goal for any system is basic change detection, BINARY_CHECKSUM offers the best value for money
  • ROWVERSION can be used for concurrency detection and therefore can be used to implement optimistic concurrency protection (An implementation of ROWVERSION for optimistic concurrency detection can be found on my blog at: http://bit.ly/VpEwbH)

Disclaimer

  • The scripts in this article are provided "as-is", i.e. without warranties of any kind and are intended for demonstration purposes only
  • Request you to use these scripts for understanding and study purposes in your development environments only - they are not meant for use in production. The author is not responsible for any damage caused by misuse of these scripts
  • Scripts are tested on SQL 2008 R2
  • The opinions expressed herein are his own personal opinions and do not represent his employer’s view in anyway

About the author

Nakul Vachhrajani is a Technical Specialist and systems development professional with iGATE having a total IT experience of more than 8 years. Nakul is an active blogger with BeyondRelational.com (230+ blogs), and can also be found on forums at SQLServerCentral and BeyondRelational.com. Nakul has also been a guest columnist for SQLAuthority.com and SQLServerCentral.com. Nakul presented a webcast on the “Underappreciated Features of Microsoft SQL Server” at the Microsoft Virtual Tech Days Exclusive Webcast series (May 02-06, 2011) on May 06, 2011. He is also the author of a research paper on Database upgrade methodologies, which was published in a Computer Society of India (CSI) journal, published nationwide. In addition to his passion about SQL Server, Nakul also contributes to the academia out of personal interest. He visits various colleges and universities as an external faculty to judge project activities being carried out by the students.

Blog | LinkedIn | Twitter | Google+

Rate

4.92 (13)

You rated this post out of 5. Change rating

Share

Share

Rate

4.92 (13)

You rated this post out of 5. Change rating