All enterprise class products have at least one requirement that requires the application to be able to identify changed records. These requirements may revolve around any of the following uses:
- Auditing
- Detecting checksum issues for custom replication conflict resolution
- Tamper detection - detecting content changes/manipulation
- Handling concurrency issues
The question therefore is: What are the various change detection and tamper protection mechanisms available within SQL Server?
Microsoft SQL Server provides about four similar functionalities that may be used by teams for identification of modified records. While the requirement will dictate which of these functionalities need to be used, I have often seen teams getting confused between these options, which ultimately results in an incomplete realization of the requirement.
In this article, we will be taking an in-depth look at the first two of these four mechanisms:
- ROWVERSION
- CHECKSUM
The other two mechanisms will be covered in part 2.
ROWVERSION
Let us start with the most popular of all: ROWVERSION. ROWVERSION is a mechanism provided by SQL Server for version-stamping of rows. The term “version-stamping” means that ROWVERSION does not store a date/time stamp, it stores an incrementing number.
Every insert or update statement that gets executed against a SQL Server database increments an internal counter called the database ROWVERSION, provided the table being updated has a ROWVERSION column on it. Each table can have only one ROWVERSION column.
Let’s take a look at how ROWVERSION works. To begin with, let’s create a temporary table, and insert some test data into it. Note that this table has the ROWVERSION column whose values are of most interest to us:
USE tempdb GO SET NOCOUNT ON; --Create a test table CREATE TABLE #RowVersionDemo (Id INT IDENTITY (1,1), EmployeeName NVARCHAR(50), CityName NVARCHAR(50), DataVersion ROWVERSION ); GO --Insert some test data INSERT INTO #RowVersionDemo (EmployeeName, CityName) VALUES ('John','Boston'), ('Tim','London'), ('Robert','New York'), ('Jim','Orlando'); GO SELECT * FROM #RowVersionDemo; GO
The SELECT statement used to take a look at the RowVersion values yields the following result. Notice that the ROWVERSION is an incrementing counter value – the first record has a value of 80B, the next is 80B + 1 = 80C and so on.
Id | EmployeeName | CityName | DataVersion |
---|---|---|---|
1 | John | Boston | 0x000000000000080B |
2 | Tim | London | 0x000000000000080C |
3 | Robert | New York | 0x000000000000080D |
4 | Jim | Orlando | 0x000000000000080E |
Next, let us update one of the records, and then take a look at the value of the DataVersion column.
--Modify some test data, UPDATE #RowVersionDemo SET CityName = 'Paris' WHERE EmployeeName = 'Tim'; GO --Check the RowVersions SELECT * FROM #RowVersionDemo; GO
Id | EmployeeName | CityName | DataVersion |
---|---|---|---|
1 | John | Boston | 0x000000000000080B |
2 | Tim | Paris | 0x000000000000080F |
3 | Robert | New York | 0x000000000000080D |
4 | Jim | Orlando | 0x000000000000080E |
As seen above highlighted in green, the value of the DataVersion column has changed for the updated record. Again, this is a serial change from the last operation. The last record to be inserted/updated had a ROWVERSION of 80E, hence the update has a ROWVERSION of 80E + 1 = 80F. The other records did not undergo a change. Every time a row with a ROWVERSION column is inserted or modified, the value stored by the ROWVERSION column changes.
We now know that ROWVERSION can detect changes. But, what needs to be seen now is whether the ROWVERSION is really a “good” change-detection mechanism. To do so, let us update the column with itself.
--FALSE UPDATE SCENARIO: --Update the record value with itself and --then check the RowVersion UPDATE #RowVersionDemo SET CityName = CityName WHERE EmployeeName = 'Tim'; GO SELECT * FROM #RowVersionDemo; GO
Id | EmployeeName | CityName | DataVersion |
---|---|---|---|
1 | John | Boston | 0x000000000000080B |
2 | Tim | Paris | 0x0000000000000810 |
3 | Robert | New York | 0x000000000000080D |
4 | Jim | Orlando | 0x000000000000080E |
Something is not right in the results shown above. As highlighted in red, the value of the DataVersion column was changed (80F + 1 = 810) for the updated record despite the fact that there is no difference in the data available within the table.
IMPORTANT:
- ROWVERSION fails to identify if the data actually changed. ROWVERSION is therefore, not a good choice for detecting changes to a data row
- ROWVERSION columns are not a good choice for keys, especially primary keys
Why would ROWVERSION not be a good choice for keys? This is because with each update to the data row, the old key is no longer valid, and all foreign keys need to be updated with the new value.
ROWVERSION however, is ideal to identify if someone or some process “touched” a data row when making a data modification. Therefore, ROWVERSION is the way to go if you are looking to implement an optimistic concurrency solution within the product.
CHECKSUM
As seen above, one of the biggest challenge with ROWVERSION is the detection of "actual" data changes v/s the false positives.
CHECKSUM is a system function available with Microsoft SQL Server which returns the checksum value computed over the row of a table, or a list of expressions. A checksum is essentially a hash value computed over the supplied arguments.
With this knowledge, let us run the same tests again, however, this time, we will use a CHECKSUM column instead of a ROWVERSION column. To begin with, let’s create a test temporary table that computes the CHECKSUM:
USE tempdb; GO SET NOCOUNT ON; --Create a test table CREATE TABLE #CheckSumDemo (Id INT IDENTITY (1,1), EmployeeName NVARCHAR(50), CityName NVARCHAR(50), DataCheckSum AS CHECKSUM(EmployeeName, CityName) ); GO --Insert some test data INSERT INTO #CheckSumDemo (EmployeeName, CityName) VALUES ('John','Boston'), ('Tim','London'), ('Robert','New York'), ('Jim','Orlando'); GO SELECT * FROM #CheckSumDemo; GO
The contents of the table can be seen below.
Id | EmployeeName | CityName | DataCheckSum |
---|---|---|---|
1 | John | Boston | -1140789188 |
2 | Tim | London | -1718893255 |
3 | Robert | New York | 1723249590 |
4 | Jim | Orlando | 1075367414 |
In the results shown above, please observe the values in the DataCheckSum column. Unlike RowVersion, there is no relation between these values because they are hash values.
Now, let us modify a record and look at the values of the CHECKSUM. It is essentially the same test as we did for ROWVERSION and so, the modifications are also the same.
--Modify some test data UPDATE #CheckSumDemo SET CityName = 'Paris' WHERE EmployeeName = 'Tim'; GO --Check the Checksums SELECT * FROM #CheckSumDemo; GO
Id | EmployeeName | CityName | DataCheckSum |
---|---|---|---|
1 | John | Boston | -1140789188 |
2 | Tim | Paris | -1522883151 |
3 | Robert | New York | 1723249590 |
4 | Jim | Orlando | 1075367414 |
As highlighted in green in the results shown above, CHECKSUM was able to identify the data modifications. However, the acid test is whether CHECKSUM will be able to identify whether an update is a false update, i.e. a row has been updated with itself?
--FALSE UPDATE: --Update the record value with itself and --then check the Checksum value UPDATE #CheckSumDemo SET CityName = CityName WHERE EmployeeName = 'Tim'; GO SELECT * FROM #CheckSumDemo; GO
Id | EmployeeName | CityName | DataCheckSum |
---|---|---|---|
1 | John | Boston | -1140789188 |
2 | Tim | Paris | -1522883151 |
3 | Robert | New York | 1723249590 |
4 | Jim | Orlando | 1075367414 |
As seen in the results above, the green highlight is used to indicate that CHECKSUM was actually able to distinguish between an actual data change and a false positive.
IMPORTANT: CHECKSUMS therefore, can be used to identify data change.
This is very useful in a scenario where one needs to compare a replicating subscriber with the data published at the publisher. CHECKSUMs can easily tell us whether a record available at the subscriber is same as that on the publisher or is the publisher having a newer copy which due to multiple other reasons out of scope for the current discussion, has not made its way to the subscriber.
CHECKSUM is Case In-sensitive
The question now comes about the nature of data changes that CHECKSUM is able to identify. To do this let us perform another update. This time we will change the case of the value being updated. Please note that on my test system, the tempdb is not case-sensitive:
--FALSE UPDATE: --Update a record value with a different case and --then check the Checksum value UPDATE #CheckSumDemo SET CityName = UPPER(CityName) WHERE EmployeeName = 'Tim'; GO SELECT * FROM #CheckSumDemo; GO
Id | EmployeeName | CityName | DataCheckSum |
---|---|---|---|
1 | John | Boston | -1140789188 |
2 | Tim | PARIS | -1522883151 |
3 | Robert | New York | 1723249590 |
4 | Jim | Orlando | 1075367414 |
As seen above highlighted in red, the value of DataCheckSum column did not change indicating that CHECKSUM is not case sensitive.
CHECKSUM is collation sensitive
CHECKSUM failed to detect a change in case because CHECKSUM is collation sensitive. As verification of this fact, the same data stored under a different collation should return a different CHECKSUM value.
To test this theory, we will create a temporary table with two columns - one using a case sensitive collation (CS) and the other using a case insensitive collation (CI). Columns to hold CHECKSUM values are also created against each column. To make two columns use different collations within the same table, the DDL uses the COLLATE keyword followed by the required collation name.
The test data set used in this test has been chosen to cover most common user scenarios. We have a combination of user data, followed by NULL data (i.e. undefined data) and a set of empty strings.
USE tempdb; GO --Create a test table to study collation sensitivity of CHECKSUM CREATE TABLE #ChecksumCollationSensivity (SourceKey INT IDENTITY(1,1), CaseInsensitiveColumn VARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS, CaseSensitiveColumn VARCHAR(20) COLLATE SQL_Latin1_General_CP1_CS_AS, ChecksumInsensitive AS CHECKSUM(CaseInsensitiveColumn), ChecksumSensitive AS CHECKSUM(CaseSensitiveColumn) ); GO --Insert some test data, make sure that both values have the same data INSERT INTO #ChecksumCollationSensivity (CaseInsensitiveColumn, CaseSensitiveColumn) VALUES ('Microsoft', 'Microsoft' ), ('Windows', 'Windows' ), ('Sql Server', 'Sql Server'), (NULL, NULL ), ('', '' ); GO --Fetch the data from the test table SELECT * FROM #ChecksumCollationSensivity; GO
Now let us take a look at the data available in the test table:
SourceKey | CaseInsensitiveColumn | CaseSensitiveColumn | ChecksumInsensitive | ChecksumSensitive |
---|---|---|---|---|
1 | Microsoft | Microsoft | -110717694 | -394820589 |
2 | Windows | Windows | -1376614688 | -1125935136 |
3 | Sql Server | Sql Server | -1164226520 | -1198886856 |
4 | NULL | NULL | 2147483647 | 2147483647 |
5 | 0 | 0 |
In the results shown above, note that although the data in both columns (the case sensitive and insensitive collations) is the same, the CHECKSUM values for all records are different – except the empty string and the undefined NULL. Because the string is empty there is no data, and therefore CHECKSUM yields a 0. Because a NULL is an undefined value, the CHECKSUM is the same irrespective of the collation.
IMPORTANT: CHECKSUM is able to detect a data-change within the confines of the rules determined by the collation defined for the database/object.
CHECKSUM may fail to detect a change
Till now, we have established that CHECKSUM is able to distinguish between data updates v/s false positives and is case-insensitive, collation sensitive. However, CHECKSUM faces another issue – there are chances that within the same collation, CHECKSUM may fail to detect a change. We know that if the changes are symmetric, then mathematically the CHECKSUM of these values may have the cumulative effect of “cancelling” each other out. Let’s take a small example:
USE tempdb; GO SET NOCOUNT ON; --Symmetric values fail CHECKSUM test SELECT CHECKSUM('Nakul') AS 'Before Change', CHECKSUM('NANANANANANANANANakul') AS 'After Change'; GO
Checking the CHECKSUM values yields the following results:
Before Change | After Change |
---|---|
9663161 | 9663161 |
As you can see above, CHECKSUM failed to detect the data change – the data is different, but the nature of the change tricked the algorithm. With this observation, let us reword the conclusion from the previous section:
IMPORTANT: CHECKSUM is able to detect most data-changes within the confines of the rules determined by the collation defined for the database/object.
CHECKSUM is sensitive to the order of expression evaluation
The fact that CHECKSUM is sensitive to collations and prone to failure raises another question – Is CHECKSUM sensitive to the order in which it is evaluated?
To answer this question, let’s perform another simple test by creating a temporary table with two text data columns. The table also has two CHECKSUM columns, but the order in which the two data columns contribute to the checksum have been changed.
USE tempdb; GO SET NOCOUNT ON; --Create a test table CREATE TABLE #CheckSumDemo (Id INT IDENTITY (1,1), EmployeeName NVARCHAR(50), CityName NVARCHAR(50), DataCheckSum1 AS CHECKSUM(EmployeeName, CityName), DataCheckSum2 AS CHECKSUM(CityName, EmployeeName) ); GO --Insert some test data INSERT INTO #CheckSumDemo (EmployeeName, CityName) VALUES ('John','Boston'), ('Tim','London'), ('Robert','New York'), ('Jim','Orlando'); GO SELECT * FROM #CheckSumDemo; GO
Id | EmployeeName | CityName | DataCheckSum1 | DataCheckSum2 |
---|---|---|---|---|
1 | John | Boston | -1140789188 | 804583974 |
2 | Tim | London | -1718893255 | -1118192201 |
3 | Robert | New York | 1723249590 | 636918292 |
4 | Jim | Orlando | 1075367414 | 255996234 |
A quick glance at the results reveals that the two CHECKSUM values are different for every record. CHECKSUM is therefore dependent upon the order of the expressions being evaluated.
IMPORTANT: CHECKSUM can therefore be used to create a tamper protection solution.
Dependency on the order of evaluation is a desirable characteristic for a tamper protection solution. This is because a system may use columns in a non-sequential order to generate the checksum and this dependency on the evaluation sequence makes it difficult to reverse engineer the logic being used.
Summary
We have looked at two change detection mechanisms in this article: ROWVERSION and CHECKSUM. Summarizing the two methods gives us these conclusions:
- ROWVERSION fails to identify if the data actually changed. ROWVERSION is therefore, not a good choice for detecting changes to a data row
- ROWVERSION columns are not a good choice for keys, especially primary keys
- Use CHECKSUM instead of ROWVERSION for "actual" data change detection
- CHECKSUM is case in-sensitive
- CHECKSUM is collation sensitive
- CHECKSUM fails to detect symmetric changes
- CHECKSUM is sensitive to the order of expression evaluation
In the next part, we will be looking at the remaining two methods (BINARY_CHECKSUM and HASHBYTES), and finally come up with a comparitive analysis/study of all the four.
Disclaimer
- The scripts in this article are provided "as-is", i.e. without warranties of any kind and are intended for demonstration purposes only
- Request you to use these scripts for understanding and study purposes in your development environments only - they are not meant for use in production. The author is not responsible for any damage caused by misuse of these scripts
- Scripts are tested on SQL Server 2008 R2
- The opinions expressed herein are his own personal opinions and do not represent his employer’s view in anyway
About the author
Nakul Vachhrajani is a Technical Specialist and systems development professional with iGATE having a total IT experience of more than 8 years. Nakul is an active blogger with BeyondRelational.com (230+ blogs), and can also be found on forums at SQLServerCentral and BeyondRelational.com. Nakul has also been a guest columnist for SQLAuthority.com and SQLServerCentral.com. Nakul presented a webcast on the “Underappreciated Features of Microsoft SQL Server” at the Microsoft Virtual Tech Days Exclusive Webcast series (May 02-06, 2011) on May 06, 2011. He is also the author of a research paper on Database upgrade methodologies, which was published in a Computer Society of India (CSI) journal, published nationwide. In addition to his passion about SQL Server, Nakul also contributes to the academia out of personal interest. He visits various colleges and universities as an external faculty to judge project activities being carried out by the students.