Automate Data Purity Investigation
A couple of weeks ago I wrote an article at www.sqlcopilot.com/dbcc-checkdb-with-data_purity.html about the DATA_PURITY option of DBCC CHECKDB, and how to identify the affected columns.
One of the ways to find affected columns is to run a SELECT with a WHERE clause to return out-of-range data. But this doesn't always work. It is possible for data to be within a valid range but still fail the data purity check. It is also possible for the column to be totally corrupt, resulting in an arithmetic overflow error when you attempt to SELECT it.
The method I showed in my article for when a SELECT doesn't help was to use DBCC PAGE. But if you have more than a few columns with invalid data, it can be quite time consuming to do. This was the case for me recently when I migrated a database from SQL Server 2000 to 2008 R2 and the CHECKDB found 540,000 data purity errors!
A SELECT for values outside the range for the datatype (decimal(23, 8) in this case) didn't return any data and so DBCC PAGE was the only option - but obviously there was no way I could run it manually 540,000 times!
The script attached to this article was my solution to the problem.
1. It runs DBCC CHECKDB(dbname) WITH DATA_PURITY, NO_INFOMSGS, TABLERESULTS and captures the results in a temporary table
2. It extracts the page, slot, object id, column name and data type for each row returned
3. It loops through the results perfoming a DBCC PAGE for each one to get the primary key values of the rows containing the invalid data.
Within this loop, there is also code to derive the condition for the primary key (allowing for multi-column keys). The key value is then used to query the table to get the current value of the affected column. The results are stored in a table, tmp_final_results. This can be dropped once all investigation is complete.
There is further code, commented out, that I will explain shortly.
How to use the script
This is a 3 stage process.
1. First run the script within the context of the database to check.
It may take several hours to run, especially if you have a few hundred thousand data purity errors.
When it has finished, the table tmp_final_results will contain a row for each out-of-range column.
2. Use the first commented out section of the script to view the results.
Note: the conversion of [Value] to varchar is necessary, as attempting to retrieve some out-of-range data may result in arithmetic overflow errors. By converting them to varchar, these are displayed as -1.#IND instead.
3. Fix the data
Now you have to decide what to set each one to.
If you are lucky you may find that all the values are within a valid range for their datatypes, so a straightforward UPDATE to their existing value will fix the problem. The third section of my script (also commented out) generates an UPDATE statement for each column.
However, you may find out-of-range values, and some may be displayed as -1.#IND. For these you have to decide what they should be set to, and this means speaking with someone who knows the application well and getting them to look at your data.
4. Tidy up
The final commented out section needs to be run to drop the tables generated by the script.
Please remember that this script is a tool to aid in the identification of columns that have failed a data purity check. You should not blindly update the columns - ensure you are 100% confident of the correct values. However, the code in section 3 of the script is available if you decide the suggested values can be used. And of course, backup the database before making any data changes.
/*
Data Purity Investigation
STEP 1
Run this script within the context of the database you want to check (use [dbname];)
Once it has finished, which may take several hours if there are a lot of data purity errors, run the commented out SQL under step 2
to view the results, or use step 3 to generate UPDATE statements (be careful - read the article first).
Finally, run the commented out code under step 4 to tidy up.
*/SET NOCOUNT ON;
SET ANSI_WARNINGS OFF;
DECLARE
@db sysname,
@ver varchar(20),
@sql varchar(max),
@record varchar(255)
SELECT @db = DB_NAME()
SELECT @ver = CONVERT(varchar(20), SERVERPROPERTY('ProductVersion'))
SELECT @sql = 'DBCC CHECKDB (''' + @db + ''') WITH DATA_PURITY, NO_INFOMSGS, TABLERESULTS'
-- Create temporary tables. These will need to be dropped manually once the results of the script have analysed
CREATE TABLE tmp_dbcc_results
(
[ObjectId] int NULL,
[Column] sysname NULL,
[File] int NULL,
[Page] int NULL,
[Slot] int NULL
)
CREATE TABLE tmp_page_results
(
[ParentObject] sysname NULL,
[Object] sysname NULL,
[Field] sysname NULL,
[Value] sysname NULL
)
CREATE NONCLUSTERED INDEX idx_slot
ON tmp_page_results([Object], [Field])
INCLUDE([Value])
CREATE TABLE tmp_final_results
(
[Id] int IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
[Table] sysname NULL,
[Key] varchar(1000) NULL,
[Column] sysname NULL,
[Type] varchar(50) NULL,
[Value] sql_variant NULL
)
CREATE NONCLUSTERED INDEX idx_val
ON tmp_final_results([Value], [Table])
CREATE TABLE tmp_primary_keys
(
[Table] sysname, [Column] sysname, [Type] varchar(50)
)
-- Perform the data purity check. The output is version dependent, hence the condition code
IF @ver LIKE '9%' OR @ver LIKE '10%'
BEGIN
CREATE TABLE tmp_dbcc_results_to_2008_r2
(
[Error] int NULL,
[Level] int NULL,
[State] int NULL,
[MessageText] nvarchar(2048) NULL,
[RepairLevel] nvarchar(100) NULL,
[Status] int NULL,
[DbId] int NULL,
[ObjectId] int NULL,
[IndexId] int NULL,
[PartitionId] bigint NULL,
[AllocUnitId] bigint NULL,
[File] int NULL,
[Page] int NULL,
[Slot] int NULL,
[RefFile] int NULL,
[RefPage] int NULL,
[RefSlot] int NULL,
[Allocation] int NULL
)
INSERT tmp_dbcc_results_to_2008_r2
EXEC(@sql)
INSERT tmp_dbcc_results
SELECT [ObjectId],SUBSTRING(MessageText, CHARINDEX('). Column "', MessageText) + 11, CHARINDEX('"', MessageText, CHARINDEX('). Column "', MessageText) + 11) - CHARINDEX('). Column "', MessageText) - 11),[File],[Page],[Slot]
FROM tmp_dbcc_results_to_2008_r2
WHERE [Error] = 2570
DROP TABLE tmp_dbcc_results_to_2008_r2
END
ELSE IF @ver LIKE '11%'
BEGIN
CREATE TABLE tmp_dbcc_results_from_2012
(
[Error] int NULL,
[Level] int NULL,
[State] int NULL,
[MessageText] nvarchar(2048) NULL,
[RepairLevel] nvarchar(100) NULL,
[Status] int NULL,
[DbId] int NULL,
[DbFragId] int NULL,
[ObjectId] int NULL,
[IndexId] int NULL,
[PartitionId] bigint NULL,
[AllocUnitId] bigint NULL,
[RidDbId] int NULL,
[RidPruId] int NULL,
[File] int NULL,
[Page] int NULL,
[Slot] int NULL,
[RefDbId] int NULL,
[RefPruId] int NULL,
[RefFile] int NULL,
[RefPage] int NULL,
[RefSlot] int NULL,
[Allocation] int NULL
)
INSERT tmp_dbcc_results_from_2012
EXEC(@sql)
INSERT tmp_dbcc_results
SELECT [ObjectId],SUBSTRING(MessageText, CHARINDEX('). Column "', MessageText) + 11, CHARINDEX('"', MessageText, CHARINDEX('). Column "', MessageText) + 11) - CHARINDEX('). Column "', MessageText) - 11),[File],[Page],[Slot]
FROM tmp_dbcc_results_from_2012
WHERE [Error] = 2570
DROP TABLE tmp_dbcc_results_from_2012
END
-- Run DBCC PAGE for each error.
-- CHECKDB only generates one error per row, even if there are multiple columns with data purity issues.
-- This code checks the DBCC PAGE output to ensure these missing errors are captured.
DECLARE
@ObjectId int,
@Column nvarchar(255),
@File int,
@Page int,
@Slot int
DECLARE c CURSOR FOR
SELECT [ObjectId],[Column],[File],[Page],[Slot]
FROM tmp_dbcc_results
OPEN c
FETCH NEXT FROM c INTO @ObjectId, @Column, @File, @Page, @Slot
WHILE @@FETCH_STATUS = 0
BEGIN
SELECT @sql = 'DBCC PAGE (' + @db + ', ' + CONVERT(varchar(10), @File) + ', ' + CONVERT(varchar(20), @Page) + ', 3) WITH TABLERESULTS'
INSERT tmp_page_results
EXEC (@sql)
IF NOT EXISTS(SELECT 1
FROM tmp_primary_keys
WHERE [Table] = OBJECT_NAME(@ObjectId))
BEGIN
INSERT tmp_primary_keys
(
[Table],[Column],[Type]
)
SELECT OBJECT_NAME(@ObjectId),cu.COLUMN_NAME,t.name
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE cu
JOIN sys.columns c ON c.object_id = @ObjectId
AND c.name = cu.COLUMN_NAME
JOIN sys.types t ON c.user_type_id = t.user_type_id
JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc ON tc.TABLE_NAME = OBJECT_NAME(@ObjectId)
AND tc.CONSTRAINT_NAME = cu.CONSTRAINT_NAME
AND tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
END
SELECT @record = 'Slot ' + CONVERT(varchar(10), @Slot) + ' %'
INSERT tmp_final_results
(
[Table],[Key],[Column],[Type],[Value]
)
SELECT OBJECT_NAME(@ObjectId),(SELECT LEFT(x.pk, LEN(x.pk) - 4)
FROM (SELECT '[' + pk.[Column] + '] = ' + CASE
WHEN pk.[Type] LIKE '%char%'
OR pk.[Type] LIKE '%date%' THEN ''''
ELSE ''
END + pr.Value + CASE
WHEN pk.[Type] LIKE '%char%'
OR pk.[Type] LIKE '%date%' THEN ''''
ELSE ''
END + ' and ' AS [text()]
FROM tmp_primary_keys pk
JOIN tmp_page_results pr ON pr.[Object] LIKE @record
AND pr.[Field] = pk.[Column]
WHERE pk.[Table] = OBJECT_NAME(@ObjectId)
FOR xml path('')) x(pk)),ic.Field,ict.name,NULL
FROM tmp_page_results ic
JOIN sys.columns icc ON icc.object_id = @ObjectId
AND icc.name = ic.[Field]
JOIN sys.types ict ON icc.user_type_id = ict.user_type_id
WHERE ic.[Object] LIKE @record
AND ( ic.[VALUE] = 'INVALID COLUMN VALUE'
OR ic.[Field] = @Column )
SELECT @sql = (SELECT 'update tmp_final_results set [Value] = (select [' + [Column] + '] from [' + [Table] + '] where ' + [Key] + ') where [Id] = ' + CONVERT(varchar(10), [Id]) + ';'
FROM tmp_final_results
WHERE [Table] = OBJECT_NAME(@ObjectId)
AND [Value] IS NULL
FOR xml path(''))
EXEC(@sql)
TRUNCATE TABLE tmp_page_results
FETCH next FROM c INTO @ObjectId, @Column, @File, @Page, @Slot
END
CLOSE c
DEALLOCATE c
/*
-- STEP 2
-- Run this to view the results.
-- The conversion to varchar is necessary in some cases to prevent arithmetic overflow errors
SELECT [Table], [Key], [Column], [Type], CONVERT(varchar(255), [Value]) AS [Value]
FROM tmp_final_results
*/
/*
-- STEP 3
-- Run this to generate the UPDATE statements that will fix the data purity issues
-- IMPORTANT - READ ACCOMPANYING ARTICLE FIRST - DO NOT RUN THE GENERATED UPDATE STATEMENTS UNLESS 100% SURE THEY ARE CORRECT
SELECT 'UPDATE [' + [Table] + '] SET [' + [Column] + '] = ' + CASE
WHEN [Type] LIKE '%char%'
OR [Type] LIKE '%date%' THEN ''''
ELSE ''
END + CONVERT(varchar(255), [Value]) + CASE
WHEN [Type] LIKE '%char%'
OR [Type] LIKE '%date%' THEN ''''
ELSE ''
END + ' WHERE ' + [Key] + '; '
FROM tmp_final_results
*/
/*
-- STEP 4
-- Finally run this bit to drop all temporary tables
DROP TABLE tmp_primary_keys
DROP TABLE tmp_final_results
DROP TABLE tmp_page_results
DROP TABLE tmp_dbcc_results
*/