One of the
problems frequently faced by the DBAs or database developers is data comparison. Specifically, data residing in separate databases. The company I am working for has 2 types of databases: Access and SQL Server. Very often I need to compare data in different tables in different databases residing on different servers. After some extensive research on the WEB I have found a great article by Jonathan Gennick on http://www.oreillynet.com/pub/a/network/2002/04/08/compsets.html)
This article is the best in representing a comparison of two sets of rows, whether from the same table or from different tables, with the goal to determine if in fact the two row sets are the same. But (there is always a ”but”) the code in the article requires knowledge of the compared tables structure, and therefore needs to be written for every table individually. My company has to adhere to some very strict FDA regulations, according to which every script for comparison must be validated. That renders pretty much any static SQL solution as not acceptable.
I looked into certain of-the-shelf comparison tools like SQL Data Compare (Red Gate Software Inc.). This solution is not exactly what I was looking for. The tool allows you to compare some SQL Server table data to another SQL Server table data if tables have the same name but located in different databases or in the same database but on different servers.
Lets define the tasks and then try to solve it as generically as possible.
Task #1 – Migration from non SQL Server to SQL Server
From time to time we are making database and application migration from Access to SQL Server Our FDA regulated environment requires data comparison and comparison reports to show that data for each table is the same after migration.
Task #2
Data should be compared in structurally the same tables with the same names but between different databases on the same or different servers. For example, some lookup/code tables with company’s specific data should match (states, languages, projects, zip codes etc.).
Task #3:
Data comparison should be done between tables with different names but with the same structure and located in the same or in different databases on the same or different servers.
Task #4
Data comparison between tables with different names and different structure. For example, when data transfer process is running, it is placing data to some generic table with columns named generically (gcol1, gcol2…). And comparison of the tables is then required to make sure that all rows are pulled out properly or for some other reason. Another example: audit tables have some additional columns compare to the original structure, which requires the comparison task to identify any current rows that are missing from the audit table.
It is becoming clear now that there is a multitude of situations where data comparison between tables is required. You might have your own task that is not falling in any of the above mentioned 4 categories.
As you can appreciate by now, the challenge here is to build some scripts that would be generic enough to support differences in databases, table names, column names, primary keys and column types. In addition to that, some restrictions must be met when you need to provide comparison tools for data residing on different types of database engines, such as Access or Oracle.
Restrictions:
a)If columns in compared tables have the same name but different count, the table with lesser number of columns should be the first table in parameter list assuming that we compare all columns in table 1 with the same columns in table 2. It can happened with audit or history tables where some additional columns are customarily added like inserted date, reason for change, login, and so on.
b)Sequence of data in columns should be the same regardless of table and column names if column names are different.
There are a few solutions to the problem.
Solution 1 is to compare rows column by column. This method is good if you need to know what columns have differences when rows are not matching. This solution is very time and server resource consuming.
In my case I don’t need to find which columns have different values. I only need to answer the question: “Is the data in 2 tables identical” in the form “Yes/No”
Solution 2 is compare the whole row in table 1 to the whole row in table 2 by converting the values in all the columns to the varchar data type and concatenate them into 1 value. It means that each row will consists of 1 large string and we will compare 1 value per row. In addition, solution 2 can be extended for performance reason. Strings can be converted to numbers and comparison will be done between 2 numbers for each row.
Let’s see a simple example.
Create table customer (customer_id int, customer_nm varchar(25) , start_dt datetime, stop_dt datetime ) Create table processed_customer (customer_id int, customer_nm varchar(25) , start_dt datetime, stop_dt datetime ) Insert into customer (customer_id, customer_nm , start_dt, stop_dt) Values (1, 'Test Customer 1', '1/1/2002', '8/4/2004') Insert into customer (customer_id, customer_nm , start_dt, stop_dt) Values (2, 'Test Customer 2', '2/1/2004', null) Insert into processed_customer (customer_id, customer_nm , start_dt, stop_dt) Values (1, 'Test Customer 1', '1/1/2002', '8/4/2004') Insert into processed_customer (customer_id, customer_nm , start_dt, stop_dt) Values (2, 'Customer 2', '2/1/2004', null)
Next query returns the records that are in table customer and
not in table processed_customer.
Select customer_id, customer_nm , start_dt, stop_dt From customer Where IsNull(cast(customer_id as varchar), ‘‘) + IsNull(cast(customer_nm as varchar) , ‘‘) + IsNull(cast(start_dt as varchar) , ‘‘) + IsNull(cast(stop_dt as varchar) , ‘‘) not in (select IsNull(cast(customer_id as varchar), ‘‘) + IsNull(cast(customer_nm as varchar) , ‘‘) + IsNull(cast(start_dt as varchar) , ‘‘) + IsNull(cast(stop_dt as varchar) , ‘‘) From processed_customer)
This query returns:
customer_id customer_nm start_dt stop_dt ----------- ---------------------- ------------------- --------------- 2 Test Customer 2 2004-02-01 00:00:00 NULL (1 row(s) affected)
If you need to identify the records that are in the table processed_customer but not in the table customer, the query would look like:
Select customer_id, customer_nm , start_dt, stop_dt From processed_customer Where IsNull(cast(customer_id as varchar), ‘‘) + IsNull(cast(customer_nm as varchar) , ‘‘) + IsNull(cast(start_dt as varchar) , ‘‘) + IsNull(cast(stop_dt as varchar) , ‘‘) not in (select IsNull(cast(customer_id as varchar), ‘‘) + IsNull(cast(customer_nm as varchar) , ‘‘) + IsNull(cast(start_dt as varchar) , ‘‘) + IsNull(cast(stop_dt as varchar) , ‘‘) From customer)
This is a simple example of solution 2. All null values in columns replaced with a space (not with an empty string) in order to avoid the side effects of. For example:
CustomerID | CustomerFNM | CustomerLNM |
1 | NULL | Customer1 |
CustomerID | CustomerFNM | CustomerLNM |
1 | Customer1 | NULL |
Both tables will show output as ‘1Customer1’ if NULL will be replaced with empty space and declared identical, which is incorrect. In contrast, the first table will show ‘1 Customer1’ and the second ‘1Customer1’, which will enable us to detect a difference.
This method can be used as static if you know the table names and the columns.
When comparison is performed for some production tables, I assume that there is always a primary key in a table and it will be no duplicate records. Otherwise it may be necessary to verify that there are no duplicate records using the next query.
Select IsNull(cast(customer_id as varchar), ‘‘) + IsNull(cast(customer_nm as varchar) , ‘‘) + IsNull(cast(start_dt as varchar) , ‘‘) + IsNull(cast(stop_dt as varchar) , ‘‘), Count(*) From customer Group by IsNull(cast(customer_id as varchar), ‘‘) + IsNull(cast(customer_nm as varchar) , ‘‘) + IsNull(cast(start_dt as varchar) , ‘‘) + IsNull(cast(stop_dt as varchar) , ‘‘) Having count(*) > 1
Now lets attempt to write a more generic data comparison script.
--========================================================================= CREATE procedure dbo.COMPARE_TABLES_DATA @tbl_1 varchar(55), @dbnm_1 varchar(50) , @tbl_2 varchar(55), @dbnm_2 varchar(50), @col_order_same_flag char(1), -- Y if column order the same regardless of name; -- N - if order is different but names are the same @col_name_same_flag char(1) -- First table has all the column names the same as the second one, -- but the second one can have different order and more columns than the -- first one as begin declare @cmd varchar(8000), @namepart varchar(7000), @otherA varchar(400) , @otherS varchar(255), @rowscnt int, @minid int, @maxid int , @namepart2 varchar(7000) create table #tmpA (textA varchar(8000) ) create table #tmpS (textS varchar(8000) ) SET NOCOUNT ON IF (@col_name_same_flag <> 'Y' and @col_order_same_flag <> 'Y') begin raiserror (' Column name or column order must be the same ', 16,1) return end select @minid = 0, @maxid = 0, @namepart = '', @namepart2 = '' -- compare number of rows in tables. create table #tmp_rows (cnt int, tid int identity(1,1) ) set @cmd = ' select count(*) from ' + @dbnm_1 + '.dbo.' + @tbl_1 insert into #tmp_rows(cnt) exec (@cmd) set @cmd = ' select count(*) from ' + @dbnm_2 + '.dbo.' + @tbl_2 insert into #tmp_rows(cnt) exec (@cmd) IF ( (select cnt from #tmp_rows where tid = 1) <> (select cnt from #tmp_rows where tid = 2) ) begin print 'Data in table 1 and 2 are not matching ' return end create table #tmp(tid int identity(1,1), colnm varchar(80), colnm1 varchar(50)) create table #tmp_2(tid int identity(1,1), colnm varchar(80), colnm1 varchar(50)) select @cmd = ' select ''ISNULL(cast('' + sc.name + '' as varchar), '''''''') '', sc.name from ' + @dbnm_1 + '..syscolumns sc inner join ' + @dbnm_1 + '..sysobjects so on so.id = sc.id where so.name = ''' + @tbl_1 + '''' + ' order by sc.colid ' insert into #tmp(colnm,colnm1) exec(@cmd) select @minid = 1, @maxid = max(tid) from #tmp while (@minid <= @maxid) begin select @namepart = @namepart + colnm + ' + ' from #tmp where tid = @minid set @minid = @minid + 1 end -- take away last character '+' or ',' select @namepart = left(rtrim(ltrim(@namepart)), len(ltrim(rtrim(@namepart))) - 1) IF (@col_name_same_flag <> 'Y' and @col_order_same_flag = 'Y') BEGIN -- start IF for table 2 select @cmd = ' select ''ISNULL(cast('' + sc.name + '' as varchar), '''''''') '', sc.name from ' + @dbnm_2 + '..syscolumns sc inner join ' + @dbnm_2 + '..sysobjects so on so.id = sc.id where so.name = ''' + @tbl_2 + '''' + ' order by sc.colid ' insert into #tmp_2(colnm,colnm1) exec(@cmd) set @minid = 1 while (@minid <= @maxid) begin select @namepart2 = @namepart2 + colnm + ' + ' from #tmp_2 where tid = @minid select @minid = @minid + 1 end -- take away last character '+' or ',' select @namepart2 = left(rtrim(ltrim(@namepart2)), len(ltrim(rtrim(@namepart2))) - 1) END ELSE begin set @namepart2 = @namepart end select @otherS = ' FROM ' + @dbnm_1 + '..' + @tbl_1 select @otherA = ' FROM ' + @dbnm_2 + '..' + @tbl_2 set @cmd = ' select ' + @namepart + @otherS insert into #tmpS (textS) exec (@cmd) select @cmd = ' select ' + @namepart2 + @otherA insert into #tmpA (textA) exec (@cmd) --Number of records that exists in table 1 and not exists in table 2 select @rowscnt = count(*) from #tmpS where textS not in (select textA from #tmpA) -- get records if some missing IF ( @rowscnt > 0 ) begin --print 'Records that exists in table 1 ' + @tbl_1 + ' -- and not exists intable 2 ' + @tbl_2 select textS from #tmpS where textS not in (select textA from #tmpA) end --Number of records that exists in table 2 and not exists in table 1 select @rowscnt = count(*) from #tmpA where textA not in (select textS from #tmpS) -- get records if some missing IF ( @rowscnt > 0 ) begin -- print 'Records that exists in table 2 ' + @tbl_2 + ' -- and not exists in table 1 ' + @tbl_1 select textA from #tmpA where textA not in (select textS from #tmpS) end SET NOCOUNT OFF end -- ============================================================================
Solution 3 is to join the tables on all columns and if the result of the join has the same number of rows as each of them, then tables are matching.
For the tables above result will look as
Select count(*) From processed_customer pc Inner join customer c ON IsNull(cast(pc.customer_id as varchar), ‘‘) = IsNull(cast(c.customer_id as varchar), ‘‘) and IsNull(cast(pc.customer_nm as varchar) , ‘‘) = IsNull(cast(c.customer_nm as varchar) , ‘‘) and IsNull(cast(pc.start_dt as varchar) , ‘‘) = IsNull(cast(c.start_dt as varchar) , ‘‘) and IsNull(cast(pc.stop_dt as varchar) , ‘‘) = IsNull(cast(c.stop_dt as varchar) , ‘‘)
If number of records is not matching with number of records in the tables then there are some unmatched records. Otherwise, tables are matching. Generic procedure for the solution is in file Compare_Tables_Data_Solution3.txt
Those solutions are workable but are time consuming for a tables with 1,000,000+ rows and many columns. I have ran those procedures for tables with 1,300,000 rows and about 60 columns. Comparison
has been completed in about 4 min.
Solution 4 (Performance boost)
You may notice that converting to a string and then concatenating several strings and then comparing 2 long strings may not be very efficient as far as CPU utilization goes. The reason string concatenation may get slow is that it will most likely require a memory allocation for the result and discarding memory used for operands, unless of course you are using an optimizing compiler. This though can lead us to the idea of trying to calculate a hash against data in each compared row and only compare hash values. This approach may prove valid only in situations when a) your rows have a lot of columns to be compared and b) run time is an issue, which is not always the case for processes like the ones I am describing here. Still, let me illustrate how you could go about implementing the hash idea.
It can be done by converting values to a varbinary and then to a big integer. Let’s use customer and processed_customer tables for this example:
Select count(*) From processed_customer pc Inner join customer c ON convert(bigint,convert(varbinary(255), ISNULL( pc.customer_id , ‘ ‘) ) ) = convert(bigint,convert(varbinary(255), ISNULL( c.customer_id , ‘ ‘) ) ) and convert(bigint,convert(varbinary(255), ISNULL( pc.customer_nm , ‘ ‘) ) ) = convert(bigint,convert(varbinary(255), ISNULL( c.customer_nm , ‘ ‘) ) ) and convert(bigint,convert(varbinary(255), ISNULL( pc.start_dt , ‘ ‘) ) ) = convert(bigint,convert(varbinary(255), ISNULL( c.start_dt , ‘ ‘) ) ) and convert(bigint,convert(varbinary(255), ISNULL( pc.stop_dt , ‘ ‘) ) ) = convert(bigint,convert(varbinary(255), ISNULL( c.stop_dt , ‘ ‘) ) )
Generic procedure for the solution is in file Compare_Tables_Data_Solution4.txt. I have run these procedures for the same tables as solution 3 with 1,300,000 rows and about 60 columns. Comparison has been completed in about 75 seconds! It gives us a 3-4 times performance boost. The only danger with this solution, even though the chances are very slim, is that the different original values might produce the same hash and result will be wrong. This is really the case only for the tables with 1 or 2 columns. So, this method is recommended only for tables with 4+ columns and many rows to boost performance.
For the tables with 1-3 columns it is safer to use solution 1-2-3. And generally speaking performance will be the same or very close.
Again, let me emphasize that all those solutions are good for the situations when you only need to answer the question: “Is the data in 2 tables identical” in the form “Yes/No”, not for when you need to locate the differences. For the reason of clarity all script don’t have error handlers, input variables verification, and some other advanced features.
Conclusion
These ideas can be used not only for table to table data comparison but for many different types of data comparison. For example, compare values in some table columns to values in some other table columns. The stored procedure for solution 2 can be modified to output non matching records with one recordset. You can add server names as parameters of the stored procedures and modify stored procedures to be able to work between servers (If servers are linked). Or you can modify it in the way to utilize comparison between Access (or other database engine) and SQL Server. See example COMPARE_TABLES_DATA_ACCESS.txt
Example for the Access database shows how to compare data for exactly the same table, but if table structure for the second table somehow can be obtained within stored procedure then stored procedure can be extended the same way as example for data comparison between different tables in SQL Server shows.