Comparison between two tables

Question

Post reply

Comparison between two tables

Mike Scalise

SSCrazy Eights

Points: 8145
More actions
November 4, 2011 at 7:25 am

#259544

I haven't really explored this in any great depth, so I thought I'd put it out there. In general, when comparing field values across two tables to identify and return records with differences, is it better to:
1) Join the tables and in either the JOIN or WHERE clause add criteria to see where table1.field1 <> table2.field1, etc.
OR
2) Select all fields to compare from table 1
UNION ALL
Select all fields to compare from table 2
Then group by the fields, do a count(*) and then where count(*) > 1 you know that the records were the same, so you can discard them.
Mike
Mike Scalise, PMP
https://www.michaelscalise.com

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Lowell SSC Guru Points: 323518 More actions · Answer 1

i think you want to look at the EXCEPT operator;

if all the columns are the same, it might look like this:

--find what doesn't match from Table2

SELECT PK,ColumnList FROM Table1

EXCEPT

SELECT PK,ColumnList FROM Table2

EXCEPT

SELECT PK,ColumnList FROM Table1

now, if the Columns to compare might match, but the PK is diferent, you just include the columns you want to specifically compare:

--find what doesn't match from Table2

SELECT ColumnList FROM Table1

EXCEPT

SELECT ColumnList FROM Table2

EXCEPT

SELECT ColumnList FROM Table1

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Cadavre SSC-Forever Points: 41690 More actions · Answer 2

I don't know about "better", but generally when I'm comparing two algorithms that should produce the same result-set I'd do something like this: -

SELECT MIN(name) AS TableName, --all columns here

FROM (SELECT 'First table name' AS name, --all columns here

FROM --table name here

UNION ALL

SELECT 'Second table name' AS name, --all columns here

FROM --table name here

) work

GROUP BY --all columns here

HAVING COUNT(*) = 1

I'd implement it like this: -

IF object_id('tempdb..#table1') IS NOT NULL

BEGIN

DROP TABLE #table1

END

IF object_id('tempdb..#table2') IS NOT NULL

BEGIN

DROP TABLE #table2

END

SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

(ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt

INTO #table1

FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

(ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt

INTO #table2

FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

--==Compare result-sets

SELECT MIN(name) AS TableName, ID, randomSmallInt

FROM (SELECT '#table1' AS name, ID, randomSmallInt

FROM #table1

UNION ALL

SELECT '#table2' AS name, ID, randomSmallInt

FROM #table2) work

GROUP BY ID, randomSmallInt

HAVING COUNT(*) = 1

ORDER BY ID

Bear in mind that due to me using random numbers between 1 and 10, you'll probably over a million results returned if you execute the above SQL.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 3

Lowell (11/4/2011)
i think you want to look at the EXCEPT operator;
if all the columns are the same, it might look like this:
--find what doesn't match from Table2
SELECT PK,ColumnList FROM Table1
EXCEPT
SELECT PK,ColumnList FROM Table2
SELECT PK,ColumnList FROM Table2
EXCEPT
SELECT PK,ColumnList FROM Table1
now, if the Columns to compare might match, but the PK is diferent, you just include the columns you want to specifically compare:
--find what doesn't match from Table2
SELECT ColumnList FROM Table1
EXCEPT
SELECT ColumnList FROM Table2
SELECT ColumnList FROM Table2
EXCEPT
SELECT ColumnList FROM Table1

Lowell,

So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):

Table 1 fields

-------------

Table 2 fields

is usually better than this type where it's all one row (my first example):

Table 1 fields | Table 2 fields

And that using the EXCEPT clause makes the most sense?

Mike Scalise, PMP
https://www.michaelscalise.com

Lowell SSC Guru Points: 323518 More actions · Answer 4

mikes84 (11/4/2011)[hrLowell,
So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):
Table 1 fields
-------------
Table 2 fields
is usually better than this type where it's all one row (my first example):
Table 1 fields | Table 2 fields
And that using the EXCEPT clause makes the most sense?

I'll go to the SQL fallback answer: it depends.

if you just want to know what items are duplicated, yours and cadavre's answer will work; my example identifies items that are differnet or missing between the two tables...

so when i saw "compare two tables", that was my gut reaction...it's a pretty easy to use operator as well, so that's helpful

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Cadavre SSC-Forever Points: 41690 More actions · Answer 5

Lowell (11/4/2011)
mikes84 (11/4/2011)[hrLowell,
So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):
Table 1 fields
-------------
Table 2 fields
is usually better than this type where it's all one row (my first example):
Table 1 fields | Table 2 fields
And that using the EXCEPT clause makes the most sense?
I'll go to the SQL fallback answer: it depends.
if you just want to know what items are duplicated, yours and cadavre's answer will work; my example identifies items that are differnet or missing between the two tables...
so when i saw "compare two tables", that was my gut reaction...it's a pretty easy to use operator as well, so that's helpful

I'm probably over-thinking this, so could you take a look and let me know what I'm doing wrong?

If I wanted to identify that items that were missing/different between the two tables and put them into one result-set, my instinct using EXCEPT would be like this: -

SELECT * FROM (

SELECT 'Table1' AS TableName, ID, randomSmallInt

FROM (SELECT ID, randomSmallInt

FROM #table1

EXCEPT

SELECT ID, randomSmallInt

FROM #table2) a

UNION ALL

SELECT 'Table2', ID, randomSmallInt

FROM (SELECT ID, randomSmallInt

FROM #table2

EXCEPT

SELECT ID, randomSmallInt

FROM #table1) b ) c

ORDER BY c.ID

This method doubles the IO that the method I posted requires, so I'm sure I'm heading down the wrong path.

Here's my test environment: -

IF object_id('tempdb..#table1') IS NOT NULL

BEGIN

DROP TABLE #table1

END

IF object_id('tempdb..#table2') IS NOT NULL

BEGIN

DROP TABLE #table2

END

SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

(ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt

INTO #table1

FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

(ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt

INTO #table2

FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

PRINT '========== UNION ALL =========='

SET STATISTICS IO ON

SET STATISTICS TIME ON

SELECT MIN(name) AS TableName, ID, randomSmallInt

FROM (SELECT '#table1' AS name, ID, randomSmallInt

FROM #table1

UNION ALL

SELECT '#table2' AS name, ID, randomSmallInt

FROM #table2) work

GROUP BY ID, randomSmallInt

HAVING COUNT(*) = 1

ORDER BY ID

SET STATISTICS TIME OFF

SET STATISTICS IO OFF

PRINT '========== EXCEPT =========='

SET STATISTICS IO ON

SET STATISTICS TIME ON

SELECT * FROM (

SELECT 'Table1' AS TableName, ID, randomSmallInt

FROM (SELECT ID, randomSmallInt

FROM #table1

EXCEPT

SELECT ID, randomSmallInt

FROM #table2) a

UNION ALL

SELECT 'Table2', ID, randomSmallInt

FROM (SELECT ID, randomSmallInt

FROM #table2

EXCEPT

SELECT ID, randomSmallInt

FROM #table1) b ) c

ORDER BY c.ID

SET STATISTICS TIME OFF

SET STATISTICS IO OFF

========== UNION ALL ==========

(1799712 row(s) affected)

Table '#table1_____________________________________________________________________________________________________________0000000D3937'. Scan count 5, logical reads 2102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#table2_____________________________________________________________________________________________________________0000000D3938'. Scan count 5, logical reads 2102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 6750 ms, elapsed time = 15694 ms.

========== EXCEPT ==========

(1799712 row(s) affected)

Table '#table1_____________________________________________________________________________________________________________0000000D3937'. Scan count 10, logical reads 4204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#table2_____________________________________________________________________________________________________________0000000D3938'. Scan count 10, logical reads 4204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 9704 ms, elapsed time = 17470 ms.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Lowell SSC Guru Points: 323518 More actions · Answer 6

Cadavre I love how you add the million row test harnesses to your examples...

I may start adding that to some of my posts as well in the future.

Thanks buddy!

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 7

Here's the situation I'm in:

I have the same 10 fields coming from two different systems, system1 and system2. I would like to do a comparison of all of the fields between the two systems to see if anything is different. Let's say out of 100 people from the two systems, 20 are different. I want to keep track of the fact that something between the two records is different, but there are 4 specific fields that are really important to me, and I need to go a step further and flag if any of those are different for the 20 people. Ultimately, I want to store 40 records in a log table (a system 1 record and a system 2 record for each person), with a flag for any records where any of those 4 important fields changed. My other challenge is that I want to ensure that I save the name of the system each record came from, which could pose a problem when using a union or except since that would instant differentiate each record.

Does that make sense?

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

Cadavre SSC-Forever Points: 41690 More actions · Answer 8

Lowell (11/4/2011)
Cadavre I love how you add the million row test harnesses to your examples...
I may start adding that to some of my posts as well in the future.
Thanks buddy!

It's an idea I "borrowed" from Jeff 😀

mikes84 (11/4/2011)
Here's the situation I'm in:
I have the same 10 fields coming from two different systems, system1 and system2. I would like to do a comparison of all of the fields between the two systems to see if anything is different. Let's say out of 100 people from the two systems, 20 are different. I want to keep track of the fact that something between the two records is different, but there are 4 specific fields that are really important to me, and I need to go a step further and flag if any of those are different for the 20 people. Ultimately, I want to store 40 records in a log table (a system 1 record and a system 2 record for each person), with a flag for any records where any of those 4 important fields changed. My other challenge is that I want to ensure that I save the name of the system each record came from, which could pose a problem when using a union or except since that would instant differentiate each record.
Does that make sense?
Mike

From my brief testing, the UNION method is slightly faster, because it only has to pass the tables once whereas to get the data you're asking for with EXCEPT would require two passes.

SELECT MIN(name) AS TableName, ID, randomSmallInt

FROM (SELECT '#table1' AS name, ID, randomSmallInt

FROM #table1

UNION ALL

SELECT '#table2' AS name, ID, randomSmallInt

FROM #table2) work

GROUP BY ID, randomSmallInt

HAVING COUNT(*) = 1

ORDER BY ID

Notice also that you can name each data-set in the UNION method (above I've called them "#table1" and "#table2").

Supply DDL and sample data and I'll code up an example based on your table structures.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/