November 4, 2011 at 7:25 am
I haven't really explored this in any great depth, so I thought I'd put it out there. In general, when comparing field values across two tables to identify and return records with differences, is it better to:
1) Join the tables and in either the JOIN or WHERE clause add criteria to see where table1.field1 <> table2.field1, etc.
OR
2) Select all fields to compare from table 1
UNION ALL
Select all fields to compare from table 2
Then group by the fields, do a count(*) and then where count(*) > 1 you know that the records were the same, so you can discard them.
Mike
Mike Scalise, PMP
https://www.michaelscalise.com
November 4, 2011 at 7:31 am
i think you want to look at the EXCEPT operator;
if all the columns are the same, it might look like this:
--find what doesn't match from Table2
SELECT PK,ColumnList FROM Table1
EXCEPT
SELECT PK,ColumnList FROM Table2
SELECT PK,ColumnList FROM Table2
EXCEPT
SELECT PK,ColumnList FROM Table1
now, if the Columns to compare might match, but the PK is diferent, you just include the columns you want to specifically compare:
--find what doesn't match from Table2
SELECT ColumnList FROM Table1
EXCEPT
SELECT ColumnList FROM Table2
SELECT ColumnList FROM Table2
EXCEPT
SELECT ColumnList FROM Table1
Lowell
November 4, 2011 at 7:42 am
I don't know about "better", but generally when I'm comparing two algorithms that should produce the same result-set I'd do something like this: -
SELECT MIN(name) AS TableName, --all columns here
FROM (SELECT 'First table name' AS name, --all columns here
FROM --table name here
UNION ALL
SELECT 'Second table name' AS name, --all columns here
FROM --table name here
) work
GROUP BY --all columns here
HAVING COUNT(*) = 1
I'd implement it like this: -
IF object_id('tempdb..#table1') IS NOT NULL
BEGIN
DROP TABLE #table1
END
IF object_id('tempdb..#table2') IS NOT NULL
BEGIN
DROP TABLE #table2
END
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,
(ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt
INTO #table1
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,
(ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt
INTO #table2
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3
--==Compare result-sets
SELECT MIN(name) AS TableName, ID, randomSmallInt
FROM (SELECT '#table1' AS name, ID, randomSmallInt
FROM #table1
UNION ALL
SELECT '#table2' AS name, ID, randomSmallInt
FROM #table2) work
GROUP BY ID, randomSmallInt
HAVING COUNT(*) = 1
ORDER BY ID
Bear in mind that due to me using random numbers between 1 and 10, you'll probably over a million results returned if you execute the above SQL.
November 4, 2011 at 7:48 am
Lowell (11/4/2011)
i think you want to look at the EXCEPT operator;if all the columns are the same, it might look like this:
--find what doesn't match from Table2
SELECT PK,ColumnList FROM Table1
EXCEPT
SELECT PK,ColumnList FROM Table2
SELECT PK,ColumnList FROM Table2
EXCEPT
SELECT PK,ColumnList FROM Table1
now, if the Columns to compare might match, but the PK is diferent, you just include the columns you want to specifically compare:
--find what doesn't match from Table2
SELECT ColumnList FROM Table1
EXCEPT
SELECT ColumnList FROM Table2
SELECT ColumnList FROM Table2
EXCEPT
SELECT ColumnList FROM Table1
Lowell,
So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):
Table 1 fields
-------------
Table 2 fields
is usually better than this type where it's all one row (my first example):
Table 1 fields | Table 2 fields
And that using the EXCEPT clause makes the most sense?
Mike Scalise, PMP
https://www.michaelscalise.com
November 4, 2011 at 7:52 am
mikes84 (11/4/2011)[hrLowell,
So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):
Table 1 fields
-------------
Table 2 fields
is usually better than this type where it's all one row (my first example):
Table 1 fields | Table 2 fields
And that using the EXCEPT clause makes the most sense?
I'll go to the SQL fallback answer: it depends.
if you just want to know what items are duplicated, yours and cadavre's answer will work; my example identifies items that are differnet or missing between the two tables...
so when i saw "compare two tables", that was my gut reaction...it's a pretty easy to use operator as well, so that's helpful
Lowell
November 4, 2011 at 7:58 am
Lowell (11/4/2011)
mikes84 (11/4/2011)[hrLowell,
So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):
Table 1 fields
-------------
Table 2 fields
is usually better than this type where it's all one row (my first example):
Table 1 fields | Table 2 fields
And that using the EXCEPT clause makes the most sense?
I'll go to the SQL fallback answer: it depends.
if you just want to know what items are duplicated, yours and cadavre's answer will work; my example identifies items that are differnet or missing between the two tables...
so when i saw "compare two tables", that was my gut reaction...it's a pretty easy to use operator as well, so that's helpful
I'm probably over-thinking this, so could you take a look and let me know what I'm doing wrong?
If I wanted to identify that items that were missing/different between the two tables and put them into one result-set, my instinct using EXCEPT would be like this: -
SELECT * FROM (
SELECT 'Table1' AS TableName, ID, randomSmallInt
FROM (SELECT ID, randomSmallInt
FROM #table1
EXCEPT
SELECT ID, randomSmallInt
FROM #table2) a
UNION ALL
SELECT 'Table2', ID, randomSmallInt
FROM (SELECT ID, randomSmallInt
FROM #table2
EXCEPT
SELECT ID, randomSmallInt
FROM #table1) b ) c
ORDER BY c.ID
This method doubles the IO that the method I posted requires, so I'm sure I'm heading down the wrong path.
Here's my test environment: -
IF object_id('tempdb..#table1') IS NOT NULL
BEGIN
DROP TABLE #table1
END
IF object_id('tempdb..#table2') IS NOT NULL
BEGIN
DROP TABLE #table2
END
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,
(ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt
INTO #table1
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,
(ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt
INTO #table2
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3
PRINT '========== UNION ALL =========='
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT MIN(name) AS TableName, ID, randomSmallInt
FROM (SELECT '#table1' AS name, ID, randomSmallInt
FROM #table1
UNION ALL
SELECT '#table2' AS name, ID, randomSmallInt
FROM #table2) work
GROUP BY ID, randomSmallInt
HAVING COUNT(*) = 1
ORDER BY ID
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
PRINT '========== EXCEPT =========='
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT * FROM (
SELECT 'Table1' AS TableName, ID, randomSmallInt
FROM (SELECT ID, randomSmallInt
FROM #table1
EXCEPT
SELECT ID, randomSmallInt
FROM #table2) a
UNION ALL
SELECT 'Table2', ID, randomSmallInt
FROM (SELECT ID, randomSmallInt
FROM #table2
EXCEPT
SELECT ID, randomSmallInt
FROM #table1) b ) c
ORDER BY c.ID
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
========== UNION ALL ==========
(1799712 row(s) affected)
Table '#table1_____________________________________________________________________________________________________________0000000D3937'. Scan count 5, logical reads 2102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#table2_____________________________________________________________________________________________________________0000000D3938'. Scan count 5, logical reads 2102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6750 ms, elapsed time = 15694 ms.
========== EXCEPT ==========
(1799712 row(s) affected)
Table '#table1_____________________________________________________________________________________________________________0000000D3937'. Scan count 10, logical reads 4204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#table2_____________________________________________________________________________________________________________0000000D3938'. Scan count 10, logical reads 4204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 9704 ms, elapsed time = 17470 ms.
November 4, 2011 at 8:03 am
Cadavre I love how you add the million row test harnesses to your examples...
I may start adding that to some of my posts as well in the future.
Thanks buddy!
Lowell
November 4, 2011 at 8:05 am
Here's the situation I'm in:
I have the same 10 fields coming from two different systems, system1 and system2. I would like to do a comparison of all of the fields between the two systems to see if anything is different. Let's say out of 100 people from the two systems, 20 are different. I want to keep track of the fact that something between the two records is different, but there are 4 specific fields that are really important to me, and I need to go a step further and flag if any of those are different for the 20 people. Ultimately, I want to store 40 records in a log table (a system 1 record and a system 2 record for each person), with a flag for any records where any of those 4 important fields changed. My other challenge is that I want to ensure that I save the name of the system each record came from, which could pose a problem when using a union or except since that would instant differentiate each record.
Does that make sense?
Mike
Mike Scalise, PMP
https://www.michaelscalise.com
November 4, 2011 at 8:12 am
Lowell (11/4/2011)
Cadavre I love how you add the million row test harnesses to your examples...I may start adding that to some of my posts as well in the future.
Thanks buddy!
It's an idea I "borrowed" from Jeff 😀
mikes84 (11/4/2011)
Here's the situation I'm in:I have the same 10 fields coming from two different systems, system1 and system2. I would like to do a comparison of all of the fields between the two systems to see if anything is different. Let's say out of 100 people from the two systems, 20 are different. I want to keep track of the fact that something between the two records is different, but there are 4 specific fields that are really important to me, and I need to go a step further and flag if any of those are different for the 20 people. Ultimately, I want to store 40 records in a log table (a system 1 record and a system 2 record for each person), with a flag for any records where any of those 4 important fields changed. My other challenge is that I want to ensure that I save the name of the system each record came from, which could pose a problem when using a union or except since that would instant differentiate each record.
Does that make sense?
Mike
From my brief testing, the UNION method is slightly faster, because it only has to pass the tables once whereas to get the data you're asking for with EXCEPT would require two passes.
SELECT MIN(name) AS TableName, ID, randomSmallInt
FROM (SELECT '#table1' AS name, ID, randomSmallInt
FROM #table1
UNION ALL
SELECT '#table2' AS name, ID, randomSmallInt
FROM #table2) work
GROUP BY ID, randomSmallInt
HAVING COUNT(*) = 1
ORDER BY ID
Notice also that you can name each data-set in the UNION method (above I've called them "#table1" and "#table2").
Supply DDL and sample data and I'll code up an example based on your table structures.
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply