Comparison between two tables

  • I haven't really explored this in any great depth, so I thought I'd put it out there. In general, when comparing field values across two tables to identify and return records with differences, is it better to:

    1) Join the tables and in either the JOIN or WHERE clause add criteria to see where table1.field1 <> table2.field1, etc.

    OR

    2) Select all fields to compare from table 1

    UNION ALL

    Select all fields to compare from table 2

    Then group by the fields, do a count(*) and then where count(*) > 1 you know that the records were the same, so you can discard them.

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • i think you want to look at the EXCEPT operator;

    if all the columns are the same, it might look like this:

    --find what doesn't match from Table2

    SELECT PK,ColumnList FROM Table1

    EXCEPT

    SELECT PK,ColumnList FROM Table2

    SELECT PK,ColumnList FROM Table2

    EXCEPT

    SELECT PK,ColumnList FROM Table1

    now, if the Columns to compare might match, but the PK is diferent, you just include the columns you want to specifically compare:

    --find what doesn't match from Table2

    SELECT ColumnList FROM Table1

    EXCEPT

    SELECT ColumnList FROM Table2

    SELECT ColumnList FROM Table2

    EXCEPT

    SELECT ColumnList FROM Table1

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • I don't know about "better", but generally when I'm comparing two algorithms that should produce the same result-set I'd do something like this: -

    SELECT MIN(name) AS TableName, --all columns here

    FROM (SELECT 'First table name' AS name, --all columns here

    FROM --table name here

    UNION ALL

    SELECT 'Second table name' AS name, --all columns here

    FROM --table name here

    ) work

    GROUP BY --all columns here

    HAVING COUNT(*) = 1

    I'd implement it like this: -

    IF object_id('tempdb..#table1') IS NOT NULL

    BEGIN

    DROP TABLE #table1

    END

    IF object_id('tempdb..#table2') IS NOT NULL

    BEGIN

    DROP TABLE #table2

    END

    SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

    (ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt

    INTO #table1

    FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

    SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

    (ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt

    INTO #table2

    FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

    --==Compare result-sets

    SELECT MIN(name) AS TableName, ID, randomSmallInt

    FROM (SELECT '#table1' AS name, ID, randomSmallInt

    FROM #table1

    UNION ALL

    SELECT '#table2' AS name, ID, randomSmallInt

    FROM #table2) work

    GROUP BY ID, randomSmallInt

    HAVING COUNT(*) = 1

    ORDER BY ID

    Bear in mind that due to me using random numbers between 1 and 10, you'll probably over a million results returned if you execute the above SQL.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Lowell (11/4/2011)


    i think you want to look at the EXCEPT operator;

    if all the columns are the same, it might look like this:

    --find what doesn't match from Table2

    SELECT PK,ColumnList FROM Table1

    EXCEPT

    SELECT PK,ColumnList FROM Table2

    SELECT PK,ColumnList FROM Table2

    EXCEPT

    SELECT PK,ColumnList FROM Table1

    now, if the Columns to compare might match, but the PK is diferent, you just include the columns you want to specifically compare:

    --find what doesn't match from Table2

    SELECT ColumnList FROM Table1

    EXCEPT

    SELECT ColumnList FROM Table2

    SELECT ColumnList FROM Table2

    EXCEPT

    SELECT ColumnList FROM Table1

    Lowell,

    So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):

    Table 1 fields

    -------------

    Table 2 fields

    is usually better than this type where it's all one row (my first example):

    Table 1 fields | Table 2 fields

    And that using the EXCEPT clause makes the most sense?

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • mikes84 (11/4/2011)[hrLowell,

    So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):

    Table 1 fields

    -------------

    Table 2 fields

    is usually better than this type where it's all one row (my first example):

    Table 1 fields | Table 2 fields

    And that using the EXCEPT clause makes the most sense?

    I'll go to the SQL fallback answer: it depends.

    if you just want to know what items are duplicated, yours and cadavre's answer will work; my example identifies items that are differnet or missing between the two tables...

    so when i saw "compare two tables", that was my gut reaction...it's a pretty easy to use operator as well, so that's helpful

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (11/4/2011)


    mikes84 (11/4/2011)[hrLowell,

    So are you saying that this type of comparison, row over row (as in the EXCEPT clause and my second example):

    Table 1 fields

    -------------

    Table 2 fields

    is usually better than this type where it's all one row (my first example):

    Table 1 fields | Table 2 fields

    And that using the EXCEPT clause makes the most sense?

    I'll go to the SQL fallback answer: it depends.

    if you just want to know what items are duplicated, yours and cadavre's answer will work; my example identifies items that are differnet or missing between the two tables...

    so when i saw "compare two tables", that was my gut reaction...it's a pretty easy to use operator as well, so that's helpful

    I'm probably over-thinking this, so could you take a look and let me know what I'm doing wrong?

    If I wanted to identify that items that were missing/different between the two tables and put them into one result-set, my instinct using EXCEPT would be like this: -

    SELECT * FROM (

    SELECT 'Table1' AS TableName, ID, randomSmallInt

    FROM (SELECT ID, randomSmallInt

    FROM #table1

    EXCEPT

    SELECT ID, randomSmallInt

    FROM #table2) a

    UNION ALL

    SELECT 'Table2', ID, randomSmallInt

    FROM (SELECT ID, randomSmallInt

    FROM #table2

    EXCEPT

    SELECT ID, randomSmallInt

    FROM #table1) b ) c

    ORDER BY c.ID

    This method doubles the IO that the method I posted requires, so I'm sure I'm heading down the wrong path.

    Here's my test environment: -

    IF object_id('tempdb..#table1') IS NOT NULL

    BEGIN

    DROP TABLE #table1

    END

    IF object_id('tempdb..#table2') IS NOT NULL

    BEGIN

    DROP TABLE #table2

    END

    SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

    (ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt

    INTO #table1

    FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

    SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

    (ABS(CHECKSUM(NEWID())) % 10) + 1 AS randomSmallInt

    INTO #table2

    FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

    PRINT '========== UNION ALL =========='

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SELECT MIN(name) AS TableName, ID, randomSmallInt

    FROM (SELECT '#table1' AS name, ID, randomSmallInt

    FROM #table1

    UNION ALL

    SELECT '#table2' AS name, ID, randomSmallInt

    FROM #table2) work

    GROUP BY ID, randomSmallInt

    HAVING COUNT(*) = 1

    ORDER BY ID

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    PRINT '========== EXCEPT =========='

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SELECT * FROM (

    SELECT 'Table1' AS TableName, ID, randomSmallInt

    FROM (SELECT ID, randomSmallInt

    FROM #table1

    EXCEPT

    SELECT ID, randomSmallInt

    FROM #table2) a

    UNION ALL

    SELECT 'Table2', ID, randomSmallInt

    FROM (SELECT ID, randomSmallInt

    FROM #table2

    EXCEPT

    SELECT ID, randomSmallInt

    FROM #table1) b ) c

    ORDER BY c.ID

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    ========== UNION ALL ==========

    (1799712 row(s) affected)

    Table '#table1_____________________________________________________________________________________________________________0000000D3937'. Scan count 5, logical reads 2102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#table2_____________________________________________________________________________________________________________0000000D3938'. Scan count 5, logical reads 2102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 6750 ms, elapsed time = 15694 ms.

    ========== EXCEPT ==========

    (1799712 row(s) affected)

    Table '#table1_____________________________________________________________________________________________________________0000000D3937'. Scan count 10, logical reads 4204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#table2_____________________________________________________________________________________________________________0000000D3938'. Scan count 10, logical reads 4204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 9704 ms, elapsed time = 17470 ms.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Cadavre I love how you add the million row test harnesses to your examples...

    I may start adding that to some of my posts as well in the future.

    Thanks buddy!

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Here's the situation I'm in:

    I have the same 10 fields coming from two different systems, system1 and system2. I would like to do a comparison of all of the fields between the two systems to see if anything is different. Let's say out of 100 people from the two systems, 20 are different. I want to keep track of the fact that something between the two records is different, but there are 4 specific fields that are really important to me, and I need to go a step further and flag if any of those are different for the 20 people. Ultimately, I want to store 40 records in a log table (a system 1 record and a system 2 record for each person), with a flag for any records where any of those 4 important fields changed. My other challenge is that I want to ensure that I save the name of the system each record came from, which could pose a problem when using a union or except since that would instant differentiate each record.

    Does that make sense?

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Lowell (11/4/2011)


    Cadavre I love how you add the million row test harnesses to your examples...

    I may start adding that to some of my posts as well in the future.

    Thanks buddy!

    It's an idea I "borrowed" from Jeff 😀

    mikes84 (11/4/2011)


    Here's the situation I'm in:

    I have the same 10 fields coming from two different systems, system1 and system2. I would like to do a comparison of all of the fields between the two systems to see if anything is different. Let's say out of 100 people from the two systems, 20 are different. I want to keep track of the fact that something between the two records is different, but there are 4 specific fields that are really important to me, and I need to go a step further and flag if any of those are different for the 20 people. Ultimately, I want to store 40 records in a log table (a system 1 record and a system 2 record for each person), with a flag for any records where any of those 4 important fields changed. My other challenge is that I want to ensure that I save the name of the system each record came from, which could pose a problem when using a union or except since that would instant differentiate each record.

    Does that make sense?

    Mike

    From my brief testing, the UNION method is slightly faster, because it only has to pass the tables once whereas to get the data you're asking for with EXCEPT would require two passes.

    SELECT MIN(name) AS TableName, ID, randomSmallInt

    FROM (SELECT '#table1' AS name, ID, randomSmallInt

    FROM #table1

    UNION ALL

    SELECT '#table2' AS name, ID, randomSmallInt

    FROM #table2) work

    GROUP BY ID, randomSmallInt

    HAVING COUNT(*) = 1

    ORDER BY ID

    Notice also that you can name each data-set in the UNION method (above I've called them "#table1" and "#table2").

    Supply DDL and sample data and I'll code up an example based on your table structures.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply