Using EXCEPT to Determine Row Differences

  • I was going to report the same thing. The article should be updated to add rows to both tables, and to explain the usefulness of the joins, in restricting the except to only compare columns that exist ( as defined by the primary key ) in both tables.

  • Hi,

    I'm interested to learn whether this generates a high or low-cost queryplan?

    have you been comparing the entire dataset with a fresher copy each time, or just checking against a newly loaded subset?

    I'd guess that performance ought to be somewhat similar to join performance which forces a table lookup on both sides [to get the column data] ...

    clustering both tables over the pkey should be quickest, since ultimately it will have to check all the data across both tables.

    what's your performance experience been like?

  • To echo what other posters have said, this is a HUGE win, when comparing data. There have been countless times where I need to do exactly what you're describing, and this method is so unbelievably simple, it made my day.

    Cheers!

  • Just wanted to share this trick that I've used with a lot of success over the years:

    [font="Courier New"]

    SELECT 'In A not B' location,

    f1,

    f2,

    f3

    FROM a

    EXCEPT

    SELECT 'In A not B' location,

    f1,

    f2,

    f3

    FROM B

    UNION ALL

    SELECT 'In B not A' location,

    f1,

    f2,

    f3

    FROM B

    EXCEPT

    SELECT 'In B not A' location,

    f1,

    f2,

    f3

    FROM A

    UNION ALL

    (

    SELECT 'In Both' location,

    f1,

    f2,

    f3

    FROM B

    INTERSECT

    SELECT 'In Both' location,

    f1,

    f2,

    f3

    FROM A

    )

    [/font]

  • I have three tables I sometimes need to sync (development, acceptance and production) and use the following sort of query to find all differences. This is not quite what the author wanted which is differences for rows in both/all tables but could easily be modified to do so. Also note that my tables are relatively small and I have not had to worry about performance. Here is a query that compare constraints column usage across instances:

    select sum(src) as srcnum,

    case sum(src) when 1 then 'D__'

    when 2 then '_A_'

    when 3 then 'DA_'

    when 4 then '__P'

    when 5 then 'D_P'

    when 6 then '_AP'

    when 7 then 'DAP' end as src,

    a.TABLE_SCHEMA, a.TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

    from (

    select 1 src, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

    from Dev.information_schema.CONSTRAINT_COLUMN_USAGE

    union

    select 2 src, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

    from Acc.information_schema.CONSTRAINT_COLUMN_USAGE

    union

    select 4 src, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

    from Pro.information_schema.CONSTRAINT_COLUMN_USAGE

    ) a

    group by data_grp, a.TABLE_SCHEMA, a.TABLE_NAME, COLUMN_NAME, CONSTRAINT_SCHEMA, CONSTRAINT_NAME

    having sum(src) < 7

    order by data_grp, a.TABLE_SCHEMA, a.TABLE_NAME, COLUMN_NAME, CONSTRAINT_NAME, sum(src);

    For data table I typically simplify the inner union queries as 'select 1 src, *'. I know use of * is often discouraged. However, if it bombs because my columns are not in the same order or of different data types, that can be useful information as well.

  • victordba (3/26/2014)


    Thank you, @james.jensen1350 for your reply.

    I've been using MERGE for many years to do the upsert. And my source and target tables are large, with millions of records. So, MERGE takes it's sweet time to compare large tables to find the differences between those tables and to do upsert. But MERGE is not perfect, and you have to be very careful with it and also not to forget to take case of NULLs in every field. So, I am always on the lookout for some interesting and more efficient ways to do my upserts. And this article got my attention. I did some quick test comparisons between MERGE and EXCEPT to find changes between 2 tables with about 10 million records in each, and EXCEPT took about 35 seconds, while MERGE about 7 minutes. But I just ran EXCEPT once, to find changes in matching records. I would have to do a full test to also find new and missing records, so I would have to run EXCEPT 2 more times, and also run INSERT, UPDATE and DELETE commands when I find those differences. So, more results to follow. I was just wondering if anyone else tried some other ways to do UPSERTs besides MERGE command on large tables.

    @victordba - Did you ever prove whether the all-inclusive MERGE statement performed better or worse than all the separate statements needed when using the EXCEPT approach?

  • Regarding the query optimization, only the first join is needed, it ensures you are only taking the Source items that are also found on Target. The second one is not necessary.

    Actually, on this situation you could even avoid using joins and table aliases:

    SELECT ID

    , [Item]

    , [Price]

    , [OrderDate]

    , [Units]

    , [ShipmentDate]

    FROM [#Source]

    WHERE [ID] IN (SELECT [ID] FROM #Target)

    EXCEPT

    SELECT [ID]

    , [Item]

    , [Price]

    , [OrderDate]

    , [Units]

    , [ShipmentDate]

    FROM [#Target]

  • I don't know that where I could use the MERGE I would use EXCEPT. I've used EXCEPT a couple of times on permanent queries. The first was to take a list of thousands of phone numbers and add only the new ones to a list containing millions of phone numbers. The insert performed so quickly that I saw no need to try to find a different way to do it.

    I used EXCEPT to do the same thing and also found it very fast. This produced a source recordset that was then inserted into the table. As I only needed to insert as you can't update a phone number (a person's phone number yes, but not an actual phone number) I can't think that a MERGE would provide any advantage.

  • I have been using EXCEPT in ETL MERGE statements because it handles NULL values without the need for picking a suitable COALESCE default (TIME datatypes ?). I join the source and target rows using their primary key, and perform an EXISTS on the result of the EXCEPT statement between source and target. However I have found that for large tables, EXCEPT becomes less efficient. I can only assume this is because EXCEPT always compares all the column pairs in the the two rows, whereas I suspect individual ...OR src.ColA <> tgt.ColA... comparisons can return a result at the first TRUE condition without making any further comparisons.

  • Very nice. By using LEFT JOIN rather than INNER, the example also returns the unmatched rows.

    DELETE FROM #Target WHERE Item = 'Fig';

    GO

    INSERT INTO #Source

    ( [Item], [Price], [OrderDate], [Units], [ShipmentDate] )

    VALUES ( 'Prune', 2.49, '1/1/2001', NULL, '1/02/2001' );

    GO

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [#Source]

    LEFT JOIN [#Target] [T]

    ON .[ID] = [T].[ID]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [#Target] [T]

    LEFT JOIN [#Source]

    ON [T].[ID] = .[ID];

    returns 5 rows rather than the 3 from the example.

    There are no facts, only interpretations.
    Friedrich Nietzsche

  • I use this approach :

    MERGE statement and comparing columns with SET ANSI NULLS OFF.

    Any objections against using SET ANSI NULLS OFF ?

  • What do you think about using this in a trigger to find what has actually changed during a DML statement?

  • I scanned through the articles, but I may have missed it if someone discussed the "proper" way to do upserts with MERGE. I recommend http://www.made2mentor.com/2013/05/writing-t-sql-merge-statements-the-right-way/ , and for a lot of interesting details on query processor operators that aren't explicitly available in T-SQL you should read http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx . The upshot is that NOT EXISTS (SELECT S.* INTERSECT T.*) is a good way to check for row equivalence in a MERGE.

    When making changes to stored procedures or complicated queries, I frequently want to ensure that I'm not changing the result sets. In these situations, I'm not worried about duplicate rows, so EXCEPT works well. I generally use the following syntax for these sort of "diff" queries:

    WITH

    DSO AS (

    SELECT *

    FROM #Old

    ),

    DSN AS (

    SELECT *

    FROM #New

    )

    SELECT 'Old' AS Tbl, * FROM (SELECT * FROM DSO EXCEPT SELECT * FROM DSN) AS U1

    UNION

    SELECT 'New' AS Tbl, * FROM (SELECT * FROM DSN EXCEPT SELECT * FROM DSO) AS U2

    ORDER BY foo, Tbl;

    I use the above structure because if I need to ignore certain columns, I can do so in one place in the CTEs rather than having to repeat myself in the pair if EXCEPT sub-queries. In the ORDER BY, "foo" should be replaced by the candidate key columns to get changes to pair up nicely.

  • Does anyone have any information on the performance impact of using this method to create type 2 slowly changing dimensions versus using a merge statement?

  • I'm with the others that don't see a need for the join

    the query

    SELECT * FROM dbo.source

    except

    SELECT * FROM dbo.target

    gives me the same results! so? ...

    I was told to add a row to Source and try again...

    the extra row shows up in the results.... that is what you would want, yes/no?

    IDItemPriceOrderDateUnitsShipmentDate

    2Coconut0.992003-03-031.3500NULL

    3Eggplant22.442005-05-05NULL2005-05-06

    6Lychee0.292012-12-121000.55002012-12-14

    7GrApple2.222001-01-01NULL2001-01-02 <--this row in Source, that does not exist in Target showed up.

    So the inner join, does remove the GrApple row.

    example:

    SELECT * FROM dbo.source INNER JOIN dbo.target ON dbo.source.ID=dbo.target.id

    except

    SELECT * FROM dbo.target INNER JOIN dbo.source ON dbo.source.ID=dbo.target.id

    returns just the 3 rows:

    2Coconut0.992003-03-031.3500NULL

    3Eggplant22.442005-05-05NULL2005-05-06

    6Lychee0.292012-12-121000.55002012-12-14

    so the point is that you are only looking for rows that are DIFFERENT, that should be the same, excluding rows with new ID's which have been produced in the production environment.... but you are still able to verify that your copy is accurate, as far as what you have.....

    yes?

    Thanks!

    Peter A.R. Johnson

    DBA / Data Therapist / Federal Way Public Schools

Viewing 15 posts - 46 through 60 (of 72 total)

You must be logged in to reply to this topic. Login to reply