T-SQL queries to simulate INTERSECT and EXCEPT but with records in the same table

  • Hi everyone,

    Here's the scenario. I have a table (let's call it MyTable) that consists of four fields: Id, Source, FirstField, and SecondField, where Source only takes one of two values: Source1 and Source2.

    The records in this table look as follows:

    I need to return, using 3 different T-SQL queries:

    1) Products that exist only in Source2 (in red above)

    2) Products that exist only in Source1 (in green above)

    3) Products that exist both in Source1 and Source2 (in black above)

    For 1) so far I've been doing something along the lines of

    SELECT * FROM MyTable WHERE Source=Source1 AND FirstField NOT IN (SELECT DISTINCT(FirstField) FROM MyTable WHERE Source=Source2)

    Not being a T-SQL expert myself, I'm wondering if this is the right or more efficient way to go. I have read about INTERSECT and EXCEPT, but I am a little unclear if they could be applied in this case out of the box.

    I hope I've explained myself, and would appreciate any tips or suggestions anyone can share. Thanks in advance.

  • gacanepa (6/4/2015)


    Hi everyone,

    Here's the scenario. I have a table (let's call it MyTable) that consists of four fields: Id, Source, FirstField, and SecondField, where Source only takes one of two values: Source1 and Source2.

    The records in this table look as follows:

    I need to return, using 3 different T-SQL queries:

    1) Products that exist only in Source2 (in red above)

    2) Products that exist only in Source1 (in green above)

    3) Products that exist both in Source1 and Source2 (in black above)

    For 1) so far I've been doing something along the lines of

    SELECT * FROM MyTable WHERE Source=Source1 AND FirstField NOT IN (SELECT DISTINCT(FirstField) FROM MyTable WHERE Source=Source2)

    Not being a T-SQL expert myself, I'm wondering if this is the right or more efficient way to go. I have read about INTERSECT and EXCEPT, but I am a little unclear if they could be applied in this case out of the box.

    I hope I've explained myself, and would appreciate any tips or suggestions anyone can share. Thanks in advance.

    First of all, never use DISTINCT when you're constructing a sub-query result to the right of IN. It is not a high performance thing to do because IN will operate just as fast whether there's duplicates in the set or not.

    Edit: Sorry if that sounded like a rant but I see it way too many times and it is one of my pet peeves.

    Otherwise, you're on the right track to use EXCEPT and INTERSECT.

    -- Red

    SELECT *

    FROM MyTable

    WHERE Source='Source1'

    EXCEPT

    SELECT *

    FROM MyTable

    WHERE Source='Source2';

    -- Green

    SELECT *

    FROM MyTable

    WHERE Source='Source2'

    EXCEPT

    SELECT *

    FROM MyTable

    WHERE Source='Source1';

    -- Black

    SELECT *

    FROM MyTable

    WHERE Source='Source2'

    INTERSECT

    SELECT *

    FROM MyTable

    WHERE Source='Source1';


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Thank you for taking the time to write! I believe your suggestion has definitely put me on the right track.

    However, let me ask you a further question.

    If I do either an EXCEPT or an INTERSECT between the results of 2 SELECT * queries in this case, the Id column will affect the overall result as there are no records with the same Id value.

    So I though of doing adding the relevant columns only in the SELECT statements, but in that case the performance is worse (the query takes 28 secs) than before (~15 secs) and causes ~2M logical reads:

    Scan count 2, logical reads 2047535, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Any ideas?

  • gacanepa (6/5/2015)


    Thank you for taking the time to write! I believe your suggestion has definitely put me on the right track.

    However, let me ask you a further question.

    If I do either an EXCEPT or an INTERSECT between the results of 2 SELECT * queries in this case, the Id column will affect the overall result as there are no records with the same Id value.

    So I though of doing adding the relevant columns only in the SELECT statements, but in that case the performance is worse (the query takes 28 secs) than before (~15 secs) and causes ~2M logical reads:

    Scan count 2, logical reads 2047535, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Any ideas?

    Let me know how the following works:

    DECLARE @MyTable AS TABLE (

    Id int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,

    [Source] varchar(10) NOT NULL,

    FirstField varchar(20),

    SecondField varchar(30)

    );

    INSERT INTO @MyTable ([Source], FirstField, SecondField)

    SELECT 'Source1' AS [Source], 'Product 3 name' AS FirstField, 'Product 3 description' AS SecondField UNION ALL

    SELECT 'Source2', 'Product 1 name', 'Product 1 description' UNION ALL

    SELECT 'Source1', 'Product 1 name', 'Product 1 description' UNION ALL

    SELECT 'Source2', 'Product 4 name', 'Product 4 description' UNION ALL

    SELECT 'Source1', 'Product 5 name', 'Product 5 description' UNION ALL

    SELECT 'Source2', 'Product 6 name', 'Product 6 description';

    SELECT MT.*

    FROM @MyTable AS MT

    LEFT OUTER JOIN @MyTable AS MT2

    ON MT.FirstField = MT2.FirstField

    AND MT.[Source] <> MT2.[Source]

    WHERE MT.[Source] = 'Source2'

    AND MT2.FirstField IS NULL

    ORDER BY MT.[Source], MT.FirstField;

    SELECT MT.*

    FROM @MyTable AS MT

    LEFT OUTER JOIN @MyTable AS MT2

    ON MT.FirstField = MT2.FirstField

    AND MT.[Source] <> MT2.[Source]

    WHERE MT.[Source] = 'Source1'

    AND MT2.FirstField IS NULL

    ORDER BY MT.[Source], MT.FirstField;

    SELECT MT.*

    FROM @MyTable AS MT

    INNER JOIN @MyTable AS MT2

    ON MT.FirstField = MT2.FirstField

    AND MT.[Source] <> MT2.[Source]

    ORDER BY MT.[Source], MT.FirstField;

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Just another way to do it as well:

    DECLARE @MyTable AS TABLE (

    Id int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,

    [Source] varchar(10) NOT NULL,

    FirstField varchar(20),

    SecondField varchar(30)

    );

    INSERT INTO @MyTable ([Source], FirstField, SecondField)

    SELECT 'Source1' AS [Source], 'Product 3 name' AS FirstField, 'Product 3 description' AS SecondField UNION ALL

    SELECT 'Source2', 'Product 1 name', 'Product 1 description' UNION ALL

    SELECT 'Source1', 'Product 1 name', 'Product 1 description' UNION ALL

    SELECT 'Source2', 'Product 4 name', 'Product 4 description' UNION ALL

    SELECT 'Source1', 'Product 5 name', 'Product 5 description' UNION ALL

    SELECT 'Source2', 'Product 6 name', 'Product 6 description';

    with Source1Only as (

    select

    FirstField, SecondField

    from

    @MyTable

    where

    [Source] = 'Source1'

    except

    select

    FirstField, SecondField

    from

    @MyTable

    where

    [Source] = 'Source2'

    )

    select mt.Source, mt.FirstField, mt.SecondField from @MyTable mt where exists(select 1 from Source1Only so where so.FirstField = mt.FirstField and so.SecondField = mt.SecondField)

    order by mt.Source, mt.FirstField, mt.SecondField;

    with Source2Only as (

    select

    FirstField, SecondField

    from

    @MyTable

    where

    [Source] = 'Source2'

    except

    select

    FirstField, SecondField

    from

    @MyTable

    where

    [Source] = 'Source1'

    )

    select mt.Source, mt.FirstField, mt.SecondField from @MyTable mt where exists(select 1 from Source2Only so where so.FirstField = mt.FirstField and so.SecondField = mt.SecondField)

    order by mt.Source, mt.FirstField, mt.SecondField;

    with CommonItems as (

    select

    FirstField, SecondField

    from

    @MyTable

    where

    [Source] = 'Source1'

    intersect

    select

    FirstField, SecondField

    from

    @MyTable

    where

    [Source] = 'Source2'

    )

    select mt.Source, mt.FirstField, mt.SecondField from @MyTable mt where exists(select 1 from CommonItems ci where ci.FirstField = mt.FirstField and ci.SecondField = mt.SecondField)

    order by mt.Source, mt.FirstField, mt.SecondField;

  • Is there some reason you want to use multiple queries? One query can do this just fine. I've properly clustered the table to match lookup requirements.

    DECLARE @MyTable AS TABLE (

    Id int IDENTITY(1, 1) NOT NULL,

    [Source] varchar(10) NOT NULL,

    FirstField varchar(20),

    SecondField varchar(30),

    UNIQUE CLUSTERED ( [Source], FirstField )

    );

    SET NOCOUNT ON

    INSERT INTO @MyTable ([Source], FirstField, SecondField)

    SELECT 'Source1' AS [Source], 'Product 3 name' AS FirstField, 'Product 3 description' AS SecondField UNION ALL

    SELECT 'Source2', 'Product 1 name', 'Product 1 description' UNION ALL

    SELECT 'Source1', 'Product 1 name', 'Product 1 description' UNION ALL

    SELECT 'Source2', 'Product 4 name', 'Product 4 description' UNION ALL

    SELECT 'Source1', 'Product 5 name', 'Product 5 description' UNION ALL

    SELECT 'Source2', 'Product 6 name', 'Product 6 description';

    SET NOCOUNT OFF

    SELECT Id, Source, FirstField, SecondField,

    CASE WHEN Source = 'Source1'

    THEN CASE WHEN EXISTS(SELECT 1 FROM @MyTable mt2 WHERE mt2.Source = 'Source2' AND mt2.FirstField = mt.FirstField)

    THEN 'Both' ELSE 'Source1 Only' END

    ELSE CASE WHEN EXISTS(SELECT 1 FROM @MyTable mt2 WHERE mt2.Source = 'Source1' AND mt2.FirstField = mt.FirstField)

    THEN 'Both' ELSE 'Source2 Only' END

    END AS Matching_State

    FROM @MyTable mt

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply