Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

  • Jonathan AC Roberts (6/12/2014)


    Normally when an example is given using values A, B and C they should be taken as algebraic variables who's contents can change.

    Agreed, it gives a solution that can be used in many situations.

    I tested Jeff's original solution and 2 other solutions on a 10-million-row random table (using the code attached in Jeff's article):

    DECLARE @a CHAR(1), @b-2 CHAR(1), @C CHAR(1)

    SELECT @a = 'X', @b-2 = 'F', @C = 'M'

    SET NOCOUNT ON

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    PRINT '

    ----- Jeff''s Original -----'

    SELECT CustomerID

    FROM #Purchase

    WHERE ProductCode IN (@A, @b-2)

    GROUP BY CustomerID

    HAVING COUNT(DISTINCT ProductCode) = 2

    EXCEPT

    SELECT CustomerID

    FROM #Purchase

    WHERE ProductCode = @C

    ---------------------------------

    PRINT '

    ----- NOT EXISTS -----'

    SELECT CustomerID

    FROM #Purchase P1

    WHERE ProductCode IN (@A, @b-2)

    AND NOT EXISTS (

    SELECT 1

    FROM #Purchase P2

    WHERE ProductCode = @C

    AND P1.CustomerID = P2.CustomerID

    )

    GROUP BY CustomerID

    HAVING COUNT(DISTINCT ProductCode) = 2

    ---------------------------------

    PRINT '

    ----- SUM(CASE...) -----'

    SELECT CustomerID

    FROM #Purchase P1

    WHERE ProductCode IN (@A, @b-2, @C)

    GROUP BY CustomerID

    HAVING SUM(CASE ProductCode WHEN @a THEN 1 ELSE 0 END) > 0

    AND SUM(CASE ProductCode WHEN @b-2 THEN 1 ELSE 0 END) > 0

    AND SUM(CASE ProductCode WHEN @C THEN 1 ELSE 0 END) = 0

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    SET NOCOUNT OFF

    Results (cleaned for readability):

    ----- Jeff's Original -----

    Table '#Purchase'. Scan count 49960, logical reads 151394.

    SQL Server Execution Times: CPU time = 406 ms, elapsed time = 412 ms.

    ----- NOT EXISTS -----

    Table '#Purchase'. Scan count 3, logical reads 2160.

    SQL Server Execution Times: CPU time = 296 ms, elapsed time = 288 ms.

    ----- SUM(CASE...) -----

    Table '#Purchase'. Scan count 3, logical reads 2160.

    SQL Server Execution Times: CPU time = 655 ms, elapsed time = 661 ms.

  • The Wizard Of Oz (6/13/2014)


    I tested Jeff's original solution and 2 other solutions on a 10-million-row random table (using the code attached in Jeff's article):

    They are not equivalent and return different result.

    COUNT(DISTINCT ...) = 2

    and

    SUM(CASE WHEN ... THEN ... ELSE ... END) > 0

    Will return different results.


    N 56°04'39.16"
    E 12°55'05.25"

  • SwePeso (6/13/2014)


    They are not equivalent and return different result.

    Hmm, I'm going to be sceptical for now because I haven't found a case where any of the 3 solutions brought back different results 🙂

    What settings did you use for @a, @b-2 and @C so I can replicate?

  • You don't need to.

    The COUNT(DISTINCT ...) approach will only return the groups that has exactly one A and one B.

    The SUM(CASE ...) approach will return all groups having at least one A and at least one B. It will also return the groups having 20 A's and 14 B's.


    N 56°04'39.16"
    E 12°55'05.25"

  • SwePeso (6/13/2014)


    The COUNT(DISTINCT ...) approach will only return the groups that has exactly one A and one B.

    Not sure about this, I thought the whole point of the DISTINCT is that it collapses any and all duplicates in the group, so it returns all groups having at least one A and at least one B, just like the SUM(CASE ...) approach.

    SwePeso (6/13/2014)


    The SUM(CASE ...) approach will return all groups having at least one A and at least one B. It will also return the groups having 20 A's and 14 B's.

    Yes, I thought this was exactly what we needed?

  • So, depending on the distribution of the sample data, the two different queries will return different results.

    The COUNT(DISTINCT ... ) will return a fewer number of rows, than the SUM(CASE ...) will.


    N 56°04'39.16"
    E 12°55'05.25"

  • I don't think so, because before that point we've already done our GROUP BY CustomerID, so really the 2 methods we're comparing are:

    ...

    WHERE ProductCode IN (@A, @b-2)

    --people that bought A or B or AB

    AND NOT EXISTS (

    SELECT 1

    FROM #Purchase P2

    WHERE ProductCode = @C

    AND P1.CustomerID = P2.CustomerID

    )

    --people that bought (A or B or AB) and (not C)

    GROUP BY CustomerID

    HAVING COUNT(DISTINCT ProductCode) = 2

    --people that bought (AB) and (not C)

    ...

    WHERE ProductCode IN (@A, @b-2, @C)

    --people that bought (A or B or C or AB or AC or BC or ABC)

    GROUP BY CustomerID

    HAVINGSUM(CASE ProductCode WHEN @a THEN 1 ELSE 0 END) > 0

    AND SUM(CASE ProductCode WHEN @b-2 THEN 1 ELSE 0 END) > 0

    AND SUM(CASE ProductCode WHEN @C THEN 1 ELSE 0 END) = 0

    --people that bought (AB) and (not C)

    Either I need more coffee today (99% likely), or you do Peso (1%).

    I'm eager to learn more about SQL either way 🙂

  • I'm finding that the results are the same for all 3 queries regardless of the input parameters and, as far as I can see, logically they should be.

    SSCrazy, Can you supply some example data to highlight what you are saying here?

    I'm pretty sceptical as well and want some evidence that what you are saying is correct.

  • As always, an excellent article, Jeff. Then again, we've come to expect nothing less.

  • No. You are right.

    I am the one needing more coffee. Just a mind lapse and disregarding the DISTINCT for some reason.


    N 56°04'39.16"
    E 12°55'05.25"

  • Ed Wagner (6/13/2014)


    As always, an excellent article, Jeff. Then again, we've come to expect nothing less.

    Thanks, Ed. Better than the article, though, look at the great discussions going on. Lotsa good people with good ideas. That's why I love this place. Ya just gotta love this community!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Do we have a readymade 10 million/1million row generator for this problem?

  • Now that we have new analytical functions, just wanted to jump in on the new windowed functions fun 🙂

    ; WITH CTE AS

    (

    SELECT P.CustomerID ,P.ProductCode

    , CODE = CASE WHEN ( P.ProductCode = 'B'

    AND LAG (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) = 'A'

    AND LEAD (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) <> 'C' ) THEN 1

    ELSE 0

    END

    FROM #Purchase P

    WHERE P.ProductCode in ('A','B', 'C')

    GROUP BY P.CustomerID ,P.ProductCode

    )

    SELECT CustomerID

    FROM CTE

  • ColdCoffee (6/13/2014)


    Now that we have new analytical functions, just wanted to jump in on the new windowed functions fun 🙂

    ; WITH CTE AS

    (

    SELECT P.CustomerID ,P.ProductCode

    , CODE = CASE WHEN ( P.ProductCode = 'B'

    AND LAG (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) = 'A'

    AND LEAD (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) <> 'C' ) THEN 1

    ELSE 0

    END

    FROM #Purchase P

    WHERE P.ProductCode in ('A','B', 'C')

    GROUP BY P.CustomerID ,P.ProductCode

    )

    SELECT CustomerID

    FROM CTE

    What's the performance look like compared to the other methods?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ColdCoffee (6/13/2014)


    Do we have a readymade 10 million/1million row generator for this problem?

    Heh... apparently, you didn't read the article where is says... 😉

    The code above isn't adequate for performance testing. For those that want to explore and compare solutions of their own, I've attached code to build a million row test table at the bottom of this article in the "Resources" link.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 151 through 165 (of 166 total)

You must be logged in to reply to this topic. Login to reply