Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

  • I see that most examples are accessing the same table more than once (IN/NOT IN, EXISTS/NOT EXISTS or JOINs). I wonder if it's faster than acessing the table only once and then, using HAVING to SUM() three times.

    Is there any particular reason to avoid NOT IN or NOT EXISTS in this case?

    Best regards,

    Andre Guerreiro Neto

    Database Analyst
    http://www.softplan.com.br
    MCITPx1/MCTSx2/MCSE/MCSA

  • Great Article! Thanks.

  • ColdCoffee (3/28/2012)


    tommyh (3/28/2012)


    Wouldnt something like this work just as well?

    SELECT

    CustomerId

    FROM #Purchase

    GROUP BY CustomerID

    having sum(case when ProductCode = 'A' then 1 else 0 end) > 0

    and sum(case when ProductCode = 'B' then 1 else 0 end) > 0

    and sum(case when ProductCode = 'C' then 1 else 0 end) = 0

    /T

    Yep, this is how i once implemented it! Cant remember the thread now.

    Here it is: http://www.sqlservercentral.com/Forums/FindPost1267224.aspx

    Proved to elimate the EXCEPT part; also does a single scan on the table.

    Was going to say that one as well. And the stats come up better when I test. Also, becuase the having clause meets the 3 required conditions you don't need the WHERE clause so tweak it there.

  • Nice, simple, and informative (especially the EXCEPT which I never use). the alternatives were also interesting. 🙂

  • Antares686 (3/29/2012)


    ColdCoffee (3/28/2012)


    tommyh (3/28/2012)


    Wouldnt something like this work just as well?

    SELECT

    CustomerId

    FROM #Purchase

    GROUP BY CustomerID

    having sum(case when ProductCode = 'A' then 1 else 0 end) > 0

    and sum(case when ProductCode = 'B' then 1 else 0 end) > 0

    and sum(case when ProductCode = 'C' then 1 else 0 end) = 0

    /T

    Yep, this is how i once implemented it! Cant remember the thread now.

    Here it is: http://www.sqlservercentral.com/Forums/FindPost1267224.aspx

    Proved to elimate the EXCEPT part; also does a single scan on the table.

    Was going to say that one as well. And the stats come up better when I test. Also, becuase the having clause meets the 3 required conditions you don't need the WHERE clause so tweak it there.

    Interesting fact here which can be explained by filtering of data I need by keeping the WHERE clause has a big impact on the query.

    WHERE ProductCode In ('A','B','C')

    Also, of note I tried the EXCEPT version and the EXISTS/NOT EXISTS using the process to generate a lot more rows but did not add the index on purpose and the EXCEPT and HAVING both came in much better over EXISTS, which complained in the Execution plan about a missing index. The other two did not complain.

  • Or you could do

    Select Distinct ABuyers.CustomerID from

    (Select CustomerID FROM #Purchase WHERE ProductCode = 'A') ABuyers

    JOIN

    (Select CustomerID FROM #Purchase WHERE ProductCode = 'B') BBuyers

    ON ABuyers.CustomerID = BBuyers.CustomerID

    LEFT JOIN

    (Select CustomerID FROM #Purchase WHERE ProductCode = 'C') CBuyers

    ON ABuyers.CustomerID = CBuyers.CustomerID

    WHERE CBuyers.CustomerID is null

  • Nice topic Jeff. Thanks for the article. I have seen this type of query written using the having + sums method, left join and exists/not exists methods but have never tested which one is best. I personally like the Exists/Not Exists method just for readability even though it sounds like it is less efficient than the having method unless the extra index is applied. I always forget about intersect and except as it seems like those methods are about the same as exists/not exists.

  • I've been faced with this problem many times and never adopted a standard methodology for dealing it with but am quite sure that my attemps were RBAR.

    Would this approach be considered set based?

    (please don't yell at me if my question was covered in an previous post. I didn't read all of them.)

  • jshahan (3/29/2012)


    I've been faced with this problem many times and never adopted a standard methodology for dealing it with but am quite sure that my attemps were RBAR.

    Would this approach be considered set based?

    (please don't yell at me if my question was covered in an previous post. I didn't read all of them.)

    Always confirm your standard approach, I have found on occasions that one approach may perform great and in another case a variant query works better. So this is one more tool for my options to consider.

  • Thanks, Jeff.

    I thought my standard pattern of simple joins would do about as well:

    SELECT DISTINCT p.CustomerID

    FROM

    #Purchase p

    JOIN #Purchase p2 ON

    p.CustomerID = p2.CustomerID AND

    p2.ProductCode = 'B'

    LEFT JOIN #Purchase p3 ON

    p.CustomerID = p3.CustomerID AND

    p3.ProductCode = 'C'

    WHERE

    p.ProductCode = 'A' AND

    p3.CustomerID IS NULL

    Indeed, at 1 million rows performance is similar:

  • Jeff's solution from article: CPU time = 90 ms, elapsed time = 218 ms.
  • My "JOIN" solution above: CPU time = 100 ms, elapsed time = 289 ms.
  • "HAVING SUM" solution suggested by others: CPU time = 711 ms, elapsed time = 954 ms.
  • But at 10 million rows there is a big difference:

  • Jeff's solution from article: CPU time = 701 ms, elapsed time = 911 ms
  • My "JOIN" solution above: CPU time = 4036 ms, elapsed time = 4293 ms.
  • "HAVING SUM" solution suggested by others: CPU time = 6249 ms, elapsed time = 6532 ms.
  • Thanks as always for sharing your knowledge.

  • How about this?

    ; WITH CTE AS

    (

    SELECT P.CustomerID

    , Indicator1 = CASE WHEN P.ProductCode IN('A') THEN 0

    WHEN P.ProductCode IN('C') THEN -1

    END

    , Indicator2 = CASE WHEN P.ProductCode IN('B') THEN 1

    ELSE 0

    END

    FROM #Purchase P

    WHERE P.ProductCode IN ('A','B','C')

    )

    SELECT C.CustomerID

    FROM CTE C

    GROUP BY C.CustomerID

    HAVING SUM (C.Indicator1) = 0 AND SUM(C.Indicator2) > 0

    On a million row table, i get this:

    SQL Server Execution Times:

    CPU time = 93 ms, elapsed time = 294 ms.

    On ten million row, i get this:

    SQL Server Execution Times:

    CPU time = 888 ms, elapsed time = 526 ms.

    {Edit 1 : added ten million statistics}

    {Edit 2 - Removing P.ProductCode, which by the way is not required,from CTE cuts down extra scan}

    {Edit 3: Fixed the bug noted by Toby Harman}

  • {message deleted}

  • Hate to point this out, but ColdCoffee's approach is returning inaccurate results.

    Specifically, it returns customers 6 and 7 from the original test data.

  • Toby Harman (3/29/2012)


    Hate to point this out, but ColdCoffee's approach is returning inaccurate results.

    Specifically, it returns customers 6 and 7 from the original test data.

    That proves why my code was fast 😀 Thanks Toby.

    Edited the code; which runs a touch slower than Jeff's

  • mark hutchinson (3/28/2012)


    @Jeff

    Did you compare the speed of this query against one where your first (Group By) From clause is a Select Distinct (sub) query?

    Nice article. It builds nicely for the reader.

    I love seeing the Except clause being introduced to the reader who, like me, cut our teeth on SQL92 and need to learn newer language features to be more productive.

    Thanks for the feedback, Mark.

    To answer your question, I guess I'd need to see the code example you're talking about because I'm not sure that a SELECT DISTINCT would actually work here.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Viewing 15 posts - 31 through 45 (of 166 total)

    You must be logged in to reply to this topic. Login to reply