Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

Question

Jeff's solution from article: CPU time = 90 ms, elapsed time = 218 ms.
My "JOIN" solution above: CPU time = 100 ms, elapsed time = 289 ms.
"HAVING SUM" solution suggested by others: CPU time = 711 ms, elapsed time = 954 ms.

But at 10 million rows there is a big difference:

Jeff's solution from article: CPU time = 701 ms, elapsed time = 911 ms
My "JOIN" solution above: CPU time = 4036 ms, elapsed time = 4293 ms.
"HAVING SUM" solution suggested by others: CPU time = 6249 ms, elapsed time = 6532 ms.
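For anyone reproducing numbers like these: figures in the "CPU time / elapsed time" form are what SQL Server prints when SET STATISTICS TIME is on. A minimal sketch of how such a measurement might be taken, assuming a small stand-in table (#PurchaseDemo and its rows are hypothetical, not the article's test data):

```sql
-- Hypothetical stand-in for the article's #Purchase table.
IF OBJECT_ID('tempdb..#PurchaseDemo') IS NOT NULL DROP TABLE #PurchaseDemo;
CREATE TABLE #PurchaseDemo (CustomerID INT NOT NULL, ProductCode CHAR(1) NOT NULL);
INSERT INTO #PurchaseDemo (CustomerID, ProductCode)
VALUES (1,'A'),(1,'B'),(2,'A'),(2,'B'),(2,'C'),(3,'B'),(4,'A');

SET STATISTICS TIME ON;   -- timings are reported alongside the query messages

-- Query under test: here, the single-scan HAVING version from this thread.
SELECT CustomerID
FROM #PurchaseDemo
WHERE ProductCode IN ('A','B','C')
GROUP BY CustomerID
HAVING SUM(CASE WHEN ProductCode = 'A' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN ProductCode = 'B' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN ProductCode = 'C' THEN 1 ELSE 0 END) = 0;

SET STATISTICS TIME OFF;
```

Timings vary between runs and machines, so it is worth running each candidate several times before comparing.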

Thanks as always for sharing your knowledge.

Andre Guerreiro SSCertifiable Points: 7319 More actions · Answer 1

I see that most examples access the same table more than once (IN/NOT IN, EXISTS/NOT EXISTS or JOINs). I wonder whether that is faster than accessing the table only once and then using HAVING with three SUM() expressions.

Is there any particular reason to avoid NOT IN or NOT EXISTS in this case?

Best regards,

Andre Guerreiro Neto

Database Analyst
http://www.softplan.com.br
MCITPx1/MCTSx2/MCSE/MCSA
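The multi-access pattern asked about above can be sketched roughly as follows. This is an illustration only, not the article's code; #PurchaseDemo and its rows are hypothetical. One general point on the NOT IN question: if the subquery behind a NOT IN can return a NULL, the predicate never evaluates to true and no rows come back, which NOT EXISTS avoids.

```sql
-- Hypothetical stand-in for the article's #Purchase table.
IF OBJECT_ID('tempdb..#PurchaseDemo') IS NOT NULL DROP TABLE #PurchaseDemo;
CREATE TABLE #PurchaseDemo (CustomerID INT NOT NULL, ProductCode CHAR(1) NOT NULL);
INSERT INTO #PurchaseDemo (CustomerID, ProductCode)
VALUES (1,'A'),(1,'B'),(2,'A'),(2,'B'),(2,'C'),(3,'B'),(4,'A');

-- Start from 'A' buyers, require a 'B' purchase, and reject any 'C' purchase.
SELECT DISTINCT a.CustomerID
FROM #PurchaseDemo a
WHERE a.ProductCode = 'A'
  AND EXISTS (SELECT 1
              FROM #PurchaseDemo b
              WHERE b.CustomerID = a.CustomerID
                AND b.ProductCode = 'B')
  AND NOT EXISTS (SELECT 1
                  FROM #PurchaseDemo c
                  WHERE c.CustomerID = a.CustomerID
                    AND c.ProductCode = 'C');
-- With the sample rows above, only CustomerID 1 qualifies.
```

Each correlated subquery is a separate access to the table, which is why indexing matters more for this form than for the single-scan HAVING version.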

slsmithstone SSC-Addicted Points: 431 More actions · Answer 2

March 29, 2012 at 10:40 am

#1466492

Great Article! Thanks.

Antares686 SSC Guru Points: 125444 More actions · Answer 3

ColdCoffee (3/28/2012)
tommyh (3/28/2012)
Wouldn't something like this work just as well?
SELECT CustomerID
FROM #Purchase
GROUP BY CustomerID
HAVING SUM(CASE WHEN ProductCode = 'A' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN ProductCode = 'B' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN ProductCode = 'C' THEN 1 ELSE 0 END) = 0
/T
Yep, this is how I once implemented it! Can't remember the thread now.
Here it is: http://www.sqlservercentral.com/Forums/FindPost1267224.aspx
Proved to eliminate the EXCEPT part; also does a single scan on the table.

Was going to say that one as well. And the stats come up better when I test. Also, because the HAVING clause covers the 3 required conditions, you don't need the WHERE clause, so tweak it there.

msmithson Ten Centuries Points: 1171 More actions · Answer 4

Nice, simple, and informative (especially the EXCEPT, which I never use). The alternatives were also interesting.

Antares686 SSC Guru Points: 125444 More actions · Answer 5

Antares686 (3/29/2012)
ColdCoffee (3/28/2012)
tommyh (3/28/2012)
Wouldn't something like this work just as well?
SELECT CustomerID
FROM #Purchase
GROUP BY CustomerID
HAVING SUM(CASE WHEN ProductCode = 'A' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN ProductCode = 'B' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN ProductCode = 'C' THEN 1 ELSE 0 END) = 0
/T
Yep, this is how I once implemented it! Can't remember the thread now.
Here it is: http://www.sqlservercentral.com/Forums/FindPost1267224.aspx
Proved to eliminate the EXCEPT part; also does a single scan on the table.
Was going to say that one as well. And the stats come up better when I test. Also, because the HAVING clause covers the 3 required conditions, you don't need the WHERE clause, so tweak it there.

Interesting fact here: keeping the WHERE clause has a big impact on the query, which can be explained by it filtering the data down to only the rows I need.

WHERE ProductCode In ('A','B','C')

Also of note: I tried the EXCEPT version and the EXISTS/NOT EXISTS version, using the process to generate a lot more rows but deliberately not adding the index. EXCEPT and HAVING both came in much better than EXISTS, which complained in the execution plan about a missing index. The other two did not complain.
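Put together, the HAVING query with the WHERE filter kept might look like the sketch below. The table name #PurchaseDemo and the sample rows are hypothetical stand-ins, not the article's test data; customer 5 is included only to show rows that the filter discards before grouping.

```sql
-- Hypothetical stand-in for the article's #Purchase table.
IF OBJECT_ID('tempdb..#PurchaseDemo') IS NOT NULL DROP TABLE #PurchaseDemo;
CREATE TABLE #PurchaseDemo (CustomerID INT NOT NULL, ProductCode CHAR(1) NOT NULL);
INSERT INTO #PurchaseDemo (CustomerID, ProductCode)
VALUES (1,'A'),(1,'B'),(1,'D'),   -- A and B, no C: qualifies
       (2,'A'),(2,'B'),(2,'C'),   -- bought C: excluded
       (3,'B'),                   -- never bought A: excluded
       (5,'D'),(5,'E');           -- irrelevant products: filtered out by WHERE

SELECT CustomerID
FROM #PurchaseDemo
WHERE ProductCode IN ('A','B','C')   -- discard irrelevant rows before grouping
GROUP BY CustomerID
HAVING SUM(CASE WHEN ProductCode = 'A' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN ProductCode = 'B' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN ProductCode = 'C' THEN 1 ELSE 0 END) = 0;
-- With the sample rows above, only CustomerID 1 qualifies.
```

The HAVING conditions alone already give the right answer; the WHERE clause only reduces how many rows reach the aggregation, which fits the timing difference described above.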

ejoell 66477 Right there with Babe Points: 778 More actions · Answer 6

Or you could do

SELECT DISTINCT ABuyers.CustomerID
FROM (SELECT CustomerID FROM #Purchase WHERE ProductCode = 'A') ABuyers
JOIN (SELECT CustomerID FROM #Purchase WHERE ProductCode = 'B') BBuyers
  ON ABuyers.CustomerID = BBuyers.CustomerID
LEFT JOIN (SELECT CustomerID FROM #Purchase WHERE ProductCode = 'C') CBuyers
  ON ABuyers.CustomerID = CBuyers.CustomerID
WHERE CBuyers.CustomerID IS NULL

Ken SSCoach Points: 17191 More actions · Answer 7

Nice topic Jeff. Thanks for the article. I have seen this type of query written using the having + sums method, left join and exists/not exists methods but have never tested which one is best. I personally like the Exists/Not Exists method just for readability even though it sounds like it is less efficient than the having method unless the extra index is applied. I always forget about intersect and except as it seems like those methods are about the same as exists/not exists.
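For readers who, like Ken, tend to forget INTERSECT and EXCEPT, a bare set-operator version might look like the sketch below. It is not the article's query, which is not reproduced in this thread; #PurchaseDemo and its rows are hypothetical.

```sql
-- Hypothetical stand-in for the article's #Purchase table.
IF OBJECT_ID('tempdb..#PurchaseDemo') IS NOT NULL DROP TABLE #PurchaseDemo;
CREATE TABLE #PurchaseDemo (CustomerID INT NOT NULL, ProductCode CHAR(1) NOT NULL);
INSERT INTO #PurchaseDemo (CustomerID, ProductCode)
VALUES (1,'A'),(1,'B'),(2,'A'),(2,'B'),(2,'C'),(3,'B'),(4,'A');

-- INTERSECT binds more tightly than EXCEPT in T-SQL, so this reads as
-- (A buyers INTERSECT B buyers) EXCEPT C buyers.
SELECT CustomerID FROM #PurchaseDemo WHERE ProductCode = 'A'
INTERSECT
SELECT CustomerID FROM #PurchaseDemo WHERE ProductCode = 'B'
EXCEPT
SELECT CustomerID FROM #PurchaseDemo WHERE ProductCode = 'C';
-- With the sample rows above, only CustomerID 1 qualifies.
```

Both operators return distinct rows, so no DISTINCT or GROUP BY is needed to remove repeat purchases.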

jshahan SSCarpal Tunnel Points: 4622 More actions · Answer 8

I've been faced with this problem many times and never adopted a standard methodology for dealing with it, but I am quite sure that my attempts were RBAR.

Would this approach be considered set based?

(Please don't yell at me if my question was covered in a previous post. I didn't read all of them.)

Antares686 SSC Guru Points: 125444 More actions · Answer 9

jshahan (3/29/2012)
I've been faced with this problem many times and never adopted a standard methodology for dealing with it, but I am quite sure that my attempts were RBAR.
Would this approach be considered set based?
(Please don't yell at me if my question was covered in a previous post. I didn't read all of them.)

Always test your standard approach. I have found on occasion that one approach performs great in one case while a variant query works better in another. So this is one more tool among my options to consider.

David Rueter SSCrazy Points: 2661 More actions · Answer 10

Thanks, Jeff.

I thought my standard pattern of simple joins would do about as well:

SELECT DISTINCT p.CustomerID
FROM #Purchase p
JOIN #Purchase p2
  ON p.CustomerID = p2.CustomerID
 AND p2.ProductCode = 'B'
LEFT JOIN #Purchase p3
  ON p.CustomerID = p3.CustomerID
 AND p3.ProductCode = 'C'
WHERE p.ProductCode = 'A'
  AND p3.CustomerID IS NULL

Indeed, at 1 million rows performance is similar:

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 11

How about this?

; WITH CTE AS
(
    SELECT P.CustomerID
         , Indicator1 = CASE WHEN P.ProductCode IN('A') THEN 0
                             WHEN P.ProductCode IN('C') THEN -1
                        END
         , Indicator2 = CASE WHEN P.ProductCode IN('B') THEN 1
                             ELSE 0
                        END
    FROM #Purchase P
    WHERE P.ProductCode IN ('A','B','C')
)
SELECT C.CustomerID
FROM CTE C
GROUP BY C.CustomerID
HAVING SUM (C.Indicator1) = 0 AND SUM(C.Indicator2) > 0

On a million-row table, I get this:

SQL Server Execution Times:
CPU time = 93 ms, elapsed time = 294 ms.

On ten million rows, I get this:

SQL Server Execution Times:
CPU time = 888 ms, elapsed time = 526 ms.

{Edit 1 : added ten million statistics}

{Edit 2: Removing P.ProductCode, which by the way is not required, from the CTE cuts down an extra scan}

{Edit 3: Fixed the bug noted by Toby Harman}
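As posted after the edits, the query appears to rely on NULL handling: Indicator1 has no ELSE branch, so a 'B' row yields NULL, and SUM over only NULLs is NULL rather than 0. A customer with no 'A' purchase therefore fails the SUM(Indicator1) = 0 test. The sketch below exposes the two sums per customer; #PurchaseDemo and its rows are hypothetical, not the article's test data.

```sql
-- Hypothetical stand-in for the article's #Purchase table.
IF OBJECT_ID('tempdb..#PurchaseDemo') IS NOT NULL DROP TABLE #PurchaseDemo;
CREATE TABLE #PurchaseDemo (CustomerID INT NOT NULL, ProductCode CHAR(1) NOT NULL);
INSERT INTO #PurchaseDemo (CustomerID, ProductCode)
VALUES (1,'A'),(1,'B'),(2,'A'),(2,'B'),(2,'C'),(3,'B'),(4,'A');

SELECT CustomerID
     , SumIndicator1 = SUM(CASE WHEN ProductCode = 'A' THEN 0
                                WHEN ProductCode = 'C' THEN -1
                           END)   -- no ELSE: 'B' rows contribute NULL
     , SumIndicator2 = SUM(CASE WHEN ProductCode = 'B' THEN 1 ELSE 0 END)
FROM #PurchaseDemo
WHERE ProductCode IN ('A','B','C')
GROUP BY CustomerID;
-- Customer 1 (A,B):   SumIndicator1 = 0,    SumIndicator2 = 1  -> passes both tests
-- Customer 2 (A,B,C): SumIndicator1 = -1,   SumIndicator2 = 1  -> fails, bought C
-- Customer 3 (B):     SumIndicator1 = NULL, SumIndicator2 = 1  -> fails, NULL = 0 is not true
-- Customer 4 (A):     SumIndicator1 = 0,    SumIndicator2 = 0  -> fails, never bought B
```

Depending on session settings, SQL Server may also report a warning that NULL values were eliminated by an aggregate; that is a side effect of the missing ELSE branch, not an error.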

David Rueter SSCrazy Points: 2661 More actions · Answer 12

March 29, 2012 at 2:34 pm

#1466623

{message deleted}

Toby Harman SSCarpal Tunnel Points: 4168 More actions · Answer 13

Hate to point this out, but ColdCoffee's approach is returning inaccurate results.

Specifically, it returns customers 6 and 7 from the original test data.

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 14

Toby Harman (3/29/2012)
Hate to point this out, but ColdCoffee's approach is returning inaccurate results.
Specifically, it returns customers 6 and 7 from the original test data.

That proves why my code was fast. Thanks, Toby.

Edited the code, which now runs a touch slower than Jeff's.

Jeff Moden SSC Guru Points: 1004546 More actions · Answer 15

mark hutchinson (3/28/2012)
@Jeff
Did you compare the speed of this query against one where your first (Group By) From clause is a Select Distinct (sub) query?
Nice article. It builds nicely for the reader.
I love seeing the Except clause being introduced to the reader who, like me, cut our teeth on SQL92 and need to learn newer language features to be more productive.

Thanks for the feedback, Mark.

To answer your question, I guess I'd need to see the code example you're talking about because I'm not sure that a SELECT DISTINCT would actually work here.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.
