Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

Question

Post reply

Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

Forum Etiquette: How to post Reporting Services problems

[/url]

Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
[/url]

How to Post Performance Problems - by Gail Shaw
[/url]

mister.magoo SSC-Forever Points: 47174 More actions · Answer 1

Hi Jeff,

Great Spackle, thanks!

In my tests, using your code to build a million row test, I found this method to be twice as fast for the same logical reads.

SELECT DISTINCT

CustomerID

FROM #Purchase p1

WHERE ProductCode = 'A'

AND EXISTS (SELECT

1

FROM #Purchase p2

WHERE p2.CustomerID = p1.CustomerID

AND p2.ProductCode = 'B')

AND NOT EXISTS (SELECT

1

FROM #Purchase p3

WHERE p3.CustomerID = p1.CustomerID

AND p3.ProductCode = 'C')

I tested this against 10,000,000 rows as well as 1,000,000 and found that it also scales better (IMHO)...

For 10,000,000 rows, the EXCEPT query had these stats:

Table 'Worktable'. Scan count 49955, logical reads 151393, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 655 ms, elapsed time = 739 ms.

The EXISTS query had these:

Table 'Worktable'. Scan count 3, logical reads 2156, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 406 ms, elapsed time = 478 ms.

I wondered if you considered this method and if so, what it was that steered you away from it?

Thanks

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

Wavesailor SSC Enthusiast Points: 101 More actions · Answer 2

Thanks for article and comments - I learnt a lot.

In MySQL 😀 this is the fastest by far:

SELECT

CustomerId

FROM Purchase

WHERE ProductCode IN ('A','B', 'C')

GROUP BY CustomerID

having sum(case when ProductCode = 'A' then 1 else 0 end) > 0

and sum(case when ProductCode = 'B' then 1 else 0 end) > 0

and sum(case when ProductCode = 'C' then 1 else 0 end) = 0

Unfortunately MySQL does not have the EXCEPT operator.

-jj

Jeff Moden SSC Guru Points: 1004412 More actions · Answer 3

Hi folks,

Thanks for the great discussion going on and the thoughtful feedback. I'd love to jump in this instant but I'm on my way to work. I'll be back tonight to try to answer some of the questions.

Thanks, again.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

venoym SSCarpal Tunnel Points: 4161 More actions · Answer 4

Here's another alternative. I haven't tested the performance as I don't have a setup to do that, and I don't know if it will be similar to Jeff's in query plan. It does avoid the COUNT(DISTINCT...) that I read a while back (can't find the linky) was sometimes bad (something about rewinds).

EDIT: Eh, nevermind... you can ignore this one... Did a small test and it's between 5 and 10x slower than other methods with about 2x the reads.

Second Edit: BUT, it's easier to adjust if you know that the situation is a random number of "Purchased these" with a single "but didn't purchase this". Most of the other methods rely on replicating a block of SQL for each item in the list of "Purchased". I think both Jeff's (and mine) work without that restriction.

SELECT

b.CustomerID

FROM

(

SELECT

a.CustomerID

FROM

(

SELECT

DISTINCT

CustomerID,

Product

FROM

Purchase

WHERE

Product IN ('A','B')) AS a

GROUP BY

a.CustomerID

HAVING

COUNT(Product) = 2

) AS b LEFT OUTER JOIN

Purchase p ON b.CustomerID = p.CustomerID AND p.Product = 'C'

WHERE

p.Product IS NULL

sdorris 90134 Right there with Babe Points: 790 More actions · Answer 5

As always, Jeff turns complex problems into simple solutions. I'm still looking for "SQL Spackles" by Jeff Moden at Barnes & Noble.

Jonathan AC Roberts SSCoach Points: 18363 More actions · Answer 6

mister.magoo (3/29/2012)
Hi Jeff,
Great Spackle, thanks!
In my tests, using your code to build a million row test, I found this method to be twice as fast for the same logical reads.
SELECT DISTINCT
CustomerID
FROM #Purchase p1
WHERE ProductCode = 'A'
AND EXISTS (SELECT
1
FROM #Purchase p2
WHERE p2.CustomerID = p1.CustomerID
AND p2.ProductCode = 'B')
AND NOT EXISTS (SELECT
1
FROM #Purchase p3
WHERE p3.CustomerID = p1.CustomerID
AND p3.ProductCode = 'C')

Nice way to do it. I find a lot of developers almost totally overlook use of EXISTS but it can be really efficient.

This is slightly quicker using the results that have both products 'A' and 'B' to check that there is not a 'C' :

SELECT DISTINCT CustomerID

FROM #Purchase p1

WHERE ProductCode = 'A'

AND EXISTS(SELECT 1

FROM #Purchase p2

WHERE p2.CustomerID = p1.CustomerID

AND p2.ProductCode = 'B'

AND NOT EXISTS (SELECT 1

FROM #Purchase p3

WHERE p3.CustomerID = p2.CustomerID

AND p3.ProductCode = 'C'))

Rose Bud Hall of Fame Points: 3147 More actions · Answer 7

Great spackle, Jeff. Very clearly written. Shows the old dogs a new trick and the new dogs some old tricks. Discussion shows there are many ways to skin a cat.

Andre Guerreiro SSCertifiable Points: 7319 More actions · Answer 8

SELECT
CustomerId
FROM Purchase
WHERE ProductCode IN ('A','B', 'C')
GROUP BY CustomerID
having sum(case when ProductCode = 'A' then 1 else 0 end) > 0
and sum(case when ProductCode = 'B' then 1 else 0 end) > 0
and sum(case when ProductCode = 'C' then 1 else 0 end) = 0

That's a very interesting method! I'll take note of it just in case I need something like this.

One question though:

Wouldn't the query below present the same results?

SELECT

CustomerId

FROM Purchase

WHERE ProductCode IN ('A','B', 'C')

GROUP BY CustomerID

having sum(case when ProductCode = 'C' then 1 else 0 end) = 0

Best regards,

Andre Guerreiro Neto

Database Analyst
http://www.softplan.com.br
MCITPx1/MCTSx2/MCSE/MCSA

Thomas LeBlanc SSCrazy Eights Points: 8085 More actions · Answer 9

Thanks Jeff, I never have seen or used the COUNT(DISTINCT <column> ) in the SELECT or HAVING. That is kinda of hard to visualize.

Great Article!!!

Thomas

Thomas LeBlanc, MVP Data Platform Consultant

Jonathan AC Roberts SSCoach Points: 18363 More actions · Answer 10

codebyo (3/29/2012)
One question though:
Wouldn't the query below present the same results?
SELECT
CustomerId
FROM Purchase
WHERE ProductCode IN ('A','B', 'C')
GROUP BY CustomerID
having sum(case when ProductCode = 'C' then 1 else 0 end) = 0

No, it would return someone who as ordered 'A' only or 'B' only

bob.probst+sqlservercentral SSC-Addicted Points: 412 More actions · Answer 11

Yet another solution:

select distinct a.customerid

from #purchase a

inner join #purchase b on (a.customerid = b.customerid)

left join #purchase c on (a.customerid = c.customerid and c.productcode = 'C')

where a.productcode = 'A'

and b.productcode = 'B'

and c.purchaseid is null

Andy DBA SSCommitted Points: 1776 More actions · Answer 12

Nice job, Jeff!

I thought for sure an "old school" approach uing an inner join and a left join exclusion would be faster, but it got blown away in 10M row tests by the article's solution that uses "Except". (1M row tests were still fairly even on my test machine!).

For what it's worth, I also did 10M row tests on some of the other suggestions in this discussion and found the solution first posted by Arjun S (using INTERSECT and EXCEPT) to have the best CPU and elapsed times.

Thanks for a great article (that even included benchmarking code!) and thanks everyone for posting additional solutions.

mtillman-921105 SSCertifiable Points: 7156 More actions · Answer 13

This is a terific article Jeff. I like they way you break down the problem and clearly spell out the thought process. It's most helpful.

The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

vyengr Grasshopper Points: 19 More actions · Answer 14

The EXISTS / NOT EXISTS approach is the one I would have taken. I believe EXISTS is more efficent than IN, or so I was taught for Oracle. I think you can get away without the first EXISTS clause:

Select Distinct CustomerID

FROM #Purchase PU

WHERE ProductCode ='A'

AND EXISTS (Select CustomerID

FROM #Purchase

WHERE ProductCode ='B' and CustomerID = PU.CustomerID)

AND NOT EXISTS (Select PU.CustomerID

FROM #Purchase

WHERE ProductCode ='C' and CustomerID = PU.CustomerID)

Andre Guerreiro SSCertifiable Points: 7319 More actions · Answer 15

Jonathan AC Roberts (3/29/2012)
No, it would return someone who as ordered 'A' only or 'B' only

I guess I may have drank too much coffee today. :doze:

Thank you.

Best regards,

Andre Guerreiro Neto

Database Analyst
http://www.softplan.com.br
MCITPx1/MCTSx2/MCSE/MCSA