March 29, 2012 at 4:07 am
Hi Jeff,
Great Spackle, thanks!
In my tests, using your code to build a million row test, I found this method to be twice as fast for the same logical reads.
SELECT DISTINCT
CustomerID
FROM #Purchase p1
WHERE ProductCode = 'A'
AND EXISTS (SELECT
1
FROM #Purchase p2
WHERE p2.CustomerID = p1.CustomerID
AND p2.ProductCode = 'B')
AND NOT EXISTS (SELECT
1
FROM #Purchase p3
WHERE p3.CustomerID = p1.CustomerID
AND p3.ProductCode = 'C')
I tested this against 10,000,000 rows as well as 1,000,000 and found that it also scales better (IMHO)...
For 10,000,000 rows, the EXCEPT query had these stats:
Table 'Worktable'. Scan count 49955, logical reads 151393, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 655 ms, elapsed time = 739 ms.
The EXISTS query had these:
Table 'Worktable'. Scan count 3, logical reads 2156, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 406 ms, elapsed time = 478 ms.
I wondered if you considered this method and if so, what it was that steered you away from it?
Thanks
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
March 29, 2012 at 4:23 am
Thanks for article and comments - I learnt a lot.
In MySQL 😀 this is the fastest by far:
SELECT
CustomerId
FROM Purchase
WHERE ProductCode IN ('A','B', 'C')
GROUP BY CustomerID
having sum(case when ProductCode = 'A' then 1 else 0 end) > 0
and sum(case when ProductCode = 'B' then 1 else 0 end) > 0
and sum(case when ProductCode = 'C' then 1 else 0 end) = 0
Unfortunately MySQL does not have the EXCEPT operator.
-jj
March 29, 2012 at 6:34 am
Hi folks,
Thanks for the great discussion going on and the thoughtful feedback. I'd love to jump in this instant but I'm on my way to work. I'll be back tonight to try to answer some of the questions.
Thanks, again.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 29, 2012 at 6:48 am
Here's another alternative. I haven't tested the performance as I don't have a setup to do that, and I don't know if it will be similar to Jeff's in query plan. It does avoid the COUNT(DISTINCT...) that I read a while back (can't find the linky) was sometimes bad (something about rewinds).
EDIT: Eh, nevermind... you can ignore this one... Did a small test and it's between 5 and 10x slower than other methods with about 2x the reads.
Second Edit: BUT, it's easier to adjust if you know that the situation is a random number of "Purchased these" with a single "but didn't purchase this". Most of the other methods rely on replicating a block of SQL for each item in the list of "Purchased". I think both Jeff's (and mine) work without that restriction.
SELECT
b.CustomerID
FROM
(
SELECT
a.CustomerID
FROM
(
SELECT
DISTINCT
CustomerID,
Product
FROM
Purchase
WHERE
Product IN ('A','B')) AS a
GROUP BY
a.CustomerID
HAVING
COUNT(Product) = 2
) AS b LEFT OUTER JOIN
Purchase p ON b.CustomerID = p.CustomerID AND p.Product = 'C'
WHERE
p.Product IS NULL
March 29, 2012 at 7:01 am
As always, Jeff turns complex problems into simple solutions. I'm still looking for "SQL Spackles" by Jeff Moden at Barnes & Noble.
March 29, 2012 at 7:10 am
mister.magoo (3/29/2012)
Hi Jeff,Great Spackle, thanks!
In my tests, using your code to build a million row test, I found this method to be twice as fast for the same logical reads.
SELECT DISTINCT
CustomerID
FROM #Purchase p1
WHERE ProductCode = 'A'
AND EXISTS (SELECT
1
FROM #Purchase p2
WHERE p2.CustomerID = p1.CustomerID
AND p2.ProductCode = 'B')
AND NOT EXISTS (SELECT
1
FROM #Purchase p3
WHERE p3.CustomerID = p1.CustomerID
AND p3.ProductCode = 'C')
Nice way to do it. I find a lot of developers almost totally overlook use of EXISTS but it can be really efficient.
This is slightly quicker using the results that have both products 'A' and 'B' to check that there is not a 'C' :
SELECT DISTINCT CustomerID
FROM #Purchase p1
WHERE ProductCode = 'A'
AND EXISTS(SELECT 1
FROM #Purchase p2
WHERE p2.CustomerID = p1.CustomerID
AND p2.ProductCode = 'B'
AND NOT EXISTS (SELECT 1
FROM #Purchase p3
WHERE p3.CustomerID = p2.CustomerID
AND p3.ProductCode = 'C'))
March 29, 2012 at 8:08 am
Great spackle, Jeff. Very clearly written. Shows the old dogs a new trick and the new dogs some old tricks. Discussion shows there are many ways to skin a cat.
March 29, 2012 at 8:14 am
SELECT
CustomerId
FROM Purchase
WHERE ProductCode IN ('A','B', 'C')
GROUP BY CustomerID
having sum(case when ProductCode = 'A' then 1 else 0 end) > 0
and sum(case when ProductCode = 'B' then 1 else 0 end) > 0
and sum(case when ProductCode = 'C' then 1 else 0 end) = 0
That's a very interesting method! I'll take note of it just in case I need something like this.
One question though:
Wouldn't the query below present the same results?
SELECT
CustomerId
FROM Purchase
WHERE ProductCode IN ('A','B', 'C')
GROUP BY CustomerID
having sum(case when ProductCode = 'C' then 1 else 0 end) = 0
Best regards,
Andre Guerreiro Neto
Database Analyst
http://www.softplan.com.br
MCITPx1/MCTSx2/MCSE/MCSA
March 29, 2012 at 8:19 am
Thanks Jeff, I never have seen or used the COUNT(DISTINCT <column> ) in the SELECT or HAVING. That is kinda of hard to visualize.
Great Article!!!
Thomas
Thomas LeBlanc, MVP Data Platform Consultant
March 29, 2012 at 8:35 am
codebyo (3/29/2012)
One question though:
Wouldn't the query below present the same results?
SELECT
CustomerId
FROM Purchase
WHERE ProductCode IN ('A','B', 'C')
GROUP BY CustomerID
having sum(case when ProductCode = 'C' then 1 else 0 end) = 0
No, it would return someone who as ordered 'A' only or 'B' only
March 29, 2012 at 9:09 am
Yet another solution:
select distinct a.customerid
from #purchase a
inner join #purchase b on (a.customerid = b.customerid)
left join #purchase c on (a.customerid = c.customerid and c.productcode = 'C')
where a.productcode = 'A'
and b.productcode = 'B'
and c.purchaseid is null
March 29, 2012 at 9:12 am
Nice job, Jeff!
I thought for sure an "old school" approach uing an inner join and a left join exclusion would be faster, but it got blown away in 10M row tests by the article's solution that uses "Except". (1M row tests were still fairly even on my test machine!).
For what it's worth, I also did 10M row tests on some of the other suggestions in this discussion and found the solution first posted by Arjun S (using INTERSECT and EXCEPT) to have the best CPU and elapsed times.
Thanks for a great article (that even included benchmarking code!) and thanks everyone for posting additional solutions.
March 29, 2012 at 9:28 am
This is a terific article Jeff. I like they way you break down the problem and clearly spell out the thought process. It's most helpful.
The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking
March 29, 2012 at 9:31 am
The EXISTS / NOT EXISTS approach is the one I would have taken. I believe EXISTS is more efficent than IN, or so I was taught for Oracle. I think you can get away without the first EXISTS clause:
Select Distinct CustomerID
FROM #Purchase PU
WHERE ProductCode ='A'
AND EXISTS (Select CustomerID
FROM #Purchase
WHERE ProductCode ='B' and CustomerID = PU.CustomerID)
AND NOT EXISTS (Select PU.CustomerID
FROM #Purchase
WHERE ProductCode ='C' and CustomerID = PU.CustomerID)
March 29, 2012 at 9:34 am
Jonathan AC Roberts (3/29/2012)
No, it would return someone who as ordered 'A' only or 'B' only
I guess I may have drank too much coffee today. :doze:
Thank you.
Best regards,
Andre Guerreiro Neto
Database Analyst
http://www.softplan.com.br
MCITPx1/MCTSx2/MCSE/MCSA
Viewing 15 posts - 16 through 30 (of 166 total)
You must be logged in to reply to this topic. Login to reply