June 13, 2014 at 4:52 am
Jonathan AC Roberts (6/12/2014)
Normally when an example is given using values A, B and C they should be taken as algebraic variables who's contents can change.
Agreed, it gives a solution that can be used in many situations.
I tested Jeff's original solution and 2 other solutions on a 10-million-row random table (using the code attached in Jeff's article):
DECLARE @a CHAR(1), @b-2 CHAR(1), @C CHAR(1)
SELECT @a = 'X', @b-2 = 'F', @C = 'M'
SET NOCOUNT ON
SET STATISTICS IO ON
SET STATISTICS TIME ON
PRINT '
----- Jeff''s Original -----'
SELECT CustomerID
FROM #Purchase
WHERE ProductCode IN (@A, @b-2)
GROUP BY CustomerID
HAVING COUNT(DISTINCT ProductCode) = 2
EXCEPT
SELECT CustomerID
FROM #Purchase
WHERE ProductCode = @C
---------------------------------
PRINT '
----- NOT EXISTS -----'
SELECT CustomerID
FROM #Purchase P1
WHERE ProductCode IN (@A, @b-2)
AND NOT EXISTS (
SELECT 1
FROM #Purchase P2
WHERE ProductCode = @C
AND P1.CustomerID = P2.CustomerID
)
GROUP BY CustomerID
HAVING COUNT(DISTINCT ProductCode) = 2
---------------------------------
PRINT '
----- SUM(CASE...) -----'
SELECT CustomerID
FROM #Purchase P1
WHERE ProductCode IN (@A, @b-2, @C)
GROUP BY CustomerID
HAVING SUM(CASE ProductCode WHEN @a THEN 1 ELSE 0 END) > 0
AND SUM(CASE ProductCode WHEN @b-2 THEN 1 ELSE 0 END) > 0
AND SUM(CASE ProductCode WHEN @C THEN 1 ELSE 0 END) = 0
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
SET NOCOUNT OFF
Results (cleaned for readability):
----- Jeff's Original -----
Table '#Purchase'. Scan count 49960, logical reads 151394.
SQL Server Execution Times: CPU time = 406 ms, elapsed time = 412 ms.
----- NOT EXISTS -----
Table '#Purchase'. Scan count 3, logical reads 2160.
SQL Server Execution Times: CPU time = 296 ms, elapsed time = 288 ms.
----- SUM(CASE...) -----
Table '#Purchase'. Scan count 3, logical reads 2160.
SQL Server Execution Times: CPU time = 655 ms, elapsed time = 661 ms.
June 13, 2014 at 5:20 am
The Wizard Of Oz (6/13/2014)
I tested Jeff's original solution and 2 other solutions on a 10-million-row random table (using the code attached in Jeff's article):
They are not equivalent and return different result.
COUNT(DISTINCT ...) = 2
and
SUM(CASE WHEN ... THEN ... ELSE ... END) > 0
Will return different results.
N 56°04'39.16"
E 12°55'05.25"
June 13, 2014 at 5:56 am
June 13, 2014 at 6:16 am
You don't need to.
The COUNT(DISTINCT ...) approach will only return the groups that has exactly one A and one B.
The SUM(CASE ...) approach will return all groups having at least one A and at least one B. It will also return the groups having 20 A's and 14 B's.
N 56°04'39.16"
E 12°55'05.25"
June 13, 2014 at 6:28 am
SwePeso (6/13/2014)
The COUNT(DISTINCT ...) approach will only return the groups that has exactly one A and one B.
Not sure about this, I thought the whole point of the DISTINCT is that it collapses any and all duplicates in the group, so it returns all groups having at least one A and at least one B, just like the SUM(CASE ...) approach.
SwePeso (6/13/2014)
The SUM(CASE ...) approach will return all groups having at least one A and at least one B. It will also return the groups having 20 A's and 14 B's.
Yes, I thought this was exactly what we needed?
June 13, 2014 at 6:30 am
So, depending on the distribution of the sample data, the two different queries will return different results.
The COUNT(DISTINCT ... ) will return a fewer number of rows, than the SUM(CASE ...) will.
N 56°04'39.16"
E 12°55'05.25"
June 13, 2014 at 6:54 am
I don't think so, because before that point we've already done our GROUP BY CustomerID, so really the 2 methods we're comparing are:
...
WHERE ProductCode IN (@A, @b-2)
--people that bought A or B or AB
AND NOT EXISTS (
SELECT 1
FROM #Purchase P2
WHERE ProductCode = @C
AND P1.CustomerID = P2.CustomerID
)
--people that bought (A or B or AB) and (not C)
GROUP BY CustomerID
HAVING COUNT(DISTINCT ProductCode) = 2
--people that bought (AB) and (not C)
...
WHERE ProductCode IN (@A, @b-2, @C)
--people that bought (A or B or C or AB or AC or BC or ABC)
GROUP BY CustomerID
HAVINGSUM(CASE ProductCode WHEN @a THEN 1 ELSE 0 END) > 0
AND SUM(CASE ProductCode WHEN @b-2 THEN 1 ELSE 0 END) > 0
AND SUM(CASE ProductCode WHEN @C THEN 1 ELSE 0 END) = 0
--people that bought (AB) and (not C)
Either I need more coffee today (99% likely), or you do Peso (1%).
I'm eager to learn more about SQL either way 🙂
June 13, 2014 at 7:00 am
I'm finding that the results are the same for all 3 queries regardless of the input parameters and, as far as I can see, logically they should be.
SSCrazy, Can you supply some example data to highlight what you are saying here?
I'm pretty sceptical as well and want some evidence that what you are saying is correct.
June 13, 2014 at 7:16 am
As always, an excellent article, Jeff. Then again, we've come to expect nothing less.
June 13, 2014 at 7:17 am
No. You are right.
I am the one needing more coffee. Just a mind lapse and disregarding the DISTINCT for some reason.
N 56°04'39.16"
E 12°55'05.25"
June 13, 2014 at 11:10 am
Ed Wagner (6/13/2014)
As always, an excellent article, Jeff. Then again, we've come to expect nothing less.
Thanks, Ed. Better than the article, though, look at the great discussions going on. Lotsa good people with good ideas. That's why I love this place. Ya just gotta love this community!
--Jeff Moden
Change is inevitable... Change for the better is not.
June 13, 2014 at 4:19 pm
Do we have a readymade 10 million/1million row generator for this problem?
June 13, 2014 at 4:35 pm
Now that we have new analytical functions, just wanted to jump in on the new windowed functions fun 🙂
; WITH CTE AS
(
SELECT P.CustomerID ,P.ProductCode
, CODE = CASE WHEN ( P.ProductCode = 'B'
AND LAG (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) = 'A'
AND LEAD (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) <> 'C' ) THEN 1
ELSE 0
END
FROM #Purchase P
WHERE P.ProductCode in ('A','B', 'C')
GROUP BY P.CustomerID ,P.ProductCode
)
SELECT CustomerID
FROM CTE
June 13, 2014 at 4:48 pm
ColdCoffee (6/13/2014)
Now that we have new analytical functions, just wanted to jump in on the new windowed functions fun 🙂
; WITH CTE AS
(
SELECT P.CustomerID ,P.ProductCode
, CODE = CASE WHEN ( P.ProductCode = 'B'
AND LAG (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) = 'A'
AND LEAD (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) <> 'C' ) THEN 1
ELSE 0
END
FROM #Purchase P
WHERE P.ProductCode in ('A','B', 'C')
GROUP BY P.CustomerID ,P.ProductCode
)
SELECT CustomerID
FROM CTE
What's the performance look like compared to the other methods?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 13, 2014 at 4:51 pm
ColdCoffee (6/13/2014)
Do we have a readymade 10 million/1million row generator for this problem?
Heh... apparently, you didn't read the article where is says... 😉
The code above isn't adequate for performance testing. For those that want to explore and compare solutions of their own, I've attached code to build a million row test table at the bottom of this article in the "Resources" link.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 151 through 165 (of 166 total)
You must be logged in to reply to this topic. Login to reply