June 6, 2014 at 9:50 am
SwePeso (6/6/2014)
SELECT CustomerID
FROM #Purchase
WHERE ProductCode IN ('A', 'B', 'C')
GROUP BY CustomerID
HAVING MIN(ProductCode) = 'A'
AND MAX(ProductCode) = 'C'
AND SUM(CASE WHEN ProductCode = 'B' THEN 1 ELSE 0 END) = 0;
Avoid DISTINCT. It is a huge performance killer.
Nifty trick, tack PeSo π
June 6, 2014 at 11:51 am
Thank you for this filler Jeff, good read as always, and amazing thread! Only ran your initial code, any thoughts on how to eliminate the second scan?
π
June 6, 2014 at 12:05 pm
Peter,
I think we wanted customers who bought A and B but not C, so your query will be even simpler,
MIN(productID) = 'A' and MAX(productID) = 'B'
June 6, 2014 at 12:09 pm
Naomi N (6/6/2014)
Peter,I think we wanted customers who bought A and B but not C, so your query will be even simpler,
MIN(productID) = 'A' and MAX(productID) = 'B'
What if we had a product 'D'?
Or if we wanted customers who bought A and C but not B?
June 6, 2014 at 12:20 pm
Luis Cazares (6/6/2014)
Naomi N (6/6/2014)
Peter,I think we wanted customers who bought A and B but not C, so your query will be even simpler,
MIN(productID) = 'A' and MAX(productID) = 'B'
What if we had a product 'D'?
Or if we wanted customers who bought A and C but not B?
SUM(case ) solution will always work and was already proposed. Basically, if we want A & B and not C, then
select CustomerID
from Orders
WHERE ProductID IN ('A','B','C)
GROUP BY CustomerID
HAVING MIN(ProductID) = 'A' and MAX(ProductID) = 'B'
---------------
If we want A & C, but not B, then add SUM(case ...) condition
In any case, using these simple algorithms will be more efficient than using COUNT(distinct)
--------------
If you check my article T-SQL Relational Division it solves more complex problem. In my case I was trying to find customers who made same (or almost the same) orders regardless of what products they bought.
June 6, 2014 at 12:27 pm
Naomi N (6/6/2014)
Luis Cazares (6/6/2014)
Naomi N (6/6/2014)
Peter,I think we wanted customers who bought A and B but not C, so your query will be even simpler,
MIN(productID) = 'A' and MAX(productID) = 'B'
What if we had a product 'D'?
Or if we wanted customers who bought A and C but not B?
SUM(case ) solution will always work and was already proposed. Basically, if we want A & B and not C, then
select CustomerID
from Orders
WHERE ProductID IN ('A','B','C)
GROUP BY CustomerID
HAVING MIN(ProductID) = 'A' and MAX(ProductID) = 'B'
---------------
If we want A & C, but not B, then add SUM(case ...) condition
In any case, using these simple algorithms will be more efficient than using COUNT(distinct)
--------------
If you check my article T-SQL Relational Division it solves more complex problem. In my case I was trying to find customers who made same (or almost the same) orders regardless of what products they bought.
Yes, but not more efficient than the intersect/except clauses proposed previously on this discussion. π
June 6, 2014 at 12:40 pm
Naomi N (6/6/2014)
Peter,I think we wanted customers who bought A and B but not C, so your query will be even simpler,
MIN(productID) = 'A' and MAX(productID) = 'B'
See page 10.
N 56Β°04'39.16"
E 12Β°55'05.25"
June 6, 2014 at 2:12 pm
SwePeso (6/6/2014)
Naomi N (6/6/2014)
Peter,I think we wanted customers who bought A and B but not C, so your query will be even simpler,
MIN(productID) = 'A' and MAX(productID) = 'B'
See page 10.
Page 10? I only have 3 pages. π
June 6, 2014 at 2:18 pm
Lynn Pettis (6/6/2014)
SwePeso (6/6/2014)
Naomi N (6/6/2014)
Peter,I think we wanted customers who bought A and B but not C, so your query will be even simpler,
MIN(productID) = 'A' and MAX(productID) = 'B'
See page 10.
Page 10? I only have 3 pages. π
I have 5. :w00t:
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 β Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
June 6, 2014 at 2:45 pm
Sean Lange (6/6/2014)
Lynn Pettis (6/6/2014)
SwePeso (6/6/2014)
Naomi N (6/6/2014)
Peter,I think we wanted customers who bought A and B but not C, so your query will be even simpler,
MIN(productID) = 'A' and MAX(productID) = 'B'
See page 10.
Page 10? I only have 3 pages. π
I have 5. :w00t:
I have 13 π BTW, is there a way to receive only 'interesting' replies notifications, not every silly reply? π Or include the actual text of the message in the notification?
June 6, 2014 at 10:19 pm
How about using bitwise checking?
select customerid,
sum(distinct case
when productcode='A' then 1
when productcode='B' then 2
when productcode='C' then 4
else 0 end)
from #purchase
group by customerid
having sum(distinct case
when productcode='A' then 1
when productcode='B' then 2
when productcode='C' then 4
else 0 end) = 3
π
June 6, 2014 at 10:52 pm
I think a lot of people seem to focus purely on performance, a good solution to a given issue considers many factors, NOT just performance, people who focus on just the performance aspect tend to write vastly more code and it eventually gets messy, I've seen this many times, stored procedures with vast swathes of if/then/else statements separating complete SQL statements and every time someone requests something even slightly different a new if/then/else and query is added, eventually leading to exceedingly messy code. If you need to add another column to the output youβre changing many SQL statements.
The truth is sometimes a little bit of performance needs to be sacrificed for cleaner and more flexible code, it will make your life easier and unless there is a vast difference the user/consumer won't care. What they will care about is how long it takes to change the query to return the data with slightly different criteria. I have worked as a contractor for many years and when I see these huge if/then/else structures, as I have many times, I do not think the developer was all that clever at all, unless there was absolutely no choice, they have just made it harder for themselves and the next guy who looks at the code. And before anyone has the idea of building these queries dynamically to cater for different input parameters, just be aware this is the absolute last resort because it is inefficient and the query will be compiled when executed and dependencies cannot be considered by SQL Server, not when you deploy the procedure, so add that compile time and the building of the query to your performance and see where it stands.
All of the solutions here are valid in specific situations. If this query were behind a report there is a distinct possibility that just returning the CustomerID would not be enough and they would want some figures and some customer information returned, never forget the end goal nor the people who will be consuming the data, who will ALWAYS ask for the same thing but sliced slightly differently tomorrow. How long is it going to be before someone asks for a filter on product "E" or where βCβ is greater than 5 but less than 10? They would/should expect these things to be parameterised, not have to ask for a new report. Who wants to write 10 reports when they could write 1 with parameters which allow some flexibility, I like my life and the life of the guy who looks at the code after me to be easy not hard.
This problem is obviously one which deals with sets of data so the most obvious solution is to use INTERSECT/EXCEPT/UNION/ETC..., the set operators but is this the optimal solution, well maybe, depends what you are going to do with the results and how flexible the criteria needs to be.
The trick is to weigh up the pros/cons of all the possible solutions and pick the best one for the given situation, NOT to just say this one is faster or this one has less characters to type so I'm going to use that. You should always come up with more than one solution to a given problem if possible, this problem has many solutions, as we have seen. The right one is the one which considers all the factors of good coding and serviceability, which is always completely dependent on the situation/usage.
Many times I have had to pull up recent graduates and ask them what they are going to do when the user asks for something slightly different, the response is always "but the spec says they want this", you need look past specifications and add some real world knowledge to the solution, give them what they want but also consider what they are likely to want tomorrow. That's why I threw in the PIVOT example, which is quite out of the square for the given problem and the other solutions which donβt use the set operators are excellent, the obvious solution isnβt always the right solution, under certain circumstances, not all circumstances, these could prove to be the right solution. The PIVOT solution would be vastly easier to alter the criteria for and a hell of a lot easier to read than say 10 queries separated by set operators. The downside is the number of columns for the pivot are set and the performance is not quite as good. Like I said, weigh up the pros/cons and choose the right solution which will work best in the long run for you considering all the factors, NEVER just performance.
Sorry for being so verbose π
June 6, 2014 at 11:59 pm
SimonC, I have to disagree with you here.
Even if the PIVOT solution seem to work well, it doesn't scale well. And that is one of the cardinal sins junior developers tend to make. If you make a PIVOT solution in your test environment (which may have 100 or 1000 rows), you will halt the server when deploying your solution to the production server which may have a million or more rows.
I have seen this over and over again. And then some more times again.
It's a pity you despise performance written code, in favor of "simpler to understand code". With that mindset and attitude, junior developers will never be good developers, and some of them have potential to be excellent developers.
Why don't give them a chance to evolve and actually write code that has a chance in the real world?
The PIVOT solution you gave is also specific for this problem statement. If you want a "one solution rule them all", you would need to learn relational division (which happens to be one of my specialities). The problem with the PIVOT solution is that you introduce the three-tiered logic with NULL, since PIVOT aggregates NULL for unknown values, not 0. If you write the PIVOT solution with the good old-fashioned SUM(CASE ... WHEN ... THEN ... END), you are safe for that to, but then the solution is so close to the other solutions.
My $0.02
N 56Β°04'39.16"
E 12Β°55'05.25"
June 7, 2014 at 1:03 am
Hi SSCrazy,
Thanks for your input.
Actually, I didn't say that I don't like performance written code what I am getting at is that performance is not the only consideration, sacrificing a few milli-seconds of performance for code that's readable and more flexible is an acceptable trade-off. And I don't despise junior developers, don't know where you got that from that's a bit extreme, I just want them, and everyone else, to consider more than one solution to a given problem and that they should look at other alternatives as well, not just the most obvious and choose the best for the given situation taking into account not just performance. Performance is a major consideration if there is a vast difference.
eg) using 200 lines of code that runs 10 milliseconds faster than 10 lines of readable and flexible code behind a report that is run maybe twice a day is not a good thing, I'll take the 10 lines of code any day. Like I said it all depends on the situation.
As for the PIVOT example, I didn't say it was the βone solution to rule them allβ, there isnβt one, just another alternative, like Relational Division and Set Operators all should be considered and the most appropriate chosen for the given situation.
I did do the test on a million rows, however no query is an island and in a production situation has to contend with other process/queries running at the same time, so in a given situation you may be correct, this would need to be tested. I ran this on my home computer with nothing else running at the same time but I'm thinking so did everyone else, I doubt anyone ran their examples on their production server.
June 7, 2014 at 5:53 am
Hello All
I'm just wondering, is there anything wrong with doing it this way:
SELECT DISTINCT
P.CustomerID
FROM #Purchase P
LEFT JOIN (
SELECT
P.CustomerID
FROM Purchase P
WHERE
P.ProductCode = 'C'
) NP ON NP.CustomerID = P.CustomerID
WHERE
P.ProductCode IN ( 'A', 'B' )
AND NP.CustomerID IS NULL
This seems a lot simpler to me and is the way I would probably have done it but are the performance implications in comparison with other methods demonstrated?
Thanks
Robin
Viewing 15 posts - 121 through 135 (of 166 total)
You must be logged in to reply to this topic. Login to reply