October 14, 2019 at 12:00 am
Comments posted to this topic are about the item A User-Defined Function returns more rows
Aveek has been working as an Analytics Engineer for the past couple of years now. His main areas of interests is SQL Server, SSIS, and SSAS. On his leisure, he likes to travel and is also an amateur photographer.
https://twitter.com/avikoleum
https://www.linkedin.com/in/aveekd22/
October 14, 2019 at 12:13 pm
The reason your query returned more results and the fact that it does, due to the DISTINCT clause, should be a red flag that there's something different about the data being returned. The quickest and easiest way to narrow that down, for me anyway, is to copy the 2 sets of data to a spreadsheet and comparing the results to find which record is being repeated and then figure out why. The solution you came up with is very wrong, even though it appears to work. Here's a quick example to show why or where it can go wrong.
Item, Count
ItemA, 1
ItemB, 2
ItemC, 3
Let's say something introduced in the UDF somehow caused these records to be repeated, which results in 5 records.
ItemA, 1
ItemB, 2
It's true that your change will return 3 records but that's because this is what you're getting as a result:
ItemA, 2
ItemB, 4
ItemC, 3
Your total row count is now correct but that's because you instructed SQL to SUM the Count column and that's exactly what it did. The data returned on your count column, however, is incorrect and can lead to corruption or, at the very least, incorrect output.
October 14, 2019 at 5:01 pm
First of all, thank you for taking the time to post your experience and share what you've learned. It's how many of us grow. One thing you may have overlooked is that there is no need for that user function. You can simply put Price * Quantity right inside the SUM() aggregate. In fact, while we're on it. If there's anything I've learned in the 15+ years I've been writing queries, it's this...when you see a user function, ask, "How can I get rid of this?" I think you'll find that in most cases that function needs removed now or in the future so your performance doesn't suffer. While functions are the bread-and-butter of the object oriented programming world, they are often the kryptonite of the SQL Server optimizer. To get started, research RBAR and Set Theory to get the bigger picture of why functions are typically avoided. Again, thanks for taking the time to share!
October 14, 2019 at 5:03 pm
The problem is that once the total price was added, it became part of the DISTINCT evaluation.
There are two rows with 'Car', and "SELECT DISTINCT Product" would return one row for 'Car'.
But with TotalPrice in the SELECT, this is the data:
Product TotalPrice
------------------------------------
Car 2000
Car 500
So DISTINCT sees two different TotalPrice values, and it returns two rows.
As lead DBA at my organization, I have the policy that DISTINCT is not to be used in any production query without DBA approval. In general, I found that DISTINCT was being added as a band-aid to cover up that the developer did not understand why they were getting excessive rows. This is a performance impact; the query is pulling too much data, and then sorting to do the DISTINCT. I teach them techniques to find which join is generating the row multiplication, and how to add the appropriate filter or GROUP BY.
Sincerely,
Daniel
October 14, 2019 at 6:41 pm
Thank you for your comment.
Well, I think the example that you have given is valid and holds good for this scenario, however, my business case was a totally different one, which is not even something near to this. It was just a demonstration of what could go wrong using DISTINCT and a UDF.
Could you please provide your idea of handling such a scenario? Would love to know one.
Thanks,
Aveek
Aveek has been working as an Analytics Engineer for the past couple of years now. His main areas of interests is SQL Server, SSIS, and SSAS. On his leisure, he likes to travel and is also an amateur photographer.
https://twitter.com/avikoleum
https://www.linkedin.com/in/aveekd22/
October 14, 2019 at 6:42 pm
Thanks a lot for providing your suggestions.
Aveek has been working as an Analytics Engineer for the past couple of years now. His main areas of interests is SQL Server, SSIS, and SSAS. On his leisure, he likes to travel and is also an amateur photographer.
https://twitter.com/avikoleum
https://www.linkedin.com/in/aveekd22/
October 14, 2019 at 6:44 pm
Yes, you're correct. In fact, as a developer, I always aim to write code without using any UDFs in the statements. However, in my scenario, we had a lot of discussions with some senior team members before introducing the UDF, because there was no other way we could achieve what we wanted to, without using a UDF.
Aveek has been working as an Analytics Engineer for the past couple of years now. His main areas of interests is SQL Server, SSIS, and SSAS. On his leisure, he likes to travel and is also an amateur photographer.
https://twitter.com/avikoleum
https://www.linkedin.com/in/aveekd22/
October 14, 2019 at 6:48 pm
Without knowing the data or what you're trying to accomplish, we can't really help you. Like 1 person mentioned, using UDF is not recommended as part of the query but that's not always the case. I've done UDF's on a select statement myself, to simplify a complex query. My recommendation is to study the difference in the results and see why the DISTINCT clause is causing that. Sometimes it's easier to just rewrite the entire query, then it is to patch it.
October 14, 2019 at 6:55 pm
Yes, I agree with that.
For obvious reasons, I won't be able to share the data here, but it is a report, which fetches all the tickets that had been sold using a specific subscription. We also wanted to know if there was any discount applied to that ticket, and if yes, what was the campaign and how much was the discount. To sum up, it was a complex necessity to show everything in the same report altogether.
Aveek has been working as an Analytics Engineer for the past couple of years now. His main areas of interests is SQL Server, SSIS, and SSAS. On his leisure, he likes to travel and is also an amateur photographer.
https://twitter.com/avikoleum
https://www.linkedin.com/in/aveekd22/
October 14, 2019 at 7:03 pm
No need for data but if you can share the query, renaming the columns if you want, then it would at least give us a glimpse and help you write a better query.
October 14, 2019 at 8:03 pm
Yes, you're correct. In fact, as a developer, I always aim to write code without using any UDFs in the statements. However, in my scenario, we had a lot of discussions with some senior team members before introducing the UDF, because there was no other way we could achieve what we wanted to, without using a UDF.
A possible better solution would be to rewrite the scalar UDF as an inline-table valued function. Not sure that a function is actually needed - could be a simple CROSS/OUTER APPLY and CTE - but that would require more information than has been provided.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
October 14, 2019 at 11:54 pm
Thanks for an interesting article.
October 15, 2019 at 6:10 am
DISTINCT is a strict prohibition in our team. Any usage of DISTINCT needs to be specifically approved, if at all it is to be used, and it ought to be used with a comment justifying the reason.
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply