March 23, 2012 at 9:38 am
Have a situation here where a developer has written a query that includes a scalar UDF to parse through a couple of columns in a table. When the function is omitted, the query runs in < 5 seconds.
- There are about 24 million rows in this table
- The table contains 2 columns that need to be parsed
- Sometimes there are multiple pieces of data between the ";" delimiters
Here's the column data the function is attempting to parse:
16.245;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
In some cases there are multiple pieces of data within the string:
1;1;75.434;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
In the situation above there are 3 chunks of needed data: i) 1, ii) 1, iii) 75.434
Attached are 100 rows of table data, the first 100 rows of what each query returns, and execution plans for both queries (with the fx() and without). My question: how can this function, with all its loops, best be optimized?
Here are the queries:
Query without using the function:
SELECT
a.[tc-number],
'' as FleetName,
'' as City,
'' as StateProvidence,
'' as DateStarted,
COUNT(DISTINCT a.[tc-card]) AS ActiveCards,
SUM(CASE WHEN [trans-code] in (43,53) THEN 0 ELSE 1 END) AS FuelTrans,
a.[prod-code],a.[prod-qty],'' AS Program,
a.[MonthName],
a.[Year]
FROM dbo.[tmpQTY_LocalFleets_WORK] AS a
WHERE
[pos-date] >= '2012/01/01'
AND [pos-date] <'2012/01/02'
AND [system] = 'PR'
GROUP BY a.[tc-number],
a.[monthName],
a.[Year],a.[prod-code],a.[prod-qty]
Query using the function:
SELECT
a.[tc-number],
'' as FleetName,
'' as City,
'' as StateProvidence,
'' as DateStarted,
COUNT(DISTINCT a.[tc-card]) AS ActiveCards,
SUM(CASE WHEN [trans-code] in (43,53) THEN 0 ELSE 1 END) AS FuelTrans,
SUM([dbo].[fx_ArrayList_IP_9378]( a.[prod-code],a.[prod-qty])) as FuelOnlyQTY,
'' AS Program,
a.[MonthName],
a.[Year]
FROM dbo.[tmpQTY_LocalFleets_WORK] AS a
WHERE
[pos-date] >= '2012/01/01'
AND [pos-date] <'2012/01/02'
AND [system] = 'PR'
GROUP BY a.[tc-number],
a.[monthName],
a.[Year]
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience.
March 23, 2012 at 11:29 am
After importing your CSV [testcsvt], I used Jeff's splitter and was able to sum the prod-qty for each tranid. It's set-based, so no looping!
With a little tweaking, you should be able to incorporate the splitter function right into your select statement. Might need to use a derived table to make it a little cleaner, but this gives you a start.
To sum the quantity for each transaction:
SELECT tranid, SUM(CAST(t.item AS decimal(14,3))) AS qty
FROM testcsvt
CROSS APPLY dbo.fn_DelimitedSplit8K([prod-qty],';') t
GROUP BY tranid
ORDER BY tranid
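For example, here's a rough sketch of the derived-table version mentioned above (untested; same [testcsvt] table and splitter):

-- Wrap the split-and-sum in a derived table so the summed quantity
-- can be reused or joined against other columns later:
SELECT q.tranid, q.qty
FROM (
    SELECT tranid,
           SUM(CAST(t.Item AS decimal(14,3))) AS qty
    FROM testcsvt
    CROSS APPLY dbo.fn_DelimitedSplit8K([prod-qty], ';') AS t
    GROUP BY tranid
) AS q
ORDER BY q.tranid;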
March 25, 2012 at 8:54 pm
CELKO (3/25/2012)
Remember anything about first normal form? Writing a parser in SQL is like sawing wood with a screwdriver. It is the wrong tool. Do this with a front end language that is meant for that kind of work. Do you have an ETL tool? If you still do not want to do it right, there are several kludges you can Google. I wrote some of the early versions. I can do kludges too! But I try not to.
It's funny that you mention sawing wood with a screwdriver and then recommend some of your own kludges. 😉
The splitter that's been recommended will do the job just fine. And, no... the OP probably didn't design the denormalized mess that we've observed. He'd probably like to use a real power saw on this chunk of wood but has been told that he can't go into the next room to use it. He'll just have to settle for a handsaw. :hehe:
By the way... I have to ask. You and I both strongly endorse normalization. How do you feel about storing XML in a table?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 26, 2012 at 7:48 am
CELKO (3/25/2012)
Remember anything about first normal form? Writing a parser in SQL is like sawing wood with a screwdriver. It is the wrong tool. Do this with a front end language that is meant for that kind of work. Do you have an ETL tool? If you still do not want to do it right, there are several kludges you can Google. I wrote some of the early versions. I can do kludges too! But I try not to.
Have you never worked at a company and inherited someone else's work? Perhaps you've had the luxury of having every developer bring their code to you before they start using it, but most of us have to deal with developers in a different way (to me it's like herding a bunch of cats) - we constantly have to monitor the cra# that's being written and offer suggestions for efficiency and scalability. When something is written poorly and REALLY stands out, we can step in and correct it immediately; however, a typical day around my office is far too hectic to spend the majority of it babysitting the development team.
Criticisms are also appreciated if they're constructive.
Good Lord, Celko, you need to have a serious talk with the person that cra#s in your cornflakes daily :Whistling:
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience.
March 26, 2012 at 8:50 am
1) If you really want to see what is going on, run a Profiler trace capturing statement-completed events while this query is running. But BEWARE doing this on a production box!!
2) Scalar UDFs are UNBELIEVABLY BAD!! Soooo many ways they can screw you. Check out the chapter I wrote for the SQL Server MVP Deep Dives 2 book entitled "Death by UDF". 🙂
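To illustrate the usual cure (a sketch only - the table and column names here are made up, and this is not the OP's actual function): rewrite the scalar UDF as an inline table-valued function and call it with CROSS APPLY, so the optimizer can expand the logic into the main plan instead of invoking it row by row.

-- Hypothetical inline TVF; the body is deliberately trivial,
-- it's the RETURNS TABLE shape that matters:
CREATE FUNCTION dbo.itvf_Calc (@p1 varchar(8000), @p2 varchar(8000))
RETURNS TABLE
AS
RETURN SELECT Result = LEN(@p1) + LEN(@p2);
GO
-- Called via CROSS APPLY (dbo.SomeTable and its columns are made up):
SELECT a.Col1, c.Result
FROM dbo.SomeTable AS a
CROSS APPLY dbo.itvf_Calc(a.Col1, a.Col2) AS c;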
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
March 26, 2012 at 7:45 pm
Okay, I'm trying out the tally table but am noticing some peculiar behavior (perhaps this is normal).
Consider this sample row of data:
tranid | ctrl-number | tc-number | pos-date | tc-card | trans-code | prod-code | prod-qty | Month | MonthName | Year | System
1154842956 | 847632 | 431225 | '2012-01-01 00:00:00.000' | '0012 ' | 31 | '2;5;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0' | '9.287;11.51;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0' | 1 | 'January' | 2012 | 'PR'
When I use the functions originally attached to the OP, the result from the developer's function is:
tc-number | ActiveCards | FuelTrans | FuelOnlyQTY
431225 | 1 | 1 | 20.797
This is arrived at with the following T-SQL:
SELECT
a.[tc-number],
COUNT(DISTINCT a.[tc-card]) as ActiveCards,
SUM(CASE WHEN [trans-code] in (43,53) THEN 0 ELSE 1 END) AS FuelTrans,
SUM([dbo].[fx_ArrayList_IP_9378]( a.[prod-code],a.[prod-qty])) as FuelOnlyQTY
FROM dbo.[tmpQTY_LocalFleets_WORK] as a
WHERE
[pos-date] >= '2012/01/01'
and [pos-date] <'2012/01/02'
and [system] = 'PR'
and [tc-number] = 431225
GROUP BY a.[tc-number],
a.[monthName],
a.[Year]
When I try the same thing using the tally table:
SELECT
a.[tc-number],
COUNT(DISTINCT a.[tc-card]) as ActiveCards,
SUM(CASE WHEN [trans-code] in (43,53) THEN 0 ELSE 1 END) AS FuelTrans,
SUM(CAST(t.item as decimal(14,3))) qty
FROM dbo.[tmpQTY_LocalFleets_WORK] as a
CROSS APPLY dbo.fx_DelimitedSplit8K([prod-qty],';') t
WHERE
[pos-date] >= '2012/01/01'
and [pos-date] <'2012/01/02'
and [system] = 'PR'
and [tc-number] = 431225
GROUP BY a.[tc-number],
a.[monthName],
a.[Year]
I get:
tc-number | ActiveCards | FuelTrans | qty
431225 | 1 | 20 | 20.797
How on earth is the COUNT(DISTINCT a.[tc-card]) as ActiveCards coming up with a result of 20 when the COUNT should clearly be 1???
Also, the developer's original function "fx_ArrayList_IP_9378" looks specifically at the first array list passed in: for each position where the value in the first list is less than 99, it sums the corresponding value from the second list; otherwise that position is omitted from the calculation.
Consider:
First array list
2;5;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
Second array list
9.287;11.51;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
So... here's how this should work: because the first two values in the first array list are 2 and 5, the code works correctly by summing up the first two values of the second array list, which equals 20.797 - this is correct.
But what if the first array list were:
2;197;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
Then the computed value from the splitter should be only 9.287, as the second value (197) was above 99.
How can I accomplish this using the tally table?
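In case it helps anyone answer, here's the rough shape of what I'm picturing (completely untested): split both columns, pair the pieces up by ItemNumber, and only sum the quantities whose matching code is under 99.

-- Pair the two delimited lists by position and sum conditionally:
SELECT a.[tc-number],
       SUM(CAST(q.Item AS decimal(14,3))) AS FuelOnlyQTY
FROM dbo.[tmpQTY_LocalFleets_WORK] AS a
CROSS APPLY dbo.fx_DelimitedSplit8K(a.[prod-code], ';') AS c
CROSS APPLY dbo.fx_DelimitedSplit8K(a.[prod-qty], ';') AS q
WHERE q.ItemNumber = c.ItemNumber  -- same position in both lists
  AND CAST(c.Item AS int) < 99     -- the "under 99" rule described above
  AND a.[tc-number] = 431225
GROUP BY a.[tc-number];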
Apologies if I'm asking to be spoon-fed an answer here, but we're kinda in a bind on this one, and using the original function takes about an hour to return results for a single month (and we need to go back 36 months) :unsure:
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience.
March 26, 2012 at 8:55 pm
UPDATE: I was able to find out why the COUNT was returning 20. Even though there is only one record in the example, because of the CROSS APPLY it is counting the record 20 times (once for each element in the array).
How on earth is the COUNT(DISTINCT a.[tc-card]) as ActiveCards coming up with a result of 20 when the COUNT should clearly be 1???
Is there a work-around?
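One idea I'm toying with (untested sketch): do the SUM inside the APPLY, so each base row stays a single row and the outer COUNT/SUM aren't inflated:

-- The correlated aggregate returns exactly one row per base row:
SELECT a.[tc-number],
       COUNT(DISTINCT a.[tc-card]) AS ActiveCards,
       SUM(CASE WHEN [trans-code] IN (43,53) THEN 0 ELSE 1 END) AS FuelTrans,
       SUM(s.qty) AS qty
FROM dbo.[tmpQTY_LocalFleets_WORK] AS a
CROSS APPLY (
    SELECT qty = SUM(CAST(t.Item AS decimal(14,3)))
    FROM dbo.fx_DelimitedSplit8K(a.[prod-qty], ';') AS t
) AS s
WHERE [pos-date] >= '2012/01/01'
  AND [pos-date] < '2012/01/02'
  AND [system] = 'PR'
  AND [tc-number] = 431225
GROUP BY a.[tc-number], a.[monthName], a.[Year];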
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience.
March 27, 2012 at 2:20 am
You could have a look at where the original data is coming from - if it's via an SSIS package, then put a splitting function into that and create extra fields. I did this for our reporting database using a script component in the data flow and the VB String.Split method, and it saved a lot of processing.
Agree with you about herding cats though;-)
March 27, 2012 at 2:44 am
CELKO (3/26/2012)
I hate XML inside SQL. I also hate CLRs and anything that is not pure SQL. Mixed code cannot be optimized, requires programmers that know all the assorted languages involved just to read it. Then you have to maintain releases of all of them.
Dear Joe Celko.
This may not be a popular post, but frankly I don't care right now. Whether you like it or not, XML and CLR (and IDENTITY) have made their way into SQL Server. You may very well like everything to behave according to old relational theory and the ANSI SQL standards. I too would like all database engines to understand all ANSI SQL statements; unfortunately, that is not the case. And as for relational theory, it is sometimes beneficial to denormalize and use surrogate keys to improve performance. While this is so, I focus on doing my job efficiently, rather than focusing on standards and theory.
When I first started reading your books, I got the impression that you were a man of great knowledge on how to use database systems. Having worked several years with SQL Server now, I am sorry to say that this has changed. Yes, you have a lot of great knowledge of ANSI SQL, but I have this strange feeling that you do not know how to do things efficiently in SQL Server.
I would be very relieved if you could focus on what people want to achieve, how that can be done as efficiently as possible with this particular database engine, and not focus solely on theory and ANSI Standards.
March 27, 2012 at 6:48 am
>>but I have this strange feeling that you do not know how to do things efficiently in SQL Server.
Absolutely correct - and it shouldn't be a 'strange feeling' either, but rather a common and expected one.
>>I would be very relieved if you could focus on what people want to achieve, how that can be done as efficiently as possible with this particular database engine, and not focus solely on theory and ANSI Standards.
Some of us have been asking him to do that (or to simply stop posting on SQL Server forums completely) for YEARS now.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
March 27, 2012 at 7:44 am
CELKO (3/26/2012)
I hate XML inside SQL. I also hate CLRs and anything that is not pure SQL...
A very puritan American view :hehe:
That is pure "culture of hate" propaganda...:w00t:
Let's continue the list of hate candidates:
1. I hate integers, because they don't allow storing decimal numbers!
2. I hate decimals, because they are not pure enough!
3. I hate having to work with different datatypes; there should be only one, for purity's sake! (sql_variant everything - why not? Simple and nice, and it has "sql" in the name.)
4. I hate the collation idea! Why so many of them? One would be enough! It just adds to the mix...
5. I hate aggregate functions, because there are always cases when you want to use them together with something you don't want to group by or aggregate, but it is not allowed!
6. More than anything else I hate ISO standards, because I don't like combinations of letters and numbers! They make me dizzy!
:hehe::hehe::hehe::hehe::hehe::hehe::hehe::hehe:
March 27, 2012 at 8:31 am
P Jones (3/27/2012)
You could have a look at where the original data is coming from - if it's via an SSIS package then put a splitting function into that and create extra fields. I did this for our reporting database using a script component in the data flow and the vb string.split command and it's saved a lot of processing.Agree with you about herding cats though;-)
Unfortunately the data is not coming from SSIS, and in all honesty I believe porting it to that method would take as much time as just seeing whether it can be figured out this way 🙁
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience.
March 27, 2012 at 10:00 am
Can anyone offer a solution to this issue regarding my post: Yesterday @ 9:50 AM?
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience.
March 27, 2012 at 10:48 am
MyDoggieJessie (3/27/2012)
Can anyone offer a solution to this issue regarding my post: Yesterday @ 9:50 AM?
Can you provide the DDL for the UDF, not as an attachment? Not everyone wants to download things at work...