November 2, 2012 at 1:25 pm
Here is a function to do the same thing:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[fnParseStringTSQL]
    (@string NVARCHAR(MAX), @separator NCHAR(1))
RETURNS @parsedString TABLE (string NVARCHAR(MAX))
AS
BEGIN
    DECLARE @position INT;
    SET @position = 1;

    -- Append a trailing separator so the final element is captured by the loop.
    SET @string = @string + @separator;

    -- Walk the string, extracting one element per iteration.
    WHILE CHARINDEX(@separator, @string, @position) <> 0
    BEGIN
        INSERT INTO @parsedString
        SELECT SUBSTRING(@string, @position, CHARINDEX(@separator, @string, @position) - @position);

        SET @position = CHARINDEX(@separator, @string, @position) + 1;
    END;

    RETURN;
END
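A quick usage example for the function above (the sample values are mine):

SELECT string
FROM dbo.fnParseStringTSQL(N'AK,MI,OH,TX', N',');
-- Returns one row per element: AK, MI, OH, TX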
November 2, 2012 at 2:16 pm
If anyone still cares...
I used Jeff's test data (#PostalArea and #CsvTest) and did the split (output = 1,000,000 rows) using the DelimitedSplit8K function and an XML splitter function. The XML method took 2 mins 23 secs, whereas the DelimitedSplit8K method took just 19 secs. This was on my local machine, which is not a speed demon.
Total Execution Time stats in ms (avg of 5 runs)
Rows       XML       Splitter
1,000      218.4     58.8
10,000     1299.6    131.0
100,000    9140.3    1116.8
Of course, these times will vary from machine to machine.
/* this query took 2 mins 23 secs to run for 1,000,000 rows */
IF OBJECT_ID('tempdb..#CsvSplit2') IS NOT NULL
DROP TABLE #CsvSplit2
SELECT csv.RowNum, split.[ID] AS ItemNumber, split.[Value] AS Abbreviation
INTO #CsvSplit2
FROM #CsvTest csv
CROSS APPLY
(
SELECT [ID],[Value]
FROM dbo.tvfParseDelimitedString(csv.CsvParameter,',')
WHERE [ID] > 0
) split
WHERE csv.RowNum > 0
SELECT * FROM #CsvSplit2
/* this query took 19 secs to run for 1,000,000 rows */
IF OBJECT_ID('tempdb..#CsvSplit3') IS NOT NULL
DROP TABLE #CsvSplit3
SELECT csv.RowNum, split.ItemNumber, split.Item AS Abbreviation
INTO #CsvSplit3
FROM #CsvTest csv
CROSS APPLY
(
SELECT ItemNumber, Item
FROM dbo.tvfDelimitedSplit(csv.CsvParameter,',')
) split
SELECT * FROM #CsvSplit3
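For context, the XML splitter timed above (dbo.tvfParseDelimitedString) isn't shown in this thread. A typical XML-based splitter exposing the same ([ID], [Value]) shape might look something like the sketch below; it is illustrative only and not necessarily the function that produced the timings:

CREATE FUNCTION dbo.tvfParseDelimitedString_Sketch
    (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE
AS
RETURN
WITH Converted AS
(
    -- Turn 'a,b,c' into '<x>a</x><x>b</x><x>c</x>' and cast it to XML.
    -- Note: & < > in the input would need to be escaped first; this sketch ignores that.
    SELECT x = CAST('<x>' + REPLACE(@pString, @pDelimiter, '</x><x>') + '</x>' AS XML)
)
SELECT  [ID]    = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
        [Value] = n.x.value('(./text())[1]', 'VARCHAR(8000)')
FROM    Converted c
CROSS APPLY c.x.nodes('/x') AS n(x);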
November 2, 2012 at 3:18 pm
HextallFanForLife (11/2/2012)
Here is a function to do the same thing:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
Create FUNCTION [dbo].[fnParseStringTSQL]
(@string NVARCHAR(MAX),@separator NCHAR(1))
RETURNS @parsedString TABLE (string NVARCHAR(MAX))
AS
BEGIN
DECLARE @position int
SET @position = 1
SET @string = @string + @separator
WHILE charindex(@separator,@string,@position) <> 0
BEGIN
INSERT into @parsedString
SELECT substring(@string, @position, charindex(@separator,@string,@position) - @position)
SET @position = charindex(@separator,@string,@position) + 1
END
RETURN
END
If it has a WHILE loop in it, it's going to be slow.
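For anyone who hasn't seen the loop-free alternative, a minimal tally-style splitter in the spirit of DelimitedSplit8K looks something like the sketch below. The name dbo.SplitTally_Sketch is made up here, and the real DelimitedSplit8K is more heavily tuned; this is only to show the set-based shape of the technique:

CREATE FUNCTION dbo.SplitTally_Sketch
    (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N)   AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)),
     E4(N)   AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d),   -- 10,000 rows, enough for VARCHAR(8000)
     Tally(N) AS (SELECT TOP (ISNULL(DATALENGTH(@pString), 0))
                         ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                  FROM   E4),
     Starts(N1) AS (SELECT 1                               -- the first element always starts at 1
                    UNION ALL
                    SELECT N + 1 FROM Tally                -- each delimiter marks the start of the next element
                    WHERE SUBSTRING(@pString, N, 1) = @pDelimiter)
SELECT  ItemNumber = ROW_NUMBER() OVER (ORDER BY s.N1),
        Item       = SUBSTRING(@pString, s.N1,
                        ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0), 8001) - s.N1)
FROM    Starts s;
-- Usage: SELECT ItemNumber, Item FROM dbo.SplitTally_Sketch('AK,MI,OH', ',');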
--Jeff Moden
Change is inevitable... Change for the better is not.
November 2, 2012 at 3:22 pm
peter.row (11/2/2012)
Hey Diego, nice article. I understand what you meant about the implied performance concerns, and I found Jeff to be very abrasive, virtually to the point of being offensive.
It would be interesting to know why you actually thought that. Care to explain further? Seriously... I don't like to offend folks but I do also like to get points across about not using RBAR to do such things.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 3, 2012 at 10:57 am
Jeff, this article is about calling a stored procedure with multiple parameters, not about retrieving data from a column with values separated by commas. Do you really believe it matters whether one call with a few comma-separated values takes 30 milliseconds or 100% more (60 milliseconds)? I agree that there are ways to optimize the parsing methods, but be real: over a year of use you will save 5 seconds of processor time, and you will spend 1 hour developing it.
November 3, 2012 at 1:08 pm
Jeff Moden (8/11/2010)
If the data is worth having, it's worth validating.
I'm gonna print that out and stick it on the wall of my office and make people read it to me every time they ask me to write a report that will magically filter out or "correct" all the junk data entered into [insert shoddy application here].
...One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important... Bertrand Russell
November 3, 2012 at 5:35 pm
Peter Di (11/3/2012)
Jeff, this article is about calling a stored procedure with multiple parameters, not about retrieving data from a column with values separated by commas. Do you really believe it matters whether one call with a few comma-separated values takes 30 milliseconds or 100% more (60 milliseconds)? I agree that there are ways to optimize the parsing methods, but be real: over a year of use you will save 5 seconds of processor time, and you will spend 1 hour developing it.
I heard precisely that line from a colleague of mine, about a procedure doing just such a simple parameter strip.
Sounds like a compelling argument? Not when the procedure was being run a couple of million times per day.
And yes, the greater performance issue in this case was poor application design, in that the procedure was being run more frequently than necessary. But the point is that even micro-performance counts, depending on the context.
November 3, 2012 at 8:31 pm
Peter Di (11/3/2012)
Jeff, this article is about calling a stored procedure with multiple parameters, not about retrieving data from a column with values separated by commas. Do you really believe it matters whether one call with a few comma-separated values takes 30 milliseconds or 100% more (60 milliseconds)? I agree that there are ways to optimize the parsing methods, but be real: over a year of use you will save 5 seconds of processor time, and you will spend 1 hour developing it.
Very good questions, Peter. Seriously. And apologies for the long-winded answer, but good questions deserve thoughtful answers.
First, my objections weren't just about time/duration. There was nothing in the code to identify when one or more elements of the passed parameter weren't in the comparison table. The design was missing the necessary validation and feedback. That, of course, could be easily fixed, but it would also eat into the hour you're talking about. And if that omission went unnoticed, a missed error could cost much more than time, depending on what the code was actually doing.
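(As an aside, the kind of validation feedback being described can be as simple as a left join from the split output back to the comparison table. The sketch below is illustrative only: @pAbbreviations is a made-up parameter name, and it assumes #PostalArea carries an Abbreviation column, as in the test data mentioned earlier in the thread.)

DECLARE @pAbbreviations VARCHAR(8000) = 'MI,OH,XX';   -- hypothetical input; 'XX' has no match

SELECT  split.Item AS UnmatchedAbbreviation
FROM    dbo.DelimitedSplit8K(@pAbbreviations, ',') AS split
LEFT JOIN #PostalArea AS pa
       ON pa.Abbreviation = split.Item
WHERE   pa.Abbreviation IS NULL;      -- anything returned here would otherwise be silently dropped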
Shifting gears back to the time-related questions:
I agree. One call with a few comma-separated values against a relatively tiny table that takes 60 or even 600 milliseconds isn't going to matter in the grand scheme of an application... until it's no longer one call. If we take the lower number of just 30 milliseconds and put it up against a higher-usage application where it might be called a thousand times per second, we're suddenly talking about 30 seconds of CPU time for every second of wall-clock time; over an hour, that's 30 hours of CPU time (the equivalent of 30 fully busy cores) for a simple task that should take nearly none. If the method is used in many places in the app for each call, you're suddenly talking about a whole lot of CPU time being used for something so simple. Yeah, I know: "Hardware is cheap". We'll talk about that at the end of this.
Ok... agreed. Not everyone is going to have that frequency (1,000 times per second) of usage for an app, and it might, in fact, only be used once in the app. As you correctly point out (I'm right there with you on this), is it really worth spending an hour to make it take only a couple of milliseconds (hopefully less) instead of 30? The answer is patently "No", but not quite in the way you might be thinking. Developers shouldn't have to spend more than a minute or two on something like this, because they should already know that nearly every app is going to need to handle such things and, unless they're a rank newbie, they should have already studied what the best methods are (a combination of 100% accuracy and excellent performance) and have them ready to copy and paste from their library. I guess that's a personal gripe of mine: a lot of people don't study for the job they're supposed to be doing.
Now, let's get to the more insidious side of this. Low usage, low row counts, and supposedly low frequency should never be used as an excuse for "good enough" code, for several reasons. First, and like I pointed out, a Developer should already know what the fast stuff really is and have it at the ready for CPR (Copy, Paste, Replace). Second, there are those developers who haven't taken quite that interest in their job, so when they're hit with a tight schedule, they'll use anything they can find so long as it works on one to ten rows (or calls), never giving consideration to the fact that their work will have to scale.
I just went through this at work, again, with code very similar to that in the article but with the flow of data reversed. Someone wrote code against a table that everyone "knew" wasn't going to grow by much. The table was supposed to have 28 rows in it and it wasn't likely that it would ever even double. They were right; after 2 years, it should have had only 51 rows in it. The trouble is that they didn't want to maintain the table manually, so as the number of possible selections started to grow, they used the very same method they had used previously (to save development time) but against a data table. When I discovered the problem, the data table had grown to 4.7 million rows, and they had "thoughtfully" added a DISTINCT to create the lookup list with a UNION against the original table. The code was only used 1,000 times per hour (low frequency, right?), but each call had grown to more than 2 seconds, for a total of 40 CPU minutes and more than an eighth of a terabyte of reads per hour... all to return just 51 rows for a drill-down menu.
Like I said, it had grown to taking slightly more than 2 seconds to return. Let's see… it gets used by our internal people 1,000 times per hour, times 12 hours per day (extended-coverage phone center), every 21.7 working days of the month. That's over 144 FTE hours per month that we're paying people to wait for a pulldown menu to load. Considering the nearly linear growth of the data table over the two years the code had existed, that's roughly 1,800 hours of wait time we paid people for. Even if a Developer wasn't prepared to handle this eventuality, that's a hell of a trade-off compared to the 1 hour of development time you were talking about.
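(For anyone checking the arithmetic, the figures above work out as follows, using only the numbers quoted in the post.)

-- Back-of-envelope check of the wait-time figures quoted above.
SELECT  CallsPerMonth     = 1000 * 12 * 21.7,                 -- = 260,400 calls per month
        WaitHoursPerMonth = 1000 * 12 * 21.7 * 2 / 3600.0;    -- ~144.7 hours at ~2 seconds per call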
It only took me a half hour to discover this problem. It took me about an hour to write the code to fix it. The reason it took that long is that I didn't want anyone to have to fix any front-end code to support the change, so it had to be 100% transparent to the GUI and it had to continue to be self-maintaining. Just adding the rows to the original table wasn't going to hack that little requirement. Then it had to go through QA testing, where they had to verify not only that the screen was still working correctly but also that the underlying data used to build the menu was being interpreted correctly. That took one person an hour because they're not database people, and in order for them to test what I had done, I had to spend another half hour writing up what I had done and how it worked. That's another 3 hours spent on a problem that could have been avoided by spending an hour to do it right the first time.
For the record, I got each call down to where it belongs, at about 200 microseconds, and the total volume of reads down to just 16 megabytes per hour. There was no rocket science involved in the fix, either. Any developer with just a couple of years of experience could have pulled the same thing off in about an hour of original development time.
So, with apologies to my good friend, Kevin Boles, and to answer your question as to whether or not I believe people should worry about a small comma-delimited list taking "just" 30 milliseconds when used on a relatively infrequent basis against a relatively small number of rows, my answer is a resounding "YES THEY SHOULD!", even if it takes them an hour, because you don't know how the method will eventually be used by someone else. A whole lot of managers really need to learn this particular lesson when they write a development schedule. Compare the one hour invested to the 1,800 hours of wasted employee time and the several hours spent coming up with and testing a fix.
Why did I apologize to Kevin for this? Because he's an expert performance-tuning consultant who gets paid big bucks to fix things like this. There goes more time and money towards something that could have taken only an hour to do right the first time, or even the second time… or even just a couple of minutes by someone who expects these types of development requirements and is prepared to handle them as a part of their everyday job.
Will the particular method in the article ever be exposed to that much growth? Maybe not, but you just don't know for sure, because requirements change. Plan for the worst so that if it does happen, you don't have to make any repairs just because the scale increased. Bullet-proof code just doesn't take that much longer to write, and it's worth every penny down the road.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 4, 2012 at 1:01 am
Damn! I want that on a sign above my desk. 😀 Guess I'll have to settle for a bookmark.
November 4, 2012 at 11:36 am
GPO (11/3/2012)
Jeff Moden (8/11/2010)
If the data is worth having, it's worth validating.
I'm gonna print that out and stick it on the wall of my office and make people read it to me every time they ask me to write a report that will magically filter out or "correct" all the junk data entered into [insert shoddy application here].
😀
--Jeff Moden
Change is inevitable... Change for the better is not.
November 4, 2012 at 2:53 pm
Speaking of garbage in, garbage out: in one case where I had to split a CSV into a list of items, I spent at least an hour trying to understand why some substrings still weren't being split, until I realised that end users were able to choose ANY character as a separator.
So, here we go again: no proper DESIGN on any layer, starting from the UI. And by the way, the source data store was not a LEGACY system; it was one of the brand-new vendor's packaged solutions.
November 5, 2012 at 1:21 am
Jeff Moden (11/2/2012)
peter.row (11/2/2012)
Hey Diego, nice article. I understand what you meant about the implied performance concerns, and I found Jeff to be very abrasive, virtually to the point of being offensive.
It would be interesting to know why you actually thought that. Care to explain further? Seriously... I don't like to offend folks but I do also like to get points across about not using RBAR to do such things.
I think I was having a pre-tea early morning moment and apologise (ironically) for being overly harsh.
I guess it was mainly that people were being chastised for not doing stuff based on assumptions about what their use case might be.
November 5, 2012 at 5:23 am
peter.row (11/5/2012)
Jeff Moden (11/2/2012)
peter.row (11/2/2012)
Hey Diego, nice article. I understand what you meant about the implied performance concerns, and I found Jeff to be very abrasive, virtually to the point of being offensive.
It would be interesting to know why you actually thought that. Care to explain further? Seriously... I don't like to offend folks but I do also like to get points across about not using RBAR to do such things.
I think I was having a pre-tea early morning moment and apologise (ironically) for being overly harsh.
I guess it was mainly that people were being chastised for not doing stuff based on assumptions about what their use case might be.
Thanks for the feedback, Peter. I sometimes have the same problem with coffee. 🙂
Just to clear the air a bit, it's not the immediate use case that I always worry about; it's how other people might use it for something else. It's not always clear to others how badly something can turn out, because they just haven't seen such a thing happen yet, and, as you saw in my recent previous post, it can take a whole lot of typing to avoid sounding harsh.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 5, 2012 at 8:15 am
Jeff, I agree with you. Having thousands or millions of calls to a procedure like this will be a problem. However, stored procedures with multiple parameters are not widely used and usually have a much slower part after the parsing. Also, as you can see in the article, there are other ways to send values instead of using comma-separated strings... so optimizing the parsing is not the thing I would start with.
Anyway, you are right that there is a faster way to parse the values, and it would be foolish to ignore it (especially if you have already tested and prepared the functions for easy implementation). If you don't mind, I will update my article to include one of the methods from your test and a link to the full article.
November 5, 2012 at 8:16 am
Just when we're trying to find a way to improve our data import, this article comes along. Perhaps it is a godsend.
While your article does not answer all of our questions, it may provide a jumping-off point toward a solution once we put our heads to the challenge. Our current method ties up the computer for an hour and a half bringing in the data, and we'd like something faster.
Here is our challenge:
We receive multiple tables of data in a pipe-delimited text file.
Each table begins with 2 rows that identify the columns (one row with a column "name", the second row with a "field-code" for the column).
Thereafter come the data rows until we encounter the next TABLE headers.
The method in this article may pave the way to (1) parse the table headers to build temporary tables, and then (2) insert the table data into the temporary table. The method may also lead us to a way to extract selected columns for insertion into the final data tables.
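To make that first step concrete, splitting one of the pipe-delimited header rows into ordered column names might look something like the sketch below; the sample header value is made up, and it assumes a splitter with the (ItemNumber, Item) shape discussed earlier in this thread:

-- Illustrative only: parse a pipe-delimited header row into ordered column names,
-- which could then drive a dynamic CREATE TABLE for the temporary staging table.
DECLARE @HeaderRow VARCHAR(8000) = 'CustomerID|OrderDate|Amount';   -- hypothetical sample row

SELECT  ColumnOrdinal = split.ItemNumber,
        ColumnName    = split.Item
FROM    dbo.DelimitedSplit8K(@HeaderRow, '|') AS split
ORDER BY split.ItemNumber;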
Thanks, again, for this timely article (at least for us it's timely).
Norm Johnson
"Keep smiling ... it gives your face something happy to do
... and it makes people wonder what you're up to!"