Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Post reply

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 76 through 90 (of 990 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004419 More actions · Answer 1

Anipaul (5/3/2011)
As usual, you have written another excellent article Jeff. It was great one.:-)

Thanks, Anipaul. Good to "see" you again. 🙂

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004419 More actions · Answer 2

adnan.korkmaz (5/3/2011)
Great job! Thank you all those who have written, contributed or inspired Jeff.
I have done some tests on real data before using this new one, executing both under same parameters for 10.000 times and logging the times.
The previous splitter function we used took 1084, 1986 and 633 milliseconds (average, max and min) on a given parameter. This new one takes 286, 416 and 176 milliseconds (again avg, max, min).
It's around 4 times faster in my scenario.
So, just congrats and thanks!

Very cool feedback, Adnan. Is there a chance of you posting your previous splitter function so we can take a gander at the differences the two may have?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004419 More actions · Answer 3

srikant maurya (5/3/2011)
Great Job,
I will use these code for future use.

Thanks, Srikant. I appreciate the feedback.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004419 More actions · Answer 4

ALZDBA (5/3/2011)
It's a pity some obscure character caught the attention of CNN, Bloomberg, Al Jazera, ... the day your great article got published.
For sure it would have been a headliner !
Great code, great benchmark info,
Great job Jeff
Thanks again for sharing the knowledge.

Thanks, Johan. Always good to hear from you and thanks for the laugh. 🙂

As a side bar, I've submitted one abstract for PASS so far and will submit another soon. I can't afford to go unless I present because I have to pay for everything out of my own pocket. If I get selected, it would be a real pleasure to meet with you again. Do you know if you're going?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

adnan.korkmaz SSC-Addicted Points: 421 More actions · Answer 5

Jeff Moden (5/3/2011)
adnan.korkmaz (5/3/2011)
Great job! Thank you all those who have written, contributed or inspired Jeff.
I have done some tests on real data before using this new one, executing both under same parameters for 10.000 times and logging the times.
The previous splitter function we used took 1084, 1986 and 633 milliseconds (average, max and min) on a given parameter. This new one takes 286, 416 and 176 milliseconds (again avg, max, min).
It's around 4 times faster in my scenario.
So, just congrats and thanks!
Very cool feedback, Adnan. Is there a chance of you posting your previous splitter function so we can take a gander at the differences the two may have?

Sure, here it is:

ALTER FUNCTION [dbo].[FastSplit] (@s varchar(512))

RETURNS table

AS

RETURN (

WITH Pieces(pn, start, stop) AS (

SELECT 1, 1, CHARINDEX(',', @s-2)

UNION ALL

SELECT pn + 1, stop + 1, CHARINDEX(',', @s-2, stop + 1)

FROM Pieces

WHERE stop > 0

)

SELECT pn,

SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 512 END) AS ID

FROM Pieces

)

Jeff Moden SSC Guru Points: 1004419 More actions · Answer 6

Except for the way they handled the final element, that's almost identical to the rCTE code I have in the test harness I included with the article. Thanks for posting it, Adnan. 🙂

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Johan Bijnens SSC Guru Points: 135377 More actions · Answer 7

Jeff Moden (5/3/2011)
ALZDBA (5/3/2011)
It's a pity some obscure character caught the attention of CNN, Bloomberg, Al Jazera, ... the day your great article got published.
For sure it would have been a headliner !
Great code, great benchmark info,
Great job Jeff
Thanks again for sharing the knowledge.
Thanks, Johan. Always good to hear from you and thanks for the laugh. 🙂
As a side bar, I've submitted one abstract for PASS so far and will submit another soon. I can't afford to go unless I present because I have to pay for everything out of my own pocket. If I get selected, it would be a real pleasure to meet with you again. Do you know if you're going?

Probably I'll apply for travel approval by the end of this month because SQLPass will have published an agenda by then on which I can build my POI that can serve our needs.

I'll keep you updated.

Like in the previous years, as soon as I get my approval it will show up in my forum signature 😎

Fingers crossed :Whistling:

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data/code to get the best help[/url]

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Need a bit of Powershell? How about this

Who am I ? Sometimes this is me but most of the time this is me

Roy Ernest SSC-Dedicated Points: 38780 More actions · Answer 8

As usual, great post Jeff. Five star article.. 🙂

-Roy

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 9

Jeff, please please please write a book, without changing your style. Give us all an opportunity to repay you for what you've done for us.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

mark hutchinson SSC Eights! Points: 878 More actions · Answer 10

Jeff

Thank you for your patience. I must not be asking my question in the right way. Sorry. I'll try one last time.

There are two parts to the parsing:

* finding delimiter positions

* parsing the string

When I do performance comparisons, I time the individual components. I wanted to know the performance difference between using the CharIndex() in a loop and the Tally Table to populate a temp table of delimiter positions.

Although we would both use an "inchworm" method to parse the string, I was curious about the performance of a cursor on the delimiter table, compared to the NullIf(IsNull()).

As I stated earlier, traditional performance tuning primarily involves simplification, reuse, and pre-calculation. If that methodology and approach is not valid in the TSQL environment, I'll gladly get a new toolbox.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 11

mark hutchinson (5/3/2011)
Jeff
Thank you for your patience. I must not be asking my question in the right way. Sorry. I'll try one last time.
There are two parts to the parsing:
* finding delimiter positions
* parsing the string
When I do performance comparisons, I time the individual components. I wanted to know the performance difference between using the CharIndex() in a loop and the Tally Table to populate a temp table of delimiter positions.
Although we would both use an "inchworm" method to parse the string, I was curious about the performance of a cursor on the delimiter table, compared to the NullIf(IsNull()).
As I stated earlier, traditional performance tuning primarily involves simplification, reuse, and pre-calculation. If that methodology and approach is not valid in the TSQL environment, I'll gladly get a new toolbox.

The biggest reason for not using a while loop or cursor to step through each position of a string looking for the specified delimiter is that you cannot use those in an inline TVF (iTVF). You would have to use a multiline TVF (mTVF) and those are slower.

mark hutchinson SSC Eights! Points: 878 More actions · Answer 12

Lynn

Thanks. Then this is an environmental property of 'compiled' TSQL of which I was not aware.

gary.rumble SSCommitted Points: 1923 More actions · Answer 13

Jeff Moden (5/2/2011)
gary.rumble (5/2/2011)
Well, so far I got:
...
I think your code is too much for my server. 😉
Yowch. The code is even split into batches. What you may have to do is run the code a section at a time up to where the test loop begins and the let the test loop rip.
Thanks for trying, Gary. If for some reason, your server just won't take it, let me know and we'll test your code for you.

I set up a new SS instance on my laptop and got the tests to run. Looks like my code tracks your results fairly closely, but of course it doesn't create the numbers table on the fly so it probably has an advantage there.

I emailed you the results. I couldn't generate your pretty graphs, though.

Thanks for the article.

14090 SW TENNESSEE LN

SQLRNNR SSC Guru Points: 281334 More actions · Answer 14

Thanks Jeff - great stuff

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Nadrek SSC-Insane Points: 20039 More actions · Answer 15

Jeff Moden (5/2/2011)
Well, that makes 3 of us. We have a quorum! 😛 Phil Factor wrote an article fairly recently on all the XML, JSON, and other gobilty-gook and suggested that someone should come up with a better way. I was considering writing an article about the all-too-"secret" characters 28-31 and a couple of other goodies in "control character land". Whatcha tink? Worthwhile or not?
I haven't run it but your real Tally Table splitter looks right. You picked up on everything including the use of the "TOP".

Change your test harness script to use CHAR(30) instead of ',' and start leading by example :); examples are what will help with the control characters. Peoplesoft names, valid uses of tabs in text (formatted text blurbs, for example), valid uses of pipes in VARCHAR fields (program documentation, storing REGEX expressions in a database, secure usernames with special characters allowed), etc.

Also, an easy way to get SQL Server to output SELECT results to a special character delimited VARCHAR() table, and directly to a text file would be very worthwhile for the article. The main uses I see are getting data to/from arbitrary SELECT results into a string with one or two delimiters/dimensions, and the same thing to/from a text file.

As far as my previous single CTE real Tally Table (0 based) splitter, please ignore it. It takes too long on higher numbers of elements, much like the original tally table splitter's problem. Instead, try one or both of the two double CTE versions below:

CREATE FUNCTION [dbo].[YourFunctionUnion](@pString [varchar](8000), @pDelimiter [char](1))

RETURNS TABLE

RETURN

WITH

cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT 0 UNION ALL

SELECT TOP (DATALENGTH(ISNULL(@pString,1))) N FROM YourDB.dbo.YourTallyTable1Based WITH (nolock) -- the 1 based + Union All is almost identical in performance to a 0 based TOP (Datalength()+1)

),

cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

SELECT t.N+1

FROM cteTally t

WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

)

--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1), -- not including StringNumber actually increases duration and CPU, but uses slightly fewer writes

Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

FROM cteStart s WITH (nolock)

;

CREATE FUNCTION [dbo].[YourFunction0Based](@pString [varchar](8000), @pDelimiter [char](1))

RETURNS TABLE

RETURN

WITH

cteTally(N) AS (--==== This limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (DATALENGTH(ISNULL(@pString,1))+1) N FROM YourDB.dbo.YourTallyTable0Based WITH (nolock) -- the 1 based + Union All is almost identical in performance to a 0 based TOP (Datalength()+1)

),

cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

SELECT t.N+1

FROM cteTally t

WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

)

--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1), -- not including StringNumber actually increases duration and CPU, but uses slightly fewer writes

Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

FROM cteStart s WITH (nolock)

;

The UNION version is up to about 5% faster on small but reasonably useful numbers of elements and uses the "standard" 1 based tally table, and the 0 based version is up to about 5% faster on medium to large numbers of elements, and uses a special tally table that starts at 0. Both can be slightly slower than your CTE Tally example on very small numbers of elements/string sizes, depending on some combination of SQL version and platform parameters.