String splitter function for mutli-char delimeters

Question

Post reply

String splitter function for mutli-char delimeters

g.britton

SSChampion

Points: 13863
More actions
November 5, 2014 at 7:26 am

#397610

I've been using Jeff Moden's excellent Tally OH splitter for a while
http://www.sqlservercentral.com/articles/Tally+Table/72993/
In the comments it says:
Optimized for single character delimiter. Multi-character delimiters should be resolved externally from this
function.
However, I know I ran across a similarly optimized, inline function that supports multi-character delimiters. I just can't find it!
Can someone please reply with the link?
Gerald Britton, Pluralsight courses

Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

g.britton SSChampion Points: 13863 More actions · Answer 1

Found it here:

http://www.sqlservercentral.com/articles/String+Manipulation/94365/

Silly me, I was searching for "multi-character delimiters" Had I searched for "patterns" I would have hit it right away!

Gerald Britton, Pluralsight courses

Luis Cazares SSC Guru Points: 183706 More actions · Answer 2

Note that those functions aren't intended for multi-character delimiters.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

g.britton SSChampion Points: 13863 More actions · Answer 3

Luis Cazares (11/5/2014)
Note that those functions aren't intended for multi-character delimiters.

Well....a multi-character delimiter is just a simple pattern without wildcards. So even though that's not the design goal, it is usable for that goal I believe.

Gerald Britton, Pluralsight courses

Alan Burstein SSC Guru Points: 61152 More actions · Answer 4

g.britton (11/5/2014)
I've been using Jeff Moden's excellent Tally OH splitter for a while
http://www.sqlservercentral.com/articles/Tally+Table/72993/
In the comments it says:
Optimized for single character delimiter. Multi-character delimiters should be resolved externally from this
function.
However, I know I ran across a similarly optimized, inline function that supports multi-character delimiters. I just can't find it!
Can someone please reply with the link?

I don't have a link but perhaps this will help... I re-factored delimitedSplit8k some time back for this purpose. I call it LiteralSplit8K. Please be warned - this is a hacked version of Jeff Moden's splitter and has not been tested as thoroughly by anyone on this forum.

ALTER FUNCTION [dbo].[LiteralSplit8K](@pString VARCHAR(8000), @pDelimiter VARCHAR(50))

RETURNS TABLE WITH SCHEMABINDING AS

RETURN

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

),

cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

SELECT 1 UNION ALL

SELECT t.N+(len(@pDelimiter)) FROM cteTally t WHERE SUBSTRING(@pString,t.N,(len(@pDelimiter))) = @pDelimiter

),

cteLen(N1,L1) AS(--==== Return start and length (for use in substring)

SELECT s.N1,

ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

--ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)

FROM cteStart s

)

--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

Item = SUBSTRING(@pString, l.N1, l.L1)

FROM cteLen l;

/*

Example:

SELECT *

FROM dbo.LiteralSplit8K('xxx<br/>rrr<br/>ttt','<br/>')

*/

GO

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Alan Burstein SSC Guru Points: 61152 More actions · Answer 5

emph. mine

g.britton (11/5/2014)
Luis Cazares (11/5/2014)
Note that those functions aren't intended for multi-character delimiters.
Well....a multi-character delimiter is just a simple pattern without wildcards. So even though that's not the design goal, it is usable for that goal I believe.

That is true. I think Luis' point was that you might not get the performance that you want. If you are not dealing with a pattern you could do something like this:

SELECT item

FROM sqltoolbox.dbo.DelimitedSplit8K

(

replace('blah1 \r\ blah2 \r\ blah3','\r\',char(2)),

char(2)

);

... which will get you the same results as this but with better performance:

SELECT item

FROM sqltoolbox.dbo.PatternSplitCM('blah1 \r\ blah2 \r\ blah3','%[\r\]%')

WHERE [Matched] = 0

Both DelimitedSplit8K and PatternSplitCM are the best at what they were intended for but they were not intended for the same thing.

Edit: minor typo in my PatternSplitCM code.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

g.britton SSChampion Points: 13863 More actions · Answer 6

Alan.B (11/5/2014)
emph. mine
g.britton (11/5/2014)
Luis Cazares (11/5/2014)
Note that those functions aren't intended for multi-character delimiters.
Well....a multi-character delimiter is just a simple pattern without wildcards. So even though that's not the design goal, it is usable for that goal I believe.
That is true. I think Luis' point was that you might not get the performance that you want. If you are not dealing with a pattern you could do something like this:

Actually I don't think that was Luis' point at all. He merely pointed out that the pattern splitter expects a pattern in the form of a PATINDEX or LIKE pattern. He said nothing about performance.

In my case, I want to split a string based on a series of recurring characters, e.g.

blah1[series of recurring characters]blah2[series of recurring characters]blah3[series of recurring characters]

I proposed leveraging the pattern splitter function to do that, specifying '%[series of recurring characters]%' as the pattern

Gerald Britton, Pluralsight courses

Luis Cazares SSC Guru Points: 183706 More actions · Answer 7

You're right, it wasn't about performance. It's because the splitter works on a single char basis. Modifying Alan's example could give us undesired results.

SELECT *

FROM dbo.PatternSplitCM('blrah1 \r\ bl\ah2 \r\ blah3','%[\r\]%')

WHERE [Matched] = 0

Results:

ItemNumber Item Matched

------------ ----------------- -----------

1 bl 0

3 ah1 0

5 bl 0

7 ah2 0

9 blah3 0

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

g.britton SSChampion Points: 13863 More actions · Answer 8

yes, of course, but with those caveats it should still be useful

Edit: I'm wrong. It can't be used in the way I imagined.

e.g. try to find hrefs in some html:

SELECT *

FROM dbo.PatternSplitCM('blrah1 <a href="foo"> bl\ah2</a> <a href="bar"> blah3 </a>','%<a%>%')

WHERE [Matched] = 0

returns

ItemNumberItemMatched

1 blrah1 <a href="foo"> bl\ah2</a> <a href="bar"> blah3 </a>0

So I'm going back to Alan's modified splitter.

Gerald Britton, Pluralsight courses