String splitter function for mutli-char delimeters

  • I've been using Jeff Moden's excellent Tally OH splitter for a while

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    In the comments it says:

    Optimized for single character delimiter. Multi-character delimiters should be resolved externally from this

    function.

    However, I know I ran across a similarly optimized, inline function that supports multi-character delimiters. I just can't find it!

    Can someone please reply with the link?

    Gerald Britton, Pluralsight courses

  • Found it here:

    http://www.sqlservercentral.com/articles/String+Manipulation/94365/

    Silly me, I was searching for "multi-character delimiters" Had I searched for "patterns" I would have hit it right away!

    Gerald Britton, Pluralsight courses

  • Note that those functions aren't intended for multi-character delimiters.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (11/5/2014)


    Note that those functions aren't intended for multi-character delimiters.

    Well....a multi-character delimiter is just a simple pattern without wildcards. So even though that's not the design goal, it is usable for that goal I believe.

    Gerald Britton, Pluralsight courses

  • g.britton (11/5/2014)


    I've been using Jeff Moden's excellent Tally OH splitter for a while

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    In the comments it says:

    Optimized for single character delimiter. Multi-character delimiters should be resolved externally from this

    function.

    However, I know I ran across a similarly optimized, inline function that supports multi-character delimiters. I just can't find it!

    Can someone please reply with the link?

    I don't have a link but perhaps this will help... I re-factored delimitedSplit8k some time back for this purpose. I call it LiteralSplit8K. Please be warned - this is a hacked version of Jeff Moden's splitter and has not been tested as thoroughly by anyone on this forum.

    ALTER FUNCTION [dbo].[LiteralSplit8K](@pString VARCHAR(8000), @pDelimiter VARCHAR(50))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1 UNION ALL

    SELECT t.N+(len(@pDelimiter)) FROM cteTally t WHERE SUBSTRING(@pString,t.N,(len(@pDelimiter))) = @pDelimiter

    ),

    cteLen(N1,L1) AS(--==== Return start and length (for use in substring)

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

    --ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l;

    /*

    Example:

    SELECT *

    FROM dbo.LiteralSplit8K('xxx<br/>rrr<br/>ttt','<br/>')

    */

    GO

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • emph. mine

    g.britton (11/5/2014)


    Luis Cazares (11/5/2014)


    Note that those functions aren't intended for multi-character delimiters.

    Well....a multi-character delimiter is just a simple pattern without wildcards. So even though that's not the design goal, it is usable for that goal I believe.

    That is true. I think Luis' point was that you might not get the performance that you want. If you are not dealing with a pattern you could do something like this:

    SELECT item

    FROM sqltoolbox.dbo.DelimitedSplit8K

    (

    replace('blah1 \r\ blah2 \r\ blah3','\r\',char(2)),

    char(2)

    );

    ... which will get you the same results as this but with better performance:

    SELECT item

    FROM sqltoolbox.dbo.PatternSplitCM('blah1 \r\ blah2 \r\ blah3','%[\r\]%')

    WHERE [Matched] = 0

    Both DelimitedSplit8K and PatternSplitCM are the best at what they were intended for but they were not intended for the same thing.

    Edit: minor typo in my PatternSplitCM code.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (11/5/2014)


    emph. mine

    g.britton (11/5/2014)


    Luis Cazares (11/5/2014)


    Note that those functions aren't intended for multi-character delimiters.

    Well....a multi-character delimiter is just a simple pattern without wildcards. So even though that's not the design goal, it is usable for that goal I believe.

    That is true. I think Luis' point was that you might not get the performance that you want. If you are not dealing with a pattern you could do something like this:

    Actually I don't think that was Luis' point at all. He merely pointed out that the pattern splitter expects a pattern in the form of a PATINDEX or LIKE pattern. He said nothing about performance.

    In my case, I want to split a string based on a series of recurring characters, e.g.

    blah1[series of recurring characters]blah2[series of recurring characters]blah3[series of recurring characters]

    I proposed leveraging the pattern splitter function to do that, specifying '%[series of recurring characters]%' as the pattern

    Gerald Britton, Pluralsight courses

  • You're right, it wasn't about performance. It's because the splitter works on a single char basis. Modifying Alan's example could give us undesired results.

    SELECT *

    FROM dbo.PatternSplitCM('blrah1 \r\ bl\ah2 \r\ blah3','%[\r\]%')

    WHERE [Matched] = 0

    Results:

    ItemNumber Item Matched

    ------------ ----------------- -----------

    1 bl 0

    3 ah1 0

    5 bl 0

    7 ah2 0

    9 blah3 0

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • yes, of course, but with those caveats it should still be useful

    Edit: I'm wrong. It can't be used in the way I imagined.

    e.g. try to find hrefs in some html:

    SELECT *

    FROM dbo.PatternSplitCM('blrah1 <a href="foo"> bl\ah2</a> <a href="bar"> blah3 </a>','%<a%>%')

    WHERE [Matched] = 0

    returns

    ItemNumberItemMatched

    1 blrah1 <a href="foo"> bl\ah2</a> <a href="bar"> blah3 </a>0

    So I'm going back to Alan's modified splitter.

    Gerald Britton, Pluralsight courses

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply