Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Tested function with 2 delimiters, works great, although it takes some time to run (20 minutes to run on a table with 887,994 rows whose splitting resulted in 13,397,215 rows).

  • ahpitre (2/13/2013)


    How do you use this function? Also, how can I pass an additional parameter, so it's always inserted into the new table? I have a column named Part. I want the table with the split to include Part (which is repeated for all substrings that are splitted from main string). My final output should be something like this :

    Input_table

    Part Specs

    ---------------------------------

    123 D-dfldkk; P-4987843; D48974587

    456 A-dfldkk; Z-4987843

    Output_table (created by Split function)

    Part Specs

    ---------------------------------

    123 D-dfldkk

    123 P-4987843

    123 D48974587

    456 A-dfldkk

    456 Z-4987843

    By using cross apply. I will show you the code but please PLEASE PLEASE do NOT just blindly use this code. You need to understand what it is doing so you are able to support it. It is your phone that will be ringing at 3am not mine. 😛

    if object_id('tempdb..#Input') is not null

    drop table #Input

    create table #Input

    (

    Part int,

    Specs varchar(50)

    )

    insert #Input

    select 123, 'D-dfldkk; P-4987843; D48974587' union all

    select 456, 'A-dfldkk; Z-4987843'

    --The above is your table

    --To avoid the performance issues with using replace while calling the DelimitedSplit8K function

    --we can do this with a cte

    ;with cte as

    (

    select Part, Replace(Specs, '; ', ';') as Specs

    from #Input

    )

    --Now we just need to retrieve the data

    select cte.Part, s.Item as Specs

    from cte

    cross apply dbo.DelimitedSplit8k(cte.Specs, ';') s

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • THis function works great, but, has the following limitations or things that can be improved :

    1) If the string to be splitted has the delimiter character at the end, then, it returns an additional item/ItemNumber row where item will be empty. For example :

    Searching for ; on a string that contains 'ddkfjdkdjkdfkdjl;1111;' will result in :

    ItemNumber Item

    ------------------------------------------

    1 ddkfjdkdjkdfkdjl

    2 1111

    3

    So, function needs to be fixed to not return the last item if it will result in an empty item (NULL or '').

    2) I managed to put all my delimiters on a table, then, CROSS Join my main table to the Delimiters table. The result allows you to dynamically insert the character being searched for, and, searching for multiple characters without having to hard code them on the SQL command. An example of my 1st step prior to using the delimiter8k function :

    SELECT * FROM dbo.tblDelimiters CROSS JOIN

    dbo.qryDoc_Ref_Specs_to_Doc_Controlling_Parts

    WHERE (dbo.qryDoc_Ref_Specs_to_Doc_Controlling_Parts.[Reference Specs] LIKE N'%' + dbo.tblDelimiters.delimiter)

    I then use the above results as my data source for the CROS APPLY query with the delimited8K function. My query then looks like this :

    SELECT DISTINCT *

    FROM [ESM].[dbo].[qryDoc_Ref_Specs_to_Doc_Controlling_Parts_delimited_Ref_Specs] CROSS APPLY dbo.DelimitedSplit8K([Reference Specs],

    [Reference Specs delimiter])

  • ahpitre (2/18/2013)


    THis function works great, but, has the following limitations or things that can be improved :

    1) If the string to be splitted has the delimiter character at the end, then, it returns an additional item/ItemNumber row where item will be empty. For example :

    Searching for ; on a string that contains 'ddkfjdkdjkdfkdjl;1111;' will result in :

    ItemNumber Item

    ------------------------------------------

    1 ddkfjdkdjkdfkdjl

    2 1111

    3

    So, function needs to be fixed to not return the last item if it will result in an empty item (NULL or '').

    2) I managed to put all my delimiters on a table, then, CROSS Join my main table to the Delimiters table. The result allows you to dynamically insert the character being searched for, and, searching for multiple characters without having to hard code them on the SQL command. An example of my 1st step prior to using the delimiter8k function :

    SELECT * FROM dbo.tblDelimiters CROSS JOIN

    dbo.qryDoc_Ref_Specs_to_Doc_Controlling_Parts

    WHERE (dbo.qryDoc_Ref_Specs_to_Doc_Controlling_Parts.[Reference Specs] LIKE N'%' + dbo.tblDelimiters.delimiter)

    I then use the above results as my data source for the CROS APPLY query with the delimited8K function. My query then looks like this :

    SELECT DISTINCT *

    FROM [ESM].[dbo].[qryDoc_Ref_Specs_to_Doc_Controlling_Parts_delimited_Ref_Specs] CROSS APPLY dbo.DelimitedSplit8K([Reference Specs],

    [Reference Specs delimiter])

    Delimiter at the end means the same thing as a delimiter at the beginning. There's a missing element. Don't fix it in the splitter. Fix it in the code that uses the delimiter output.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    I just sent you an email with some interesting performance test results if you get a chance to have a look.

    😎

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (2/21/2013)


    Hi Jeff,

    I just sent you an email with some interesting performance test results if you get a chance to have a look.

    😎

    Hey, NOT COOL!! We do PUBLIC performance testing around these parts! You test it, you post it! :hehe:

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (2/22/2013)


    Hey, NOT COOL!! We do PUBLIC performance testing around these parts! You test it, you post it! :hehe:

    Right, if I get flamed for this, you have to save me !

    I have crudely highlighted the two winners...the CLR split clearly still way ahead, but the new T-SQL version not too shabby

    Attached testing results data spreadsheet for those that trust me enough to open it (it is safe !)

    And here is the code (with notes removed for brevity) and the changes underlined.

    CREATE FUNCTION [dbo].[DelimitedSplit8KB]

    --===== Define I/O parameters

    (@pString VARCHAR(8000) , @pDelimiter CHAR(1))

    --WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...

    -- enough to cover VARCHAR(8000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1 UNION ALL

    SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    ),

    cteLen(N1,L1) AS(--==== Return start and length (for use in substring)

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter COLLATE Latin1_General_BIN,@pString COLLATE Latin1_General_BIN,s.N1) ,0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;

    Sorry if I have made a mistake, but I think this is valid.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (2/22/2013)


    And here is the code (with notes removed for brevity) and the changes underlined.

    Sorry if I have made a mistake, but I think this is valid.

    Looks good to me - using a binary collation probably speeds up the comparison.

    Something completely different though: the three comments

    --10E+1 or 10 rows

    --10E+2 or 100 rows

    --10E+4 or 10,000 rows max

    are all wrong: "10E" should be "1E" in each case.

    Tom

  • L' Eomot Inversé (2/22/2013)


    mister.magoo (2/22/2013)


    And here is the code (with notes removed for brevity) and the changes underlined.

    Sorry if I have made a mistake, but I think this is valid.

    Looks good to me - using a binary collation probably speeds up the comparison....

    Confirmed. A process I'm currently working on which makes extensive use of the function, runs in less than half the time using MM's modification.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (2/22/2013)


    L' Eomot Inversé (2/22/2013)


    mister.magoo (2/22/2013)


    And here is the code (with notes removed for brevity) and the changes underlined.

    Sorry if I have made a mistake, but I think this is valid.

    Looks good to me - using a binary collation probably speeds up the comparison....

    Confirmed. A process I'm currently working on which makes extensive use of the function, runs in less than half the time using MM's modification.

    Nice to hear confirmation from a real example.

    What bothered me was that feeling that "someone should have thought of this ages ago, shouldn't they"?

    Always makes me nervous when that is the case...

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (2/22/2013)


    ChrisM@Work (2/22/2013)


    L' Eomot Inversé (2/22/2013)


    mister.magoo (2/22/2013)


    And here is the code (with notes removed for brevity) and the changes underlined.

    Sorry if I have made a mistake, but I think this is valid.

    Looks good to me - using a binary collation probably speeds up the comparison....

    Confirmed. A process I'm currently working on which makes extensive use of the function, runs in less than half the time using MM's modification.

    Nice to hear confirmation from a real example.

    What bothered me was that feeling that "someone should have thought of this ages ago, shouldn't they"?

    Always makes me nervous when that is the case...

    Are you saying all you did was add that COLLATE in 2 places???

    Magoo, you've done it again!


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (2/22/2013)


    mister.magoo (2/22/2013)


    ChrisM@Work (2/22/2013)


    L' Eomot Inversé (2/22/2013)


    mister.magoo (2/22/2013)


    And here is the code (with notes removed for brevity) and the changes underlined.

    Sorry if I have made a mistake, but I think this is valid.

    Looks good to me - using a binary collation probably speeds up the comparison....

    Confirmed. A process I'm currently working on which makes extensive use of the function, runs in less than half the time using MM's modification.

    Nice to hear confirmation from a real example.

    What bothered me was that feeling that "someone should have thought of this ages ago, shouldn't they"?

    Always makes me nervous when that is the case...

    Are you saying all you did was add that COLLATE in 2 places???

    Magoo, you've done it again!

    Well, to be fair, I did "add that COLLATE" clause in four times, so it's not like I was slacking off or nothing...:-)

    And, I did test it as well and put the results in Excel, so all in all I need a lie down :doze:

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (2/22/2013)


    What bothered me was that feeling that "someone should have thought of this ages ago, shouldn't they"?

    Always makes me nervous when that is the case...

    That feeling was bothering me too - I can't understand why I didn't spot the opportunity the first time I looked at the code, let alone why Jeff or Paul or another of the top experts who contribute so much to SQLServerCentral didn't do it long ago.

    Looks as if you're the only one who's awake around here, Mr M. 😎

    Tom

  • dwain.c (2/22/2013)


    mister.magoo (2/22/2013)


    ChrisM@Work (2/22/2013)


    L' Eomot Inversé (2/22/2013)


    mister.magoo (2/22/2013)


    And here is the code (with notes removed for brevity) and the changes underlined.

    Sorry if I have made a mistake, but I think this is valid.

    Looks good to me - using a binary collation probably speeds up the comparison....

    Confirmed. A process I'm currently working on which makes extensive use of the function, runs in less than half the time using MM's modification.

    Nice to hear confirmation from a real example.

    What bothered me was that feeling that "someone should have thought of this ages ago, shouldn't they"?

    Always makes me nervous when that is the case...

    Are you saying all you did was add that COLLATE in 2 places???

    Magoo, you've done it again!

    I did use binary COLLATion only in one place and used INT datatype for the Phyical tally table quite a bit of time ago, which showed twice as much of improvement ;-). I did share it here but somehow it got lost in some other debates 🙂 Nice to see it is getting more attention now 😎

  • mister.magoo (2/22/2013)

    WHERE SUBSTRING(@pString,t.N,1) COLLATE Latin1_General_BIN = @pDelimiter COLLATE Latin1_General_BIN

    COLLATE only needs to be on one side of the expression. Either side will do. You don't have to put it on both sides.

  • Viewing 15 posts - 511 through 525 (of 990 total)

    You must be logged in to reply to this topic. Login to reply