Split input string into multicolumn - multirows

  • mister.magoo (12/20/2012)


    Jeff Moden (12/19/2012)


    I'm getting some conflicting results on the performance testing. A 10-year-old single-CPU desktop computer and a laptop probably aren't the best places for me to test. I'll try to reserve a little time to test these on the server at work today (it's just after midnight here).

    Having read recently about the problems with using SET STATISTICS TIME ON, I used a server-side trace to take the measurements when I was testing.

    I also saw some strange results, mostly from Lynn's solution, which for some reason would sometimes have a terrible elapsed time, but I took that to be environmental as it was seemingly random.

    The other thing on this testing is that the data is vertically homogeneous. All of the delimiters are in precisely the same spot on every row, and we've seen that give certain splitting methods a seeming advantage.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • [EDIT FOR CLARIFICATION: When using the DelimitedSplit8K function, it makes no sense to use it on strings longer than 8K, and it will not perform as well as other methods if the datatype is VARCHAR(MAX). In the code posted below, make sure to change the datatypes to VARCHAR(8000) when the DelimitedSplit8K function is involved. Thanks to Jeff for pointing this out.]

    OK...I came up with another variation of the XML splitter function. I also wanted to compare the various methods based on the size of the sample string to parse AND to compare methods when the size of the values in the value pairs was not constant. I wrote a function that created value pairs, first as just numerics 0-9 and then as alphanumerics. Both generated rows in which the value pairs were random and different for each row. I've posted the various functions and scripts below along with the time comparisons.

    First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.

    IF OBJECT_ID('tempdb..#Strings') IS NOT NULL
      DROP TABLE #Strings

    CREATE TABLE #Strings (ID INT IDENTITY, MyString VARCHAR(MAX))

    ;WITH Tally (n) AS (
      SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
      FROM sys.all_columns a CROSS JOIN sys.all_columns b)
    INSERT INTO #Strings
    SELECT CAST((
      SELECT CAST(n AS VARCHAR(5)) + ';' + CAST(40000+n AS VARCHAR(6)) + '|'
      FROM Tally
      FOR XML PATH('')) AS VARCHAR(MAX))
    FROM Tally

    The first row looked like this (and all the rows were the same):

    MyString

    1;40001|2;40002|3;40003| [...] |997;40997|998;40998|999;40999|1000;41000|
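    As a plain-language reference for what every splitter in this thread is expected to compute from such a string, here is a small Python sketch (not part of the original post; the function name is illustrative):

```python
def split_2dim(s, row_delim="|", field_delim=";"):
    """Split 'a;b|c;d|' into [('a', 'b'), ('c', 'd')] -- the result every
    T-SQL splitter compared in this thread is expected to produce."""
    rows = [r for r in s.split(row_delim) if r != ""]  # drop the trailing empty row
    return [tuple(r.split(field_delim)) for r in rows]

print(split_2dim("1;40001|2;40002|3;40003|"))
# [('1', '40001'), ('2', '40002'), ('3', '40003')]
```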

    Performance:

    (100 row(s) affected) ******************************************************

    DelimitedSplit8K (Lynn Pettis)
    SQL Server Execution Times: CPU time = 391 ms, elapsed time = 1129 ms.

    PatternSplitCM (Dwain.C)
    SQL Server Execution Times: CPU time = 750 ms, elapsed time = 22446 ms.

    XML (Mister Magoo)
    SQL Server Execution Times: CPU time = 421 ms, elapsed time = 437 ms.

    DelimitedSplit_XML2 (Steven Willis) (code for the function is below)
    SQL Server Execution Times: CPU time = 32 ms, elapsed time = 23 ms.

    (1000 row(s) affected) ******************************************************

    DelimitedSplit8K (Lynn Pettis)
    SQL Server Execution Times: (I gave up after 7 minutes)

    PatternSplitCM (Dwain.C)
    SQL Server Execution Times: CPU time = 40157 ms, elapsed time = 334170 ms.

    XML (Mister Magoo)
    SQL Server Execution Times: CPU time = 39891 ms, elapsed time = 40201 ms.

    DelimitedSplit_XML2 (Steven Willis)
    SQL Server Execution Times: CPU time = 1453 ms, elapsed time = 1573 ms.

    (10000 row(s) affected) ******************************************************

    [All other methods took more than 7 minutes, which was my "give up" threshold.]

    DelimitedSplit_XML2 (Steven Willis)
    SQL Server Execution Times: CPU time = 152500 ms, elapsed time = 229167 ms.

    The second series of tests used a function to generate random data of different lengths for each item of each value pair and for each row. I generated test data with 100 rows and 1,000 rows. In this test, the results were quite different!

    IF OBJECT_ID('tempdb..#Strings') IS NOT NULL
      DROP TABLE #Strings

    CREATE TABLE #Strings (ID INT IDENTITY, MyString VARCHAR(MAX))

    DECLARE @X INT
    SET @X = 1
    WHILE @X <= 100
    BEGIN
      INSERT INTO #Strings
      SELECT CAST((
        SELECT TOP(100)
          dbo.svfGenerateUniqueCode
            (0,ISNULL(NULLIF(FLOOR(CAST(RAND(t1.N*10000000) AS DECIMAL(18,4))*10),0),10),'')
          + ';' +
          dbo.svfGenerateUniqueCode
            (0,ISNULL(NULLIF(FLOOR(CAST(RAND((t1.N+100)*10000000) AS DECIMAL(18,4))*10),0),10),'')
          + '|'
        FROM dbo.Tally AS t1
        FOR XML PATH('')) AS VARCHAR(MAX))
      SET @X = @X+1
    END

    The test rows looked like this:

    MyString (each row a different set of random numbers of variable length which will be different for every run)

    ID MyString

    1 5061;1552|4237254;407035|27;5|042;617152|6253541;336|3;6|234;646436|1004512557;704|...

    2 4535;501356|5561164;47|100;2|0764;6|4041310;051|45;4|47;1621|3;605|2;2477|...

    3 270301173;14|60;20751|44273;327607|2444234;663274273|2;404426|4;165211|...

    Performance:

    (100 row(s) affected) ******************************************************

    DelimitedSplit8K (Lynn Pettis)
    SQL Server Execution Times: CPU time = 593 ms, elapsed time = 617 ms.

    PatternSplitCM (Dwain.C)
    SQL Server Execution Times: CPU time = 594 ms, elapsed time = 5392 ms.

    XML (Mister Magoo)
    SQL Server Execution Times: CPU time = 421 ms, elapsed time = 412 ms.

    DelimitedSplit_XML2 (Steven Willis)
    SQL Server Execution Times: CPU time = 625 ms, elapsed time = 640 ms.

    (1000 row(s) affected) ******************************************************

    DelimitedSplit8K (Lynn Pettis)
    SQL Server Execution Times: CPU time = 4922 ms, elapsed time = 14138 ms.

    PatternSplitCM (Dwain.C)
    SQL Server Execution Times: CPU time = 8156 ms, elapsed time = 172964 ms.

    XML (Mister Magoo)
    SQL Server Execution Times: CPU time = 4312 ms, elapsed time = 4481 ms.

    DelimitedSplit_XML2 (Steven Willis)
    SQL Server Execution Times: CPU time = 6281 ms, elapsed time = 6835 ms.

    For random alphanumerics:

    MyString (each row a different set of random alphanumerics of variable length which will be different for every run)

    ID MyString

    1 Uu6xS;e7Gd|cQm;J|DQ;BgU|F4;f|UqEqZx;fWK|E0k1WRan;h|Pz;1fd|...

    2 fEsK;mUN20WCD|ehSK;SFNB5|YN5mgLY;QjAVec6|5;fq|R3z;4uUS|...

    3 5hnr;6P|Yvqb6Pn;P3r|vW4q;h6G|uuaHXx7Fwq;5|QS;gs7|xgn7K;4cCebC|...

    Performance:

    (100 row(s) affected) ******************************************************

    DelimitedSplit8K (Lynn Pettis)
    SQL Server Execution Times: CPU time = 640 ms, elapsed time = 371 ms.

    PatternSplitCM (Dwain.C)
    SQL Server Execution Times: CPU time = 328 ms, elapsed time = 1523 ms.

    XML (Mister Magoo)
    SQL Server Execution Times: CPU time = 454 ms, elapsed time = 466 ms.

    DelimitedSplit_XML2 (Steven Willis)
    SQL Server Execution Times: CPU time = 578 ms, elapsed time = 585 ms.

    (1000 row(s) affected) ******************************************************

    DelimitedSplit8K (Lynn Pettis)
    SQL Server Execution Times: CPU time = 5125 ms, elapsed time = 14625 ms.

    PatternSplitCM (Dwain.C)
    SQL Server Execution Times: CPU time = 4281 ms, elapsed time = 58210 ms.

    XML (Mister Magoo)
    SQL Server Execution Times: CPU time = 4313 ms, elapsed time = 4305 ms.

    DelimitedSplit_XML2 (Steven Willis)
    SQL Server Execution Times: CPU time = 5937 ms, elapsed time = 5955 ms.

    The test code used to run each test:

    DECLARE @Holder1 VARCHAR(MAX), @Holder2 VARCHAR(MAX)

    PRINT 'DelimitedSplit8K (Lynn Pettis)'
    SET STATISTICS TIME ON
    SELECT
      @Holder1 = MAX(CASE ds2.ItemNumber WHEN 1 THEN ds2.Item ELSE '' END) --as field1
      ,@Holder2 = MAX(CASE ds2.ItemNumber WHEN 2 THEN ds2.Item ELSE '' END) --as field2
    FROM #Strings
    CROSS APPLY dbo.DelimitedSplit8K(MyString,'|') ds1
    CROSS APPLY dbo.DelimitedSplit8K(ds1.Item,';') ds2
    WHERE ds1.Item <> ''
    GROUP BY ID, ds1.ItemNumber ;
    SET STATISTICS TIME OFF

    PRINT 'PatternSplitCM (Dwain.C)'
    SET STATISTICS TIME ON
    SELECT
      @Holder1 = MAX(CASE ItemNumber % 4 WHEN 1 THEN Item END)
      ,@Holder2 = MAX(CASE ItemNumber % 4 WHEN 3 THEN Item END)
    FROM #Strings
    CROSS APPLY PatternSplitCM(MyString,'[0-9]')
    WHERE [Matched] = 1
    GROUP BY ID, ItemNumber / 4
    --OPTION (MAXDOP 1)
    SET STATISTICS TIME OFF

    PRINT 'XML (Mister Magoo)'
    GO
    SET STATISTICS TIME ON
    GO
    DECLARE @Holder1 VARCHAR(MAX), @Holder2 VARCHAR(MAX)
    SELECT
      @Holder1 = nd.value('./@a','varchar(200)')
      ,@Holder2 = nd.value('./@b','varchar(200)')
    FROM
    (
      SELECT CAST('<c a="' + REPLACE(REPLACE(MyString,'|','"/><c a="'),';','" b="') + '"/>' AS XML).query('.')
      FROM #Strings
      FOR XML PATH(''), TYPE
    ) AS src (nds)
    CROSS APPLY nds.nodes('c') AS x (nd)
    WHERE nd.exist('./@b') = 1
    GO
    SET STATISTICS TIME OFF
    GO

    PRINT 'DelimitedSplit_XML2 (Steven Willis)'
    GO
    SET STATISTICS TIME ON
    GO
    DECLARE @Holder3 VARCHAR(MAX), @Holder4 VARCHAR(MAX)
    SELECT
      @Holder3 = Col1
      ,@Holder4 = Col2
    FROM
    (
      SELECT ID, dsk1.Col1, dsk1.Col2
      FROM #Strings a
      CROSS APPLY dbo.tvfDelimitedSplitXML2(MyString,'|',';') AS dsk1
    ) r1
    GO
    SET STATISTICS TIME OFF
    GO
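    As an aside (not in the original post), the core trick of the XML (Mister Magoo) query above -- two nested REPLACEs that turn each 'a;b|' pair into a self-closing tag with two attributes -- can be illustrated outside SQL Server with a short Python sketch:

```python
def to_xml_fragment(s):
    """Mirror of the T-SQL expression
    '<c a="' + REPLACE(REPLACE(MyString,'|','"/><c a="'),';','" b="') + '"/>'
    -- the inner REPLACE closes each pair at the row delimiter, the outer
    one splits the pair into @a and @b attributes."""
    body = s.replace("|", '"/><c a="').replace(";", '" b="')
    return '<c a="' + body + '"/>'

print(to_xml_fragment("1;40001|2;40002|"))
# <c a="1" b="40001"/><c a="2" b="40002"/><c a=""/>
```

    The trailing empty element produced by the final delimiter has no @b attribute, which is exactly what the WHERE nd.exist('./@b') = 1 filter in the query discards.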

    ANOTHER splitter function: 😛

    /* This function will also split 1-dimensional arrays if Delimiter2 is left blank */
    CREATE FUNCTION [dbo].[tvfDelimitedSplitXML2]
    (
      @InputString VARCHAR(MAX)
      ,@Delimiter1 CHAR(1)
      ,@Delimiter2 CHAR(1)
    )
    RETURNS @Elements TABLE
    (
      ItemNumber INT IDENTITY(1,1) NOT NULL
      ,Col1 VARCHAR(MAX) NULL
      ,Col2 VARCHAR(MAX) NULL
      ,PRIMARY KEY (ItemNumber)
    )
    WITH SCHEMABINDING
    AS
    BEGIN
      DECLARE
        @X VARCHAR(MAX)
        ,@XML XML
        ,@NumElements INT
        ,@Counter INT

      -- Strip a leading/trailing primary delimiter
      IF RIGHT(@InputString,1) = @Delimiter1
        SET @InputString = LEFT(@InputString,LEN(@InputString)-1)
      IF LEFT(@InputString,1) = @Delimiter1
        SET @InputString = RIGHT(@InputString,LEN(@InputString)-1)

      IF NULLIF(@Delimiter2,'') IS NULL
      BEGIN
        -- Parse the 1-dimensional delimited string array
        SET @X = '<root><s>' + REPLACE(@InputString,@Delimiter1,'</s><s>') + '</s></root>'
        -- Convert the string into XML
        SET @XML = CONVERT(XML,@X)
        -- Select the rows from the array
        INSERT INTO @Elements (Col1)
        SELECT T.c.value('.','VARCHAR(MAX)') AS [Value]
        FROM @XML.nodes('/root/s') T (c)
      END
      ELSE
      BEGIN
        -- Parse the 2-dimensional delimited string array
        SET @X = '<s1>' + REPLACE(@InputString,@Delimiter1,'</s2><s1>') + '</s2>'
        SET @X = REPLACE(@X,@Delimiter2,'</s1><s2>')
        SET @X = '<root>' + @X + '</root>'
        -- Convert the string into XML
        SET @XML = CONVERT(XML,@X)
        -- Insert for each pair in the array
        INSERT INTO @Elements
        SELECT Col1, Col2
        FROM
          (SELECT ROW_NUMBER() OVER(ORDER BY Col1) AS RowID1, Col1
           FROM (SELECT T.c.value('.','VARCHAR(MAX)') AS [Col1] FROM @XML.nodes('/root/s1') T (c)) AS Col1
          ) AS Col1
        INNER JOIN
          (SELECT ROW_NUMBER() OVER(ORDER BY Col2) AS RowID2, Col2
           FROM (SELECT T.c.value('.','VARCHAR(MAX)') AS [Col2] FROM @XML.nodes('/root/s2') T (c)) AS Col2
          ) AS Col2
        ON Col1.RowID1 = Col2.RowID2
      END
      RETURN
    /*
    Usage:

    SELECT * FROM dbo.tvfDelimitedSplitXML2(
      '30;38469|31;38470|32;38471|33;38472|34;38473|35;38474|36;38475|37;38476|38;38477|'
      ,'|',';')
    WHERE ItemNumber > 0

    SELECT * FROM dbo.tvfDelimitedSplitXML2(
      'Uu6xS;e7Gd|cQm;J|DQ;BgU|F4;f|UqEqZx;fWK|E0k1WRan;h|Pz;1fd|wjZchH;meYAb|w1U7uhj;rusCg'
      ,'|',';')
    WHERE ItemNumber > 0

    SELECT * FROM dbo.tvfDelimitedSplitXML2(
      'Element01,Element02,Element03,Element04,Element05'
      ,',','')
    WHERE ItemNumber > 0
    */
    END
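    For readers who want to trace the 2-dimensional branch outside T-SQL, here is a rough Python mirror (an illustration, not a translation: it zips the s1/s2 node lists positionally, which is what the ROW_NUMBER join above is intended to do; it also strips all leading/trailing primary delimiters, where the T-SQL trims at most one on each side):

```python
import xml.etree.ElementTree as ET

def split_xml2(s, d1="|", d2=";"):
    """Mirror of tvfDelimitedSplitXML2's 2-dim branch: rewrite the input
    into alternating <s1>/<s2> elements, parse, then pair the lists up.
    Like the T-SQL, it breaks on values containing XML-special characters."""
    s = s.strip(d1)
    x = "<s1>" + s.replace(d1, "</s2><s1>") + "</s2>"
    x = "<root>" + x.replace(d2, "</s1><s2>") + "</root>"
    root = ET.fromstring(x)
    col1 = [e.text or "" for e in root.iter("s1")]
    col2 = [e.text or "" for e in root.iter("s2")]
    return list(zip(col1, col2))

print(split_xml2("30;38469|31;38470|"))
# [('30', '38469'), ('31', '38470')]
```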

    Other code for reference:

    --Used to pass 'NEWID()' into the function below
    CREATE VIEW [dbo].[vwRandomGUID]
    AS
    SELECT NEWID() AS RandomGUID
    GO

    --Generates random values
    CREATE FUNCTION [dbo].[svfGenerateUniqueCode]
    (
      @CodeMinLength INT
      ,@CodeMaxLength INT
      ,@SpecialChar VARCHAR(50)
    )
    RETURNS VARCHAR(100)
    AS
    BEGIN
      DECLARE @Code VARCHAR(100)
      DECLARE @CodeData TABLE (CodeChar VARCHAR(1))
      DECLARE @Num TABLE (Digit INT NOT NULL PRIMARY KEY CLUSTERED)

      IF @CodeMaxLength <= @CodeMinLength
        SET @CodeMaxLength = @CodeMinLength + 1

      -- Digits 0-9
      INSERT INTO @Num (Digit)
      SELECT Digit = 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
      ORDER BY 1

      -- Pick 100 random characters from the character pool
      INSERT INTO @CodeData (CodeChar)
      SELECT CodeChar = SUBSTRING(b.Characters,a.RAND_INT % b.MOD,1)
      FROM
      (
        SELECT
          aa.Number
          ,RAND_INT = ABS(CONVERT(INT,CONVERT(VARBINARY(100),(SELECT RandomGUID FROM dbo.vwRandomGUID))))
        FROM
        (
          SELECT Number = a.Digit + (b.Digit * 10) FROM @Num a CROSS JOIN @Num b
        ) aa
      ) a
      CROSS JOIN
      (
        SELECT
          MOD = LEN(bb.Characters) - 1
          ,bb.Characters
        FROM
        (
          SELECT Characters =
            'ABCDEFGHJKLMNPQURSUVWXYZ'
            + 'abcdefghjkmnpqursuvwxyz'
            + '0123456789'
            + @SpecialChar
        ) bb
      ) b
      ORDER BY (SELECT RandomGUID FROM dbo.vwRandomGUID)

      SELECT @Code = ''
      SELECT @Code = @Code + CodeChar FROM @CodeData

      -- Random length from MIN to MAX characters
      SELECT @Code = SUBSTRING(@Code,1,@CodeMinLength + (ABS(CONVERT(INT,CONVERT(VARBINARY(100),(SELECT RandomGUID FROM dbo.vwRandomGUID))))) % (@CodeMaxLength - @CodeMinLength + 1))

      SET @Code = NULLIF(LTRIM(RTRIM(@Code)),'')
      RETURN @Code
    END
    GO
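    A rough Python analogue of what svfGenerateUniqueCode does (my illustrative sketch, not the T-SQL's exact mechanics: it shuffles a character pool and cuts it to a random length between the min and max; the pool string is copied verbatim from the function above):

```python
import random

def generate_code(min_len, max_len, special=""):
    """Return a random code whose length falls in [min_len, max_len]."""
    if max_len <= min_len:          # same guard as the T-SQL
        max_len = min_len + 1
    pool = list("ABCDEFGHJKLMNPQURSUVWXYZ"
                "abcdefghjkmnpqursuvwxyz"
                "0123456789" + special)
    random.shuffle(pool)            # stands in for the ORDER BY NEWID()
    length = min_len + random.randrange(max_len - min_len + 1)
    return "".join(pool[:length])
```

    Because the T-SQL shuffles via ORDER BY on a NEWID() from the view (scalar functions can't call NEWID() directly), a view wrapper is needed there; plain `random` suffices here.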

  • Steven Willis (12/21/2012)


    First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.

    There is a major fault with this testing. It's a well-advertised fact that the DelimitedSplit8K method doesn't work well with blobs. In fact, any method that internally joins to a blob at the character level is guaranteed to run much slower than other methods such as XML. That fact is even stated in the "Tally OH!" article.

    Methods like XML, however, are (with the understanding that I've not had the time to do the testing the code on this thread deserves) typically much slower than the DelimitedSplit8K method on datatypes of VARCHAR(8000) or less because of the required concatenations.

    --Jeff Moden



  • To wit, and it's been proven time and again, if you can use CLR on your server, a CLR splitter is going to be the best way to go on this. It will handle both VARCHAR and NVARCHAR without having to make any special adjustments, runs just fine against blob datatypes, and it runs a bit more than twice as fast as DelimitedSplit8K even when it's used properly.

    --Jeff Moden



  • murthyvs (12/11/2012)


    Hello all - I am having a hard time splitting an input string into multicolumn - multirows.

    Task - Create a stored procedure that reads an input string with pre-defined field and row terminators; splits the string into multicolumn - multirows; and inserts records into a table.

    So, a couple of questions remain to solve this for you properly...

    1. Can you use CLR on your servers or not?

    2. If not, what is the guaranteed absolute maximum length of your delimited parameters?

    --Jeff Moden



  • Jeff Moden (12/21/2012)


    Steven Willis (12/21/2012)


    First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.

    There is a major fault with this testing. It's a well-advertised fact that the DelimitedSplit8K method doesn't work well with blobs. In fact, any method that internally joins to a blob at the character level is guaranteed to run much slower than other methods such as XML. That fact is even stated in the "Tally OH!" article.

    Methods like XML, however, are (with the understanding that I've not had the time to do the testing the code on this thread deserves) typically much slower than the DelimitedSplit8K method on datatypes of VARCHAR(8000) or less because of the required concatenations.

    Just to be clear, I am not suggesting that the XML method is in any way a replacement for the DelimitedSplit8K function, which is blindingly fast.

    I am suggesting that it may be useful in this particular exercise because the length of the input may exceed 8000.

    MM





  • mister.magoo (12/22/2012)


    Jeff Moden (12/21/2012)


    Steven Willis (12/21/2012)


    First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.

    There is a major fault with this testing. It's a well-advertised fact that the DelimitedSplit8K method doesn't work well with blobs. In fact, any method that internally joins to a blob at the character level is guaranteed to run much slower than other methods such as XML. That fact is even stated in the "Tally OH!" article.

    Methods like XML, however, are (with the understanding that I've not had the time to do the testing the code on this thread deserves) typically much slower than the DelimitedSplit8K method on datatypes of VARCHAR(8000) or less because of the required concatenations.

    Just to be clear, I am not suggesting that the XML method is in any way a replacement for the DelimitedSplit8K function, which is blindingly fast.

    I am suggesting that it may be useful in this particular exercise because the length of the input may exceed 8000.

    Not to worry, ol' friend. I absolutely understood that from the beginning. That's why I said "N-i-i-i-i-c-c-c-c-e-e-e!" about your code post previously. It was "thinking out of the box" at its best.

    I just want everyone to know that the DelimitedSplit8K function has the term "8K" in it for a bloody good reason. It's absolutely not meant to split blobs and, if modified to do so, is going to lose just about any race in a terrible fashion, as would any code that joins to a blob at the character level.

    --Jeff Moden



  • Jeff Moden (12/21/2012)


    murthyvs (12/11/2012)


    Hello all - I am having a hard time splitting an input string into multicolumn - multirows.

    Task - Create a stored procedure that reads an input string with pre-defined field and row terminators; splits the string into multicolumn - multirows; and inserts records into a table.

    So, a couple of questions remain to solve this for you properly...

    1. Can you use CLR on your servers or not?

    2. If not, what is the guaranteed absolute maximum length of your delimited parameters?

    Wow .. lots of activity when I wasn't around. :) Thanks a lot, gurus! I certainly learned a whole lot in the last hour reading the entire thread.

    Here are my answers:

    1. Sorry, I have no clue about CLR - I am a newbie with a serious appetite to learn new things.

    2. The max characters for field1 = 10; field2 = 20; possible field delimiters = ";", " tab "; row delimiters = "|", "new line".

    The total string length is at most 50K, although 100K would be nice to have.

    Thanks to everyone.
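    Worth noting (my sketch, not from the thread): with two possible field delimiters (";" or tab) and two possible row delimiters ("|" or newline), a cheap normalization pass reduces the input to a single delimiter of each kind before handing it to any of the single-delimiter splitters above:

```python
def normalize(s):
    """Map the alternate delimiters (tab, newline) onto the primary ones
    (';', '|') so a two-delimiter splitter only ever sees one of each."""
    return s.replace("\t", ";").replace("\n", "|").replace("\r", "")

print(normalize("a\tb\nc;d|"))  # a;b|c;d|
```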

  • Jeff Moden (12/21/2012)


    Steven Willis (12/21/2012)


    First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.

    There is a major fault with this testing. It's a well-advertised fact that the DelimitedSplit8K method doesn't work well with blobs. In fact, any method that internally joins to a blob at the character level is guaranteed to run much slower than other methods such as XML. That fact is even stated in the "Tally OH!" article.

    Methods like XML, however, are (with the understanding that I've not had the time to do the testing the code on this thread deserves) typically much slower than the DelimitedSplit8K method on datatypes of VARCHAR(8000) or less because of the required concatenations.

    Thanks for your input Jeff. I was aware of the VARCHAR(MAX) issue and neglected to note that I did use VARCHAR(8000) whenever possible. Only on the later test runs for the XML versions did I switch to VARCHAR(MAX) and that is the code that got posted. For the XML versions the 8000 limit was easily exceeded and I was getting XML errors because the strings were getting truncated and creating "bad" XML that the XML handler wouldn't accept.

    I tested each method both ways with the given inputs and reported the best result for each test run. I also tried testing an older version of the tally table splitter using an actual tally table but the performance was so poor compared to ANY of the other methods that I didn't bother to report the results. All of this was done on my work laptop so performance would certainly improve for all the methods on a faster server.

    For various reasons, I need to do array splits often on data I do not directly own. (I work mainly on other people's databases, for which I usually had no design input.) It is particularly annoying to see a delimited list stored in a varchar column that I have to split, process, then put back together again using a coalesce function because of someone's poor design. I also need to process data that is posted from websites that have multiselect dropdowns and/or checkbox lists--both of which produce delimited arrays as output that my procedures must process and convert to proper row/column format. Most of my procedures use your DelimitedSplit8K function and that has always seemed to be the fastest solution. Whenever I have to revisit an old procedure and see an older split function, I replace it with a DelimitedSplit8K version if time and budget allow.

    However, in this case we were dealing with a two-dimensional array situation. I posted a solution somewhere around page 2 or 3 above that I have used frequently. That method basically used DelimitedSplit8K to do two CROSS APPLYs to split the data. When speed tests compared that method to some of the other methods it did not perform as well. So I wanted to explore alternatives. I have used the XML splitter method on one-dimensional arrays before and based on the performance of "Mr Magoo's" method I decided to try and incorporate that into one function that could handle both one- and two-dimensional arrays. But I wanted to see how each performed and if my new function would perform any better than what I had been using.

    So I did the testing and posted my results. I realize no testing situation is perfect because each method has its pros and cons depending on the form the input data takes and the scale of the split being performed. I think my test results--as imperfect as they may be--reinforced the idea that each split method will perform differently in different situations. I was neither trying to come up with the "perfect" universal splitter nor trying to prove any of these methods were bad. I think they are all good and each needs to be tested as possible alternatives when faced with any delimited split requirement.

     

  • Understood.

    I agree that the "old" Tally Table method was terrible on performance, but that was almost entirely because of the concatenation of delimiters that most folks (including myself in the early days) used. That was the whole purpose of the "Tally OH!" article... to get rid of that nasty little dependency.

    Understood and agreed on the XML truncation errors on the 8K inputs. To wit, I'd never use anything but VARCHAR(MAX) on the input of such a function even if it was supposedly guaranteed to only ever receive a thousand bytes.

    For various reasons, I need to do array splits often on data I do not directly own. (I work mainly on other people's databases for which I usually had no design input.) It is particularly annoying to see a delimited list stored in a varchar column that I have to split, process, then put back together again using a coalesce function because of someone's poor design. I also need to process data that is posted from websites that have multiselect dropdowns and/or checkbox lists--both of which produce delimited arrays as output that my procedures must process and convert to proper row/column format.

    Heh... I almost cry when I see something like that (storing delimited data), and that also includes when I see someone storing XML in a database, for all the same reasons, not to mention the additional complexity and resource usage that comes about because XML is hierarchical in nature even when used on supposedly "flat" data.

    Most of my procedures use your DelimitedSplit8K function and that has always seemed to be the fasted solution. Whenever I have to revisit an old procedure and see an older split function I replace it with a DelimitedSplit8K version if time and budget allow.

    I'm honored :blush: and glad to have been able to help. To be clear, though, that baby isn't mine. It was actually developed over time by many people with some great inputs. The latest version (which is posted in the article itself, now) doesn't use the precise method that the article originally used. The article showed how to get rid of the concatenation, and a couple of good folks in the discussion took that huge improvement and made it even faster. Truly, DelimitedSplit8K is a community effort of which I'm honored to have been a part.

    However, in this case we were dealing with a two-dimensional array situation. I posted a solution somewhere around page 2 or 3 above that I have used frequently. That method basically used DelimitedSplit8K to do two CROSS APPLYs to split the data. When speed tests compared that method to some of the other methods it did not perform as well. So I wanted to explore alternatives.

    THAT's part of the reason why I've been watching this thread with great interest. A lot of people haven't been exposed to it, but a lot of the world (apparently) revolves around CSVs that have been created by spreadsheets (DoubleClick.net, for example, provides their data in such a fashion). The data is, in fact, multi-dimensional (4 dimensions, in this case) and the number of columns passed is frequently unknown (although the columns do have a pattern expressed in the column headers). I've written several custom splitters/normalizers for such data, but it's always interesting (to me, anyway) to see how other people approach the problem. I agree that, although using DelimitedSplit8K in a CROSS APPLY for each dimension is a good generic solution, it's not exactly the bee's knees when it comes to performance, so other solutions are welcome.

    To wit, my inputs on this thread shouldn't be considered to be defensive in nature. I just didn't want people to think that DelimitedSplit8K is super slow just because they don't understand that it wasn't meant to be used against MAX datatypes or if it's used incorrectly. Neither do I want them to think that DelimitedSplit8K is the only way because, patently, it is not.

    I'm also (perhaps overly) concerned when it comes to testing solutions. For example, lots of folks were using the same row of data, which causes "vertical grooving" in the logic that sometimes works in favor of one method or another and gives the false impression that it's faster. Again, my inputs are only meant as suggestions, and I was very happy to see someone make that particular change.

    Anyway, thanks for the feedback on what you're doing. I'll eventually get to the point where I can setup my own tests for some of the solutions on this thread and apologize for not yet doing so. I'm so far behind with things that I need to be twins to catch up. 😛

    --Jeff Moden



  • I've done some extensive testing using a modified version of the test harness from Jeff Moden's Tally Oh! article to test the various split methods in this thread. My test tables consist of randomly generated alphanumeric value pairs. Each iteration of the test creates an increasing number of value pairs per row and also increases the size of each value in each pair in an orderly progression. I did some testing with Jeff's original test data and also some testing with VARCHAR(MAX) data. But for the last run of the test (#17) it is all VARCHAR(8000).

    That last run is significant. Up to that point the CLR Split was ALWAYS fastest. Of the non-CLR methods, Mr. Magoo's XML Split was the clear winner with the advantage that it worked equally well with VARCHAR(MAX) as with VARCHAR(8000). But I decided to make some modifications to the original DelimitedSplit8K function to allow two delimiters for a 2-dimensional split. The new function DelimitedSplit8K_2DIM does the 2-way split within the function rather than trying to use the original function (or CLR) with a cross apply.
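    For anyone unfamiliar with the XML technique, the general shape (not necessarily Mr. Magoo's exact code) is to convert the delimiters into tags and shred the result with .nodes(). Note that data containing &, < or > would need to be entitized before the CAST:

    ```sql
    -- General shape of an XML-based 2-dimensional split (illustrative only).
    DECLARE @STR VARCHAR(MAX) = '123,456|789,000';
    DECLARE @X XML = CAST('<row><pair><v>'
        + REPLACE(REPLACE(@STR, '|', '</v></pair><pair><v>'), ',', '</v><v>')
        + '</v></pair></row>' AS XML);

    SELECT  Item1 = pair.value('(v/text())[1]','VARCHAR(8000)')
           ,Item2 = pair.value('(v/text())[2]','VARCHAR(8000)')
    FROM    @X.nodes('/row/pair') AS t(pair);
    ```

    This is also why the XML approach handles VARCHAR(MAX) gracefully: there is no tally table whose size is bounded by the 8K limit.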

    Remarkably, this new function actually outperforms the CLR with smaller rows and smaller value pairs. Even at the larger end of the scale it is still very close to the performance of the CLR Split used with a CROSS APPLY. Like the original DelimitedSplit8K, the function gets increasingly less efficient with large value pairs and is, of course, limited to strings under the 8K limit. So for anything within these parameters I think this new variation has the best performance. If the data must be VARCHAR(MAX), or for larger data pairs, Mr. Magoo's XML Split method comes in a very close second.

    I've attached below all of my testing code (for most of which I must give credit to Jeff Moden) as well as a spreadsheet with the results of my own testing.

    Here's the new splitter function as I tested it...I'm sure someone may be able to improve it even more.

    CREATE FUNCTION [dbo].[DelimitedSplit8K_2DIM]
    (
         @pString     VARCHAR(8000)
        ,@pDelimiter1 CHAR(1)
        ,@pDelimiter2 CHAR(1) = NULL
    )
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    WITH E1(N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),                                   --10E+1 or 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS (
        SELECT 0 UNION ALL
        SELECT TOP (DATALENGTH(ISNULL(@pString,1)))
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
          FROM E4
    ),
    cteStart(N1) AS (
        SELECT t.N+1
          FROM cteTally t
         WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter1 OR t.N = 0)
    )
    SELECT
         ItemNumber
        ,Item1
        ,Item2 = REPLACE(Item2,Item1+@pDelimiter2,'')
    FROM (
        SELECT
             ItemNumber = ROW_NUMBER() OVER (ORDER BY s.N1)
            ,Item1 = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter2,@pString,s.N1),0)-s.N1,8000))
            ,Item2 = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter1,@pString,s.N1),0)-s.N1,8000))
          FROM cteStart s
    ) i1
    /*
    Example usage (note the GO between batches so @STR can be redeclared):

    DECLARE @STR VARCHAR(8000)
    SET @STR = '123,456|789,000'
    SELECT * FROM [dbo].[DelimitedSplit8K_2DIM](@STR,'|',',')
    GO
    DECLARE @STR VARCHAR(8000)
    SET @STR = '0WQDNw|aXxbzu,durPP|7yaFpK,UMERA5|FLN2G,HUdZv|QUQy5,3MbdqS|JWUgPp,F23jqp|kWbSBn,nSWunU|uh1zR,pqBJ4U|eNnZzE,jbu7R|cwd4E,1hNMC|Ru7ar'
    SELECT * FROM [dbo].[DelimitedSplit8K_2DIM](@STR,',','|')
    */

  • Steven Willis (12/28/2012)


    Here's the new splitter function as I tested it...I'm sure someone may be able to improve it even more.

    Very cool. Thanks for posting that code. Just remember that the second dimension can still only have 2 elements in this case.

    --Jeff Moden



  • Jeff Moden (12/28/2012)


    Steven Willis (12/28/2012)


    Here's the new splitter function as I tested it...I'm sure someone may be able to improve it even more.

    Very cool. Thanks for posting that code. Just remember that the second dimension can still only have 2 elements in this case.

    Yeah, while I was working on testing the new version I was thinking to myself how cool it would be if the function could allow any number of delimited values within "groups."

    For example:

    '12,34,56,78|12,78|34,56,78'

    or

    '12,34,56,78|12,78|34,56,78~56,67|34,23|67'

    I've had the misfortune to face such strings and splitting them can be a nightmare.
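    Lacking a function that handles variable group sizes, strings like those can still be unrolled by nesting one split per delimiter level, outermost delimiter first. A sketch using the standard DelimitedSplit8K:

    ```sql
    -- Sketch: one CROSS APPLY per delimiter level; the ItemNumber columns
    -- preserve each element's position at every level.
    DECLARE @STR VARCHAR(8000) = '12,34,56,78|12,78|34,56,78~56,67|34,23|67';

    SELECT  grp.ItemNumber  AS GroupNumber    -- split on '~' first
           ,lst.ItemNumber  AS ListNumber     -- then on '|'
           ,elem.ItemNumber AS ElementNumber  -- then on ','
           ,elem.Item
    FROM    dbo.DelimitedSplit8K(@STR,'~') grp
    CROSS APPLY dbo.DelimitedSplit8K(grp.Item,'|') lst
    CROSS APPLY dbo.DelimitedSplit8K(lst.Item,',') elem;
    ```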

    I had one string to split lately that I had no control over that looked like this (and this is a simplified version):

    '{TeamIDs:[1,2|3,4|5,6|7,8]}

    ,{Scores:[88,70|90,56|67,70|88,87|45,52|78,77|81,70]}

    ,{Dates:[12/22/2012|12/22/2012|12/22/2012|12/22/2012|12/29/2012|12/29/2012|12/31/2012]}'

    The scores correlated to the results of each team pair AND each subsequent round of the bracket. There were no keys except the order of the data...which meant I had to figure out the winner of each round based on the score pairs to get the value pairs of each round so I could create a scheduled event for each game. :crazy:

    Then to make it worse, once I had the scheduled events I had to write a query to create a similar string to pass back to the application so it could display the bracket and game results. I used a coalesce function to "reverse" split the results. Perhaps some testing could be done on some "reverse" split functions?
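    On the "reverse split" idea: the usual set-based concatenation trick on versions of that era is correlated FOR XML PATH wrapped in STUFF to strip the leading delimiter. A sketch, with a hypothetical table standing in for the scheduled events:

    ```sql
    -- Sketch: rebuild a '|'-delimited string of 'score1,score2' pairs.
    -- #Games(GameNumber INT, Score1 INT, Score2 INT) is a hypothetical table.
    SELECT STUFF(
           (SELECT '|' + CAST(g.Score1 AS VARCHAR(10)) + ',' + CAST(g.Score2 AS VARCHAR(10))
              FROM #Games g
             ORDER BY g.GameNumber
               FOR XML PATH('')
           ), 1, 1, '') AS ScoreString;
    ```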

     
