XML "CSV" Splitter

  • @dave-3,

    Yep... understood especially on the plan. And thanks for the code. I'll go though it tonight.

    I did make an entry on Brad's forum that there might be something wrong with his computer. I'm using 1/2 the amount of CPU, 1/10th the duration, and 1/1000 (approx) the number of reads that he is for the exact same code (his While Loop code).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @Wayne,

    Yep... I was getting similar results (didn't want to post my results because I didn't want to "taint" people's opinions).

    This is why I say I'm looking for a much better XML splitter. Peter's Split8KX_Inline XML function appears to be the one I was apparently looking for. Like I said, I've got some testing to do tonight.

    Thanks for the help, folks. I really appreciate it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • WayneS (9/13/2010)


    Test code:

    SET NOCOUNT ON;

    print replicate('*',30);

    Print 'Jeff''s function';

    print replicate('*',30);

    SET STATISTICS IO,TIME ON;

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.Split8KX(CsvParameter, ',');

    SET STATISTICS IO,TIME OFF;

    print replicate('*',30);

    print 'Inline table valued function';

    print replicate('*',30);

    SET STATISTICS IO,TIME ON;

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.Split8KX_Inline(CsvParameter, ',');

    SET STATISTICS IO,TIME OFF;

    print replicate('*',30);

    print 'DelimitedSplit8k table valued function';

    print replicate('*',30);

    SET STATISTICS IO,TIME ON;

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.DelimitedSplit8K(CsvParameter, ',');

    SET STATISTICS IO,TIME OFF;

    print replicate('*',30);

    print 'DelimitedSplit8k table valued function';

    print replicate('*',30);

    SET STATISTICS IO,TIME ON;

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.DelimitedSplit8K(CsvParameter, ',');

    SET STATISTICS IO,TIME OFF;

    I'm not sure the COUNT(*) test is the best way of measuring this.

    What do get if you measure the sort of query below which is actually shifting data?

    SELECT C.RowNum, Z.ItemNumber, Z.ItemValue

    INTO #split

    FROM #CsvTest AS C

    CROSS APPLY dbo.Split8KX_Inline(C.CsvParameter, ',') AS Z

  • steve-893342 (9/13/2010)


    WayneS (9/13/2010)


    Test code:

    SET NOCOUNT ON;

    print replicate('*',30);

    Print 'Jeff''s function';

    print replicate('*',30);

    SET STATISTICS IO,TIME ON;

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.Split8KX(CsvParameter, ',');

    SET STATISTICS IO,TIME OFF;

    print replicate('*',30);

    print 'Inline table valued function';

    print replicate('*',30);

    SET STATISTICS IO,TIME ON;

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.Split8KX_Inline(CsvParameter, ',');

    SET STATISTICS IO,TIME OFF;

    print replicate('*',30);

    print 'DelimitedSplit8k table valued function';

    print replicate('*',30);

    SET STATISTICS IO,TIME ON;

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.DelimitedSplit8K(CsvParameter, ',');

    SET STATISTICS IO,TIME OFF;

    print replicate('*',30);

    print 'DelimitedSplit8k table valued function';

    print replicate('*',30);

    SET STATISTICS IO,TIME ON;

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.DelimitedSplit8K(CsvParameter, ',');

    SET STATISTICS IO,TIME OFF;

    I'm not sure the COUNT(*) test is the best way of measuring this.

    What do get if you measure the sort of query below which is actually shifting data?

    SELECT C.RowNum, Z.ItemNumber, Z.ItemValue

    INTO #split

    FROM #CsvTest AS C

    CROSS APPLY dbo.Split8KX_Inline(C.CsvParameter, ',') AS Z

    You are absolutely right. I used count(*) to minimize the data transfer to SSMS while testing the functions and didn't look at the execution plans. I tested with your SELECT INTO variant and the performance of the inline version dropped dramatically compared to the table valued function. Quite surprising though.

    Peter

  • ---- EDIT bum steer



    Clear Sky SQL
    My Blog[/url]

  • Hi Jeff...

    Just wanted to quickly respond... I've been busy all day (mostly writing my T-SQL Tuesday Blog Post).

    Regarding the plethora of CROSS APPLYs in my code for the Tally Table: They don't make any difference... they might take an extra nanosecond or two to compile, but the query plan is exactly the same as not having the CROSS APPLYs. I put them there for readability so that the individual steps are evident. You can read more about the non-impact of CROSS APPLY at the beginning of this post: http://bradsruminations.blogspot.com/2009/07/cool-cross-apply-tricks-part-2.html

    Regarding XML Splitting: My roundup post that Dave referred to has the most optimized XML approach that I've come across, based on my own research and some input from others. You can read more detail here: http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html

    That being said, though, I haven't had a chance to look closer at Peter's Split8KX_Inline function yet. I am also intrigued by what you said regarding VARCHAR(MAX)... it might be interesting to see if changing my function to accept a VARCHAR(8000) instead makes a difference or not.

    Also, regarding my time measurements: I'll see if I can take another look, both on my current machine and an older one. FWIW, when I ran the tests, it was on AdventureWorks (not AdventureWorks2008) using SQL2008 (version 10.0.2531). Don't know if that makes a difference or not. Again, I'll try to re-run and see if anything different comes up.

    I'm looking forward to your article.

    --Brad

  • Can I ask why everyone is putting their code inside a UDF? Can't the code just be inlined as necessary?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (9/13/2010)


    Can I ask why everyone is putting their code inside a UDF? Can't the code just be inlined as necessary?

    The only explanation I can muster for that is code reuse and avoidance of copy and paste of the same code all over the database. It makes sense along those lines in case anything ever needs to be changed - then one of the pieces of code would not be missed.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Brad Schulz (9/13/2010)


    Hi Jeff...

    Just wanted to quickly respond... I've been busy all day (mostly writing my T-SQL Tuesday Blog Post).

    Regarding the plethora of CROSS APPLYs in my code for the Tally Table: They don't make any difference... they might take an extra nanosecond or two to compile, but the query plan is exactly the same as not having the CROSS APPLYs. I put them there for readability so that the individual steps are evident. You can read more about the non-impact of CROSS APPLY at the beginning of this post: http://bradsruminations.blogspot.com/2009/07/cool-cross-apply-tricks-part-2.html

    Regarding XML Splitting: My roundup post that Dave referred to has the most optimized XML approach that I've come across, based on my own research and some input from others. You can read more detail here: http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html

    That being said, though, I haven't had a chance to look closer at Peter's Split8KX_Inline function yet. I am also intrigued by what you said regarding VARCHAR(MAX)... it might be interesting to see if changing my function to accept a VARCHAR(8000) instead makes a difference or not.

    Also, regarding my time measurements: I'll see if I can take another look, both on my current machine and an older one. FWIW, when I ran the tests, it was on AdventureWorks (not AdventureWorks2008) using SQL2008 (version 10.0.2531). Don't know if that makes a difference or not. Again, I'll try to re-run and see if anything different comes up.

    I'm looking forward to your article.

    --Brad

    Thanks for the feedback, Brad.

    I did test data type changes with your While Loop. It doesn't seem to make difference whether MAX or not there (as we both would have assumed but needed to check). 🙂 I believe the only time it'll make much of an impact is when you're trying to do a join as with a numbers table like it did for another function I was writing. I haven't actually tested your code yet to see if it loses it on the datatype change. I did, however, check the cross apply's in your code... they actually DO make a surprise difference... they cut the duration of the code down to about a third for the task at hand when compared to a "traditional Tally Splitter". In other words, the cross apply's seem to make the code even more performant. Still doing some testing there, as well. The While Loop still won on my box because it takes less time to split 3 elements than it does to do the initial join on a Tally table. It'll be interesting to see how it fairs on 100 elements for 10k rows.

    Yeah... having a data mismatch on CHARINDEX vs the Tally table can make a huge difference (more than 2 minutes vs 1 second on some of the testing I've been doing) because it does force a cartesian join instead of allowing seeks to happen.

    So far as the article goes, you're correct... people have to be real careful about just taking folks word for something even when code is posted (kinda of like what you cited in your explanation about where you got the numbers solution from). The "subtractive" code you used for the While Loop is (as you inferred) old but has always been some of the fastest While Loop code there is for splitting. I'd seen it's slightly fatter brother (start/stop positional code) and was glad to come across your loop because I couldn't remember exactly how to do it with just the "start". I'm going to run it through some tests along with your "accidental" nasty fast cross apply code. My first impression was that it was going to be a bit slower, but nay! It's actually very fast (at least on my humble desktop). That's why I was surprised to see you say it was slow. Did you actually measure the performance or are you just going by the execution plan (which I've found can lie like a rug when it comes to estimates) and Dave seems to think?

    I'm still running 2k5 sp3 with no CU's... I hope they haven't messed with the optimizer in 2k8 too badly. It sounds like they may have because While Loops are actually rock steady yet it's taking almost 10 times longer on your box than on mine and I'd just about bet credits to Navy beans that the machine you're running on is better than mine.

    Ah...almost forgot. If you want it, feel free to help yourself to the test data generation code I wrote. It's a bit slow (a whopping 8 seconds on my 1.8GHz P4 to create 100 elements each in 10,000 rows for a total of only a million elements) because of all the stuff it has to do but it makes for some really good testing because the commas are all over the place vertically.

    Last but not least... with everything else going on, I forgot to even look at your XML code which was the original purpose of this code. I'll give it a whirl. Thanks, Brad.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TheSQLGuru (9/13/2010)


    Can I ask why everyone is putting their code inside a UDF? Can't the code just be inlined as necessary?

    Actually, the testing I'm doing works against both methods (UDF and non-UDF). I showed a function in hopes that someone would come up with an iTVF which also makes some pretty good code for regular non-UDF code, as well.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Dave Ballantyne (9/13/2010)


    ---- EDIT bum steer

    Heh... "Mad cow"? 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TheSQLGuru (9/13/2010)


    Can I ask why everyone is putting their code inside a UDF? Can't the code just be inlined as necessary?

    Not sure, I almost invariably bring my splitter code in line and customise to requirements

  • Jeff Moden (9/13/2010)


    Dave Ballantyne (9/13/2010)


    ---- EDIT bum steer

    Heh... "Mad cow"? 😀

    More like , hit post , oops thats all nonsense 🙂 Hopefully i deleted before google cached it 😀

    Another factor that should not be discounted is that using a scalar udf will mean that parallelism can not be used.



    Clear Sky SQL
    My Blog[/url]

  • I left some notes on Brad's forum that Dave provided a link to. I can't (yet) figure out what he may have done differently than me because I'm sure not seeing the same level of disparity that he is. We've seen this type of behaviour when different collations are set but I don't know if that's it or not until Brad gets back to me with what his collation settings are.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I can't even get things to slow down by changing to some "unusual" collations when building the data table. Again, I don't know why Brad is getting such a huge disparity when I'm not.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 44 total)

You must be logged in to reply to this topic. Login to reply