XML "CSV" Splitter

  • Sorry, Jeff, I had a deadline yesterday that kept me busy.

    I'll re-do my tests and see what I come up with.

    --Brad

  • Jeff,

    Here's the key to the mystery as to why you and I get differing results.

    All of my tests included compile time also, because I did a DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before each test.

    If I get rid of those and re-do the time test (on the WHILE LOOP approach), I get the same results as you do... the Duration is pretty much equal to the CPU... 31ms or less.

    Sorry for the confusion.

    --Brad

  • Brad Schulz (9/16/2010)


    Jeff,

    Here's the key to the mystery as to why you and I get differing results.

    All of my tests included compile time also, because I did a DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before each test.

    If I get rid of those and re-do the time test (on the WHILE LOOP approach), I get the same results as you do... the Duration is pretty much equal to the CPU... 31ms or less.

    Sorry for the confusion.

    --Brad

    Ah... got it. And in a real environment, you'd let each of these cache so you'd realize the 31 ms.

    Do you also get the same results if you let the numbers table solution cache?

    And, understood on the "busy" thing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Peter Brinkhaus (9/13/2010)


    I am not an XML Ninja either, but you can try a inline table valued function

    CREATE FUNCTION dbo.Split8KX_Inline(@Parameter VARCHAR(MAX), @Delimiter VARCHAR(1))

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    SELECT

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ItemNumber,

    XC.R.query('.').value('/r[1]', 'varchar(8000)') ItemValue

    FROM

    (

    SELECT CAST('<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>' AS XML)

    ) X(P)

    CROSS APPLY

    P.nodes('/r') XC(R)

    ;

    I have run the following code against your test data

    set statistics time on

    -- Print Jeff's

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.Split8KX(CsvParameter, ',')

    set statistics io off

    set statistics time off

    and

    set statistics io on

    set statistics time on

    print 'Inline table valued function'

    select

    count(*)

    from

    CsvTest

    cross apply

    dbo.Split8KX_Inline(CsvParameter, ',')

    set statistics io off

    set statistics time off

    Here are the test results

    (1 row(s) affected)

    Table '#060DEAE8'. Scan count 10000, logical reads 10000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'CsvTest'. Scan count 1, logical reads 774, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 40966 ms, elapsed time = 42803 ms.

    and

    (1 row(s) affected)

    Table 'CsvTest'. Scan count 5, logical reads 849, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 8846 ms, elapsed time = 2268 ms.

    Edit: as you can see from the execution times, the inline table valued function also heavily benefits from the 4 core CPU in my PC.

    Peter

    I'm finally going to get to test this. I'll let you know the results I get. Thanks again for the code and for the time. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff:

    Using the Numbers table approach (using the code I had in my blog) with the procedure in cache results in the following:

    CPU = 6486

    Reads = 244155

    Duration = 1887ms (avg)

    The Numbers table (in SQL2008) assumes 30% selectivity (i.e. 300,000 rows) and so it has hash joins and parallelism and is just a downright inefficient plan.

    Now here's the interesting thing...

    If I do the same thing in SQL2005, it's also a lousy-looking plan (hash joins and parallelism, etc), but the plan has a completely different construction... and it performs waaaaay better:

    CPU = 71

    Reads = 805

    Duration = 44

    So SQL2008 is a complete dog in using the Numbers table approach.

    I'll have to see if I can scrape together the plans so you can see the difference. I have to run home right now, but perhaps tomorrow I can post pictures or I'll put .SQLPLAN files on the web for you to look at if you're interested.

    --Brad

  • Wow... I guess my machine doesn't like XML so much. I stopped the test after 10 minutes on a 10K by 100 run.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/16/2010)


    Wow... I guess my machine doesn't like XML so much. I stopped the test after 10 minutes on a 10K by 100 run.

    Like Steve already pointed out, the SELECT COUNT(*) I tested the performance with, was a very wrong test. I improved the function by replacing the line

    XC.R.query('.').value('/r[1]', 'varchar(8000)') ItemValue

    with

    XC.R.value('text()[1]', 'varchar(8000)') ItemValue

    which gives an improvement of about 30%. Still, the inline version is horrible. I don't see much difference with your function but it is really much faster then the inline version.

    Peter

  • Yep... sorry, Peter. I meant to say that I tested that method, as well. It's not nearly as bad as the original dot version but, you're correct, it's still pretty bad. It came up at 25 minutes on my machine. And, yeah... I actually am setting up a new machine (64bit 2 CPU laptop this time).

    I've managed to collect and test a whole gambit of splitters and I'll include them in the article modernization along with their performance numbers including some of these XML splitters.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff:

    Did you read my blog post about XML that I mentioned earlier?:

    http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html

    I had the same loooonnnnng performance issues as you did until someone suggested I add a .query('.') to the derived table.

    In other words, instead of

    ... FROM (CAST( ... AS XML)) X(P)

    change it to this:

    ... FROM (CAST( ... AS XML).query('.')) X(P)

    It makes a world of difference.

    and change the value() to:

    .value('(./text())[1]','varchar(8000)')

    Again, check out the blog article, because it goes into all of this and explains WHY it works better.

    --Brad

  • To Reply to Kevin on why the UDF rather the in line code:

    I recently used this type of iTVF to build a keyword list from item descriptions from order detail records (the items were mostly bottles of wine) where I would disregard certain parts of the descriptions, such as commas and ampersands. The function created the whole keyword table of items and keywords (80,000 + rows) in under 2 seconds by cross applying it with the item table. Performance was nice, but the real reason for the UDF was to exactly duplicate the logic on which keywords were valid.

    In this application the user would enter some keywords separated by a space and look for orders they had placed containing all of the keywords. This whole string was passed to a stored procedure as the search criteria. I used the same iTVF to parse the input and create a temp table for the keywords. Since the logic was encapsulated in the same function used to create the keyword/items table I never had to worry about missing a keyword on either end and the search always worked.

    Todd Fifield

  • Brad Schulz (9/17/2010)


    Jeff:

    Did you read my blog post about XML that I mentioned earlier?:

    http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html

    I had the same loooonnnnng performance issues as you did until someone suggested I add a .query('.') to the derived table.

    In other words, instead of

    ... FROM (CAST( ... AS XML)) X(P)

    change it to this:

    ... FROM (CAST( ... AS XML).query('.')) X(P)

    It makes a world of difference.

    and change the value() to:

    .value('(./text())[1]','varchar(8000)')

    Again, check out the blog article, because it goes into all of this and explains WHY it works better.

    --Brad

    I'll check it out, Brad. And, yep... I was already aware of the text()[1] thingy but it's nice to have confirmation.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hey folks... especially BRAD... would you mind participating in a little test at the following URL please? Thanks.

    http://www.sqlservercentral.com/Forums/Topic988784-391-1.aspx

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hmm. How'd I miss this thread? :w00t:

  • Paul White NZ (9/18/2010)


    Hmm. How'd I miss this thread? :w00t:

    Because of being busy writing those great optimizer articles?

  • Peter Brinkhaus (9/18/2010)


    Paul White NZ (9/18/2010)


    Hmm. How'd I miss this thread? :w00t:

    Because of being busy writing those great optimizer articles?

    :blush:

Viewing 15 posts - 31 through 44 (of 44 total)

You must be logged in to reply to this topic. Login to reply