XML "CSV" Splitter

Question

Post reply

XML "CSV" Splitter

Viewing 15 posts - 31 through 44 (of 44 total)

You must be logged in to reply to this topic. Login to reply

Brad Schulz Old Hand Points: 395 More actions · Answer 1

Sorry, Jeff, I had a deadline yesterday that kept me busy.

I'll re-do my tests and see what I come up with.

--Brad

Brad Schulz Old Hand Points: 395 More actions · Answer 2

Jeff,

Here's the key to the mystery as to why you and I get differing results.

All of my tests included compile time also, because I did a DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before each test.

If I get rid of those and re-do the time test (on the WHILE LOOP approach), I get the same results as you do... the Duration is pretty much equal to the CPU... 31ms or less.

Sorry for the confusion.

--Brad

Jeff Moden SSC Guru Points: 1004439 More actions · Answer 3

Brad Schulz (9/16/2010)
Jeff,
Here's the key to the mystery as to why you and I get differing results.
All of my tests included compile time also, because I did a DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before each test.
If I get rid of those and re-do the time test (on the WHILE LOOP approach), I get the same results as you do... the Duration is pretty much equal to the CPU... 31ms or less.
Sorry for the confusion.
--Brad

Ah... got it. And in a real environment, you'd let each of these cache so you'd realize the 31 ms.

Do you also get the same results if you let the numbers table solution cache?

And, understood on the "busy" thing.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004439 More actions · Answer 4

Peter Brinkhaus (9/13/2010)
I am not an XML Ninja either, but you can try a inline table valued function
CREATE FUNCTION dbo.Split8KX_Inline(@Parameter VARCHAR(MAX), @Delimiter VARCHAR(1))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ItemNumber,
XC.R.query('.').value('/r[1]', 'varchar(8000)') ItemValue
FROM
(
SELECT CAST('<r>'+REPLACE(@Parameter, @Delimiter, '</r><r>')+'</r>' AS XML)
) X(P)
CROSS APPLY
P.nodes('/r') XC(R)
;
I have run the following code against your test data
set statistics time on
-- Print Jeff's
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX(CsvParameter, ',')
set statistics io off
set statistics time off
and
set statistics io on
set statistics time on
print 'Inline table valued function'
select
count(*)
from
CsvTest
cross apply
dbo.Split8KX_Inline(CsvParameter, ',')
set statistics io off
set statistics time off
Here are the test results
(1 row(s) affected)
Table '#060DEAE8'. Scan count 10000, logical reads 10000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'CsvTest'. Scan count 1, logical reads 774, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 40966 ms, elapsed time = 42803 ms.
and
(1 row(s) affected)
Table 'CsvTest'. Scan count 5, logical reads 849, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 8846 ms, elapsed time = 2268 ms.
Edit: as you can see from the execution times, the inline table valued function also heavily benefits from the 4 core CPU in my PC.
Peter

I'm finally going to get to test this. I'll let you know the results I get. Thanks again for the code and for the time. 🙂

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Brad Schulz Old Hand Points: 395 More actions · Answer 5

Jeff:

Using the Numbers table approach (using the code I had in my blog) with the procedure in cache results in the following:

CPU = 6486

Reads = 244155

Duration = 1887ms (avg)

The Numbers table (in SQL2008) assumes 30% selectivity (i.e. 300,000 rows) and so it has hash joins and parallelism and is just a downright inefficient plan.

Now here's the interesting thing...

If I do the same thing in SQL2005, it's also a lousy-looking plan (hash joins and parallelism, etc), but the plan has a completely different construction... and it performs waaaaay better:

CPU = 71

Reads = 805

Duration = 44

So SQL2008 is a complete dog in using the Numbers table approach.

I'll have to see if I can scrape together the plans so you can see the difference. I have to run home right now, but perhaps tomorrow I can post pictures or I'll put .SQLPLAN files on the web for you to look at if you're interested.

--Brad

Jeff Moden SSC Guru Points: 1004439 More actions · Answer 6

Wow... I guess my machine doesn't like XML so much. I stopped the test after 10 minutes on a 10K by 100 run.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

matzer SSCrazy Eights Points: 9727 More actions · Answer 7

Jeff Moden (9/16/2010)
Wow... I guess my machine doesn't like XML so much. I stopped the test after 10 minutes on a 10K by 100 run.

Like Steve already pointed out, the SELECT COUNT(*) I tested the performance with, was a very wrong test. I improved the function by replacing the line

XC.R.query('.').value('/r[1]', 'varchar(8000)') ItemValue

with

XC.R.value('text()[1]', 'varchar(8000)') ItemValue

which gives an improvement of about 30%. Still, the inline version is horrible. I don't see much difference with your function but it is really much faster then the inline version.

Peter

Jeff Moden SSC Guru Points: 1004439 More actions · Answer 8

Yep... sorry, Peter. I meant to say that I tested that method, as well. It's not nearly as bad as the original dot version but, you're correct, it's still pretty bad. It came up at 25 minutes on my machine. And, yeah... I actually am setting up a new machine (64bit 2 CPU laptop this time).

I've managed to collect and test a whole gambit of splitters and I'll include them in the article modernization along with their performance numbers including some of these XML splitters.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Brad Schulz Old Hand Points: 395 More actions · Answer 9

Jeff:

Did you read my blog post about XML that I mentioned earlier?:

http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html

I had the same loooonnnnng performance issues as you did until someone suggested I add a .query('.') to the derived table.

In other words, instead of

... FROM (CAST( ... AS XML)) X(P)

change it to this:

... FROM (CAST( ... AS XML).query('.')) X(P)

It makes a world of difference.

and change the value() to:

.value('(./text())[1]','varchar(8000)')

Again, check out the blog article, because it goes into all of this and explains WHY it works better.

--Brad

tfifield SSCrazy Eights Points: 9655 More actions · Answer 10

To Reply to Kevin on why the UDF rather the in line code:

I recently used this type of iTVF to build a keyword list from item descriptions from order detail records (the items were mostly bottles of wine) where I would disregard certain parts of the descriptions, such as commas and ampersands. The function created the whole keyword table of items and keywords (80,000 + rows) in under 2 seconds by cross applying it with the item table. Performance was nice, but the real reason for the UDF was to exactly duplicate the logic on which keywords were valid.

In this application the user would enter some keywords separated by a space and look for orders they had placed containing all of the keywords. This whole string was passed to a stored procedure as the search criteria. I used the same iTVF to parse the input and create a temp table for the keywords. Since the logic was encapsulated in the same function used to create the keyword/items table I never had to worry about missing a keyword on either end and the search always worked.

Todd Fifield

Jeff Moden SSC Guru Points: 1004439 More actions · Answer 11

Brad Schulz (9/17/2010)
Jeff:
Did you read my blog post about XML that I mentioned earlier?:
http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html
I had the same loooonnnnng performance issues as you did until someone suggested I add a .query('.') to the derived table.
In other words, instead of
... FROM (CAST( ... AS XML)) X(P)
change it to this:
... FROM (CAST( ... AS XML).query('.')) X(P)
It makes a world of difference.
and change the value() to:
.value('(./text())[1]','varchar(8000)')
Again, check out the blog article, because it goes into all of this and explains WHY it works better.
--Brad

I'll check it out, Brad. And, yep... I was already aware of the text()[1] thingy but it's nice to have confirmation.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004439 More actions · Answer 12

Hey folks... especially BRAD... would you mind participating in a little test at the following URL please? Thanks.

http://www.sqlservercentral.com/Forums/Topic988784-391-1.aspx

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150468 More actions · Answer 13

Hmm. How'd I miss this thread? :w00t:

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

matzer SSCrazy Eights Points: 9727 More actions · Answer 14

Paul White NZ (9/18/2010)
Hmm. How'd I miss this thread? :w00t:

Because of being busy writing those great optimizer articles?

Paul White SSC Guru Points: 150468 More actions · Answer 15

Peter Brinkhaus (9/18/2010)
Paul White NZ (9/18/2010)
Hmm. How'd I miss this thread? :w00t:
Because of being busy writing those great optimizer articles?

:blush:

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi