April 25, 2009 at 7:48 pm
Phil Factor (4/25/2009)
Most data feeds are a bit more regular, I'll admit, but then it is great for testing text-parsing routines. I've had to parse larger strings than this before now (e.g. a daily feed from a telecommunications switch), though I've usually used BCP to do the line-splitting for me. More recently, I've been interested in doing this sort of thing with website files for fast indexing.
Phil,
The graphs are interesting, but run somewhat contrary to other test results in this thread. I would once again invite you to clarify the exact code for the Quirky and While methods, together with the core code for the test rig. It is frustrating not to be able to validate your findings for myself.
As it stands, I am forming the opinion that Flo's CLR-with-chars method or the basic CLR TVF is the best general solution, as each performed best or second best in all of the tests he ran, including Moby Dick.
A quick post from you containing the exact code for Quirky and While would be great.
Paul
edit: P.S. I had to import 3TB of web log files last week. I have always found Microsoft LogParser to be a good solution - no point in re-inventing the wheel, eh? 😉
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 26, 2009 at 4:10 am
Paul White (4/25/2009)
By the way, reading the whole data column (all rows at once) inside the CLR TVF works very well. Splitting that data also works well. However, writing the results to the database down the SqlPipe is horrible.
*affirmative*
I really don't know what they do when they try to get the data back to the server.
It's a shame, because the plan was to read the CSV data into a streaming SqlDataReader, writing the results back down the context connection pipe through the streaming CLR TVF interface.
The problem is that one is only allowed to do data operations in the InitMethod part of the CLR function. If you try to create a SqlDataReader in Init and then read it on the streaming side (while being called through IEnumerator) the SqlDataReader is closed and all attempts to access data fail. Annoying.
It would have been great because one could just write:
SELECT * FROM dbo.My_Streaming_TVF ('dbo.DataTable', 'KeyColumnName', 'DataColumnName')
and you would get all the results at once, with just one call through reflection (or whatever it uses to bind to the IEnumerator), and both reading and writing would have been streaming, and therefore very scalable. Oh well, never mind.
I can also completely confirm this! Since it is not possible to use input streaming and output streaming at the same time, you always have to ensure that the amount of data is not too much to be handled in one step. It would be great if this became possible in a future version. SQL Server 2008 also doesn't support...
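To illustrate the workaround Flo describes, here is a minimal sketch of batching by key range, so that no single call has to buffer the whole result. All the names (dbo.SplitCLR, dbo.SourceTable, dbo.SplitResults) are hypothetical stand-ins, not code from this thread:
[font="Courier New"]-- Hypothetical batching sketch: dbo.SplitCLR, dbo.SourceTable and
-- dbo.SplitResults are stand-ins, not objects from this thread
DECLARE @FromID INT, @BatchSize INT, @MaxID INT
SELECT  @FromID = 1, @BatchSize = 1000
SELECT  @MaxID = MAX(SourceTable_ID) FROM dbo.SourceTable

WHILE @FromID <= @MaxID
BEGIN
    -- each iteration splits only one key range, keeping the amount of
    -- data per call bounded
    INSERT INTO dbo.SplitResults (SourceTable_ID, TheLine)
    SELECT s.SourceTable_ID, f.TheLine
    FROM   dbo.SourceTable AS s
    CROSS APPLY dbo.SplitCLR(s.TheText) AS f
    WHERE  s.SourceTable_ID BETWEEN @FromID AND @FromID + @BatchSize - 1

    SELECT @FromID = @FromID + @BatchSize
END[/font]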
Greets
Flo
April 26, 2009 at 4:13 am
[p]I've already posted the Quirky and While solutions, but I put the harness in just to make the post more interesting to others.[/p]
[font="Courier New"]ALTER FUNCTION [dbo].[splitStringToTable](@String NVARCHAR(MAX), @Delimiter VARCHAR(255))
/*
This splits a string based on a delimiting substring. It returns a table of the resulting strings, and the locations within the original string.
This technique uses the 'Quirky Update' on a table-update. A table is created with the correct number of delimiters within the string, and is then updated to contain the locations of all the delimiters, and the actual text between the delimiters.
I suspect that there is an improvement to be made somewhere as it should be as fast as the WHILE loop.
*/
RETURNS @strings TABLE
(
line_no INT IDENTITY(1,1),
theIndex INT DEFAULT 0,
previousIndex INT DEFAULT 0,
TheLine VARCHAR(MAX) DEFAULT '')
AS
-- body of the function
BEGIN
DECLARE @TheIndex INT, @Previousindex INT,@LenDelimiter INT
SELECT @TheIndex=-1, @LenDelimiter=LEN(@Delimiter)
--create the string table unfilled but the right length
INSERT INTO @strings(TheIndex)
SELECT 0 FROM numbers
WHERE number <=(LEN(@String)
-LEN(REPLACE(@String,@Delimiter,'')))/LEN(@Delimiter)+1
--and populate the table
UPDATE @strings
SET @PreviousIndex=PreviousIndex=@TheIndex,
@TheIndex=theIndex=CASE WHEN @PreviousIndex<>0
THEN CHARINDEX(@Delimiter,@String,@PreviousIndex+1)ELSE 0 END,
TheLine=CASE WHEN @TheIndex>0 THEN
SUBSTRING(@String,@PreviousIndex+@LenDelimiter,
@TheIndex-@previousindex-@LenDelimiter)
WHEN @PreviousIndex>0 THEN
SUBSTRING(@String,@PreviousIndex+@LenDelimiter,
LEN(@String)-@previousindex)
ELSE '' END
RETURN
END
[/font]
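Note that the function above relies on a pre-built Numbers (Tally) table called numbers with a column number, which isn't shown in the post. For anyone following along, a minimal sketch for building one - the name, column and row count are my assumptions, so adjust to match your own setup:
[font="Courier New"]-- Minimal one-off Numbers table (name, column and size assumed; it must
-- cover the largest number of lines you expect to split out)
SELECT TOP (1000000)
       number = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
INTO   dbo.numbers
FROM   sys.all_columns AS a
CROSS JOIN sys.all_columns AS b

CREATE UNIQUE CLUSTERED INDEX CIX_numbers ON dbo.numbers (number)[/font]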
[p] ...and the while loop...[/p]
[font="Courier New"]GO
/*
This splits a string based on a delimiting substring. It returns a table of the resulting strings, and the locations within the original string.
This technique uses the 'While Loop'.
*/
ALTER FUNCTION [dbo].[WhileSplit](@String NVARCHAR(MAX), @Delimiter VARCHAR(255))
RETURNS @strings TABLE
(
line_no INT IDENTITY(1,1),
theIndex INT DEFAULT 0,
previousIndex INT DEFAULT 0,
TheLine VARCHAR(MAX) DEFAULT '')
AS
-- body of the function
BEGIN
DECLARE @pos INT, @next INT, @LenDelimiter INT
SELECT @pos = 1, @lenDelimiter=LEN(@Delimiter),
@next = CHARINDEX(@Delimiter, @String, @pos)
WHILE (@next > 0)
BEGIN
INSERT INTO @Strings (PreviousIndex,TheIndex,TheLine)
SELECT @pos, @next, SUBSTRING(@String, @pos, @next - @pos)
SELECT @pos = @next + @lenDelimiter,
@next = CHARINDEX(@Delimiter, @String, @pos)
END
INSERT INTO @Strings (PreviousIndex,TheIndex,TheLine)
SELECT @pos, @next, SUBSTRING(@String, @pos, LEN(@STRING)-@POS+1)
RETURN
END
[/font]
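A quick smoke test of both functions, splitting on a line break as the harness below does (the CHAR(13)+CHAR(10) delimiter is my assumption about the line endings in play):
[font="Courier New"]-- Quick smoke test of both splitters; CHAR(13)+CHAR(10) is assumed as
-- the line delimiter, matching the harness below
DECLARE @CRLF VARCHAR(2)
SELECT  @CRLF = CHAR(13) + CHAR(10)

SELECT line_no, TheLine
FROM   dbo.splitStringToTable('first' + @CRLF + 'second' + @CRLF + 'third', @CRLF)

SELECT line_no, TheLine
FROM   dbo.WhileSplit('first' + @CRLF + 'second' + @CRLF + 'third', @CRLF)[/font]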
[p] ...and the current test harness (this is something I keep fiddling with).[/p]
[font="Courier New"]--create a log table to record timings and results
DECLARE @log TABLE
(
Log_ID INT IDENTITY(1, 1),
Event VARCHAR(40),
param1 INT, param2 INT,
InsertionDate DATETIME DEFAULT GETDATE()
)
--various variables
DECLARE @LotsOfText VARCHAR(MAX),
@TestText VARCHAR(MAX),
@Test VARCHAR(80),
@ii INT,
@lentext INT,
@lenString INT,
@linesFound INT
--flag the start of the test run in the log
INSERT INTO @log (event,param1,param2) SELECT 'Start Test run', NULL, NULL
--have we a data table? if not then create it
IF NOT EXISTS (SELECT 1
FROM information_schema.tables
WHERE table_name='SampleText')
CREATE TABLE SampleText(SampleText_ID INT IDENTITY(1,1),
TheText VARCHAR(MAX))
--have we data in the table. If not then read it in.
IF NOT EXISTS (SELECT 1 FROM sampleText)
BEGIN
SELECT @LotsOfText = BulkColumn
-- (2005/8 only for this code)
FROM OPENROWSET(BULK 'c:\workbench\moby-dick.txt',
SINGLE_BLOB) AS x
INSERT INTO SampleText(TheText) SELECT @lotsOfText
END
SELECT @LotsOfText=TheText FROM SampleText WHERE SampleText_ID=1
SELECT @ii=10 --10=100%, 1=10%. Start with all, then decrease.
SELECT @LenString=LEN(@LotsOfText) --for calculating the chop of the data.
WHILE @ii>0
BEGIN
--start by creating the size of string for the test
SELECT @TestText =
CASE @ii WHEN 10 THEN @LotsOfText
ELSE SUBSTRING( @LotsOfText,1, CAST((@lenString*(@ii/10.00)) AS INT))
END
SELECT @LenText=LEN(@TestText)
DECLARE @bucket TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*), @Test='Tally'
FROM dbo.SplitTally(@TestText)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket (line) SELECT line
FROM dbo.SplitTally(@TestText)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
DECLARE @bucket1 TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*), @Test='Regex Split'
FROM dbo.RegExSplit (dbo.regexescape('
'),@TestText,1)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket1 (line) SELECT MATCH
FROM dbo.RegExSplit (dbo.regexescape('
'),@TestText,1)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
DECLARE @bucket2 TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*) , @Test='Quirky Update'
FROM dbo.splitStringToTable(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket2 (line) SELECT Theline
FROM dbo.splitStringToTable(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
DECLARE @bucket3 TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*) , @Test='CLR Split'
FROM dbo.DanielJ(@TestText)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket3 (line) SELECT match
FROM dbo.DanielJ(@TestText)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
DECLARE @bucket4 TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*) , @Test='WHILE Split'
FROM WhileSplit(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket4 (line) SELECT theLine
FROM WhileSplit(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
SELECT @LinesFound=COUNT(*) , @Test='Double-barrel tally'
FROM DoubleBarreledTally(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket4 (line) SELECT theLine
FROM DoubleBarreledTally(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
SELECT @ii=@ii-1
END
--Now see how long it took...
SELECT event,
[length of string] = param1,
[No. Lines] =param2,
[Time (Ms)] = DATEDIFF(ms,
(SELECT [InsertionDate]
FROM @log f
WHERE f.Log_ID = g.Log_ID - 1
),
[InsertionDate]
)
FROM @log g
WHERE Log_ID >1
GO
[/font]
Best wishes,
Phil Factor
April 26, 2009 at 4:17 am
Hi Phil
These results would be great for cursors! As Paul already wrote, could you point me to the "Quirky Update" and "While Loop" code? Then I can include them in the other tests.
Thanks a lot!
Flo
April 26, 2009 at 4:20 am
Paul White (4/25/2009)
- no point in re-inventing the wheel, eh? 😉
That's one of my boss's hobbies. We have our own string, our own DateTime, List, DB connection, even our own File... :hehe:
April 26, 2009 at 4:23 am
Phil Factor (4/26/2009)
I've already posted the Quirky and While solutions, but I put the harness in just to make the post more interesting to others.
Thanks! I will include it in my other tests when I'm back!
... now it's time for barbecue 🙂
Greets
Flo
April 26, 2009 at 5:29 am
A quick post before bed concerning the TVFs posted by Phil.
Unless I am missing something obvious, this solution is heavily dependent on the TVF only being called once.
This is achieved in the script by reading the whole of Moby Dick into a NVARCHAR(MAX) variable.
Changing its data types to VARCHAR(8000) on the way in and VARCHAR(100) on the way out, and running it against the first 200 rows of Jeff's test table:
ALTER FUNCTION [dbo].[WhileSplit](@String VARCHAR(8000), @Delimiter VARCHAR(255))
*snip*
TheLine VARCHAR(100) DEFAULT '')
*snip*
GO
SELECT WS.line_no, WS.theLine
INTO   dbo.Result
FROM   (SELECT TOP 200 * FROM dbo.JBMTest) AS J
CROSS APPLY dbo.WhileSplit(J.CSV, ',') AS WS
Table 'Worktable'. Scan count 1, logical reads 363973, physical reads 566, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#0BC6C43E'. Scan count 200, logical reads 800, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'JBMTest'. Scan count 1, logical reads 203, physical reads 152, read-ahead reads 479, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(160200 row(s) affected)
SQL Server Execution Times:
CPU time = 141484 ms, elapsed time = 189769 ms.
I ran it on the first 200 rows only because of time constraints. As it is, it took over three minutes...
The actual and estimated execution plans are attached. The actual plan doesn't show the inner execution of the TVF so that's why the estimated one is there.
Cheers,
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 26, 2009 at 5:31 am
Florian Reischl (4/26/2009)
Paul White (4/25/2009)
- no point in re-inventing the wheel, eh? 😉
That's one of my boss's hobbies. We have our own string, our own DateTime, List, DB connection, even our own File... :hehe:
I have no words to adequately convey my feelings on that :crazy:
:sigh:
You have my sympathy!
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 26, 2009 at 7:24 am
Hey Paul, I followed your link about the new flag. My son has lived in Dunedin for years now, but never mentioned this to me. As an American, I can cast no vote, but personally I favor the Silver Fern. 🙂
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 26, 2009 at 9:49 am
Hi Phil
I just ran your exact script on my system. I only removed the Tally version because it is definitely the wrong choice for this requirement.
The chart is attached. Hope it shows up here...
I don't know why, but there seems to be a huge difference in execution between our systems.
Edit: Please forgive the lousy formatting. I'm no Excel guru...
Greets
Flo
April 26, 2009 at 11:50 am
Wow... Just looked back at some of the graphs... I've gotta find out which While Loop and Tally table methods Phil used to come up with his graphs, because those numbers just seem really odd to me. Also, if I recall correctly, when Phil and I did some testing on Moby Dick a while back, the while loop won on his machine while the Tally table won on mine.
Heh... The "split" function is certainly turning out to be a huge "it depends". Depends on the machine, depends on delimiter density, depends on delimiter randomness between rows, etc, etc, etc. Flo, I don't envy you the task of writing an article about all of this.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 26, 2009 at 3:03 pm
Bob Hovious (4/26/2009)
Hey Paul, I followed your link about the new flag. My son has lived in Dunedin for years now, but never mentioned this to me. As an American, I can cast no vote, but personally I favor the Silver Fern. 🙂
Yeah, me too - I had it as my avatar for a while, but figured I should spread the love a little.
The Silver Fern would make one cool flag - the only slight downside would be the licensing costs to the All Blacks...
I think the 'new flag' guys welcome support from whatever quarter, so thanks for that.
@Flo
Regex is pretty awesome for long strings eh?
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 26, 2009 at 3:38 pm
[font="Verdana"]I like the southern cross, and I like the silver fern and I like the kiwi. They're all our national symbols. I don't see why the flag need change (unless we become a republic, which at the moment doesn't seem to be a serious thing.)
Now, back to the excellent code!
I did have one thought, although I don't see it as practical in this case. When I was at university we did performance analysis on sorting algorithms (which I believe is a classic exercise). We noticed that for small input even a bubble sort would beat a quicksort, but the bubble sort soon fell behind as the input size increased. However, a quicksort works by chunking the input, and as the chunk sizes got smaller you could swap sorting algorithms to get the best performance.
Would it be possible to use a hybrid solution here? I don't think so, because it appears that the result can't be "chunked" (or partitioned) in any way. It also seems that the two CLR solutions are still performing well. (The nearest thing I can picture is sketched below.)
[/font]
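If "hybrid" just means picking a splitter per input rather than chunking one result, a dispatcher is about the only shape it could take. A rough sketch, reusing the function names from Phil's harness (dbo.SplitTally is assumed to exist with the signature and line column used there; the 8000-character crossover is a guess, not a measured threshold):
[font="Courier New"]-- Rough 'hybrid' dispatcher sketch: dbo.SplitTally is assumed to exist
-- as called in Phil's harness; the 8000-character crossover is a guess
CREATE FUNCTION dbo.HybridSplit (@String VARCHAR(MAX), @Delimiter VARCHAR(255))
RETURNS @strings TABLE (line_no INT IDENTITY(1,1), TheLine VARCHAR(MAX))
AS
BEGIN
    IF LEN(@String) <= 8000
        -- short input: the Tally approach tends to do well here
        INSERT INTO @strings (TheLine)
        SELECT line FROM dbo.SplitTally(@String)
    ELSE
        -- long input: fall back to the WHILE loop splitter
        INSERT INTO @strings (TheLine)
        SELECT TheLine FROM dbo.WhileSplit(@String, @Delimiter)
    RETURN
END[/font]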
April 26, 2009 at 4:17 pm
Hmmm... I'm really scratching my head now wondering about the differences between the two test runs, Flo. At least we can agree that the Tally doesn't cut it when the delimiters are widely spaced. On my system, both of the regex solutions (yours and mine) produced identically dire performance, so bad that I took them out of the test because they made the test runs take far too long.
If we can find out what makes the CLR solutions run so fast on your tests, we ought to bottle it and sell it to everyone. There is a lot of work to do.
In answer to Paul's comments about the cross apply, it doesn't surprise me that the CROSS APPLY runs like a fly in syrup. I have Quirky and WHILE loop solutions to Flo's original problem, but I got a bit distracted when I became interested in the relative performance of various TVFs on large strings. I'll dust them off.
Best wishes,
Phil Factor
April 27, 2009 at 5:53 am
@Phil, Flo:
Probably the easiest way would be to compare your respective .sqlplan actual execution plans for the queries in question, together with the usual CPU and IO information...?
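For anyone collecting those numbers, something like this around each test would make the figures comparable (standard SET options, nothing specific to this thread; the sample call just reuses the WhileSplit function from earlier):
[font="Courier New"]-- Capture CPU/IO figures comparable across systems, then save the
-- actual execution plan (.sqlplan) from SSMS for the same run
SET STATISTICS IO ON
SET STATISTICS TIME ON

DECLARE @CRLF VARCHAR(2), @n INT
SELECT  @CRLF = CHAR(13) + CHAR(10)
SELECT  @n = COUNT(*)
FROM    dbo.WhileSplit('one' + @CRLF + 'two' + @CRLF + 'three', @CRLF)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF[/font]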
Don't get me wrong - I like the existing flag, it's just a bit similar to another I could mention, and doesn't have anything very NZ about it apart from the Southern Cross. A design that incorporates a little more of NZ without throwing away the good things on the current flag would be perfect. I just think the present balance is a little off.
Paul
P.S. One obvious disadvantage of the CLR solutions, which I forgot before, is that T-SQL is more easily re-used. Copying a procedure or function definition in T-SQL is easier and faster than creating an assembly and a T-SQL stub. This is the main reason I still use a ROW_NUMBER()-generated numbers table for quick one-off string splits.
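For the curious, a minimal sketch of the kind of ad-hoc ROW_NUMBER() split I mean - the string, delimiter and row source are placeholders, and a single-character delimiter is assumed:
[font="Courier New"]-- Ad-hoc inline numbers-table split; names and values are placeholders,
-- and a single-character delimiter is assumed
DECLARE @s VARCHAR(8000), @d CHAR(1)
SELECT  @s = 'alpha,bravo,charlie', @d = ','

;WITH Nums AS
(
    SELECT TOP (LEN(@s))
           n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM   sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b
)
SELECT item_no = ROW_NUMBER() OVER (ORDER BY n),
       item    = SUBSTRING(@s, n, CHARINDEX(@d, @s + @d, n) - n)
FROM   Nums
WHERE  SUBSTRING(@d + @s, n, 1) = @d   -- n marks the start of each item[/font]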
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi