April 25, 2009 at 7:48 pm
Phil Factor (4/25/2009)
Most data feeds are a bit more regular, I'll admit, but then it is great for testing text parsing routines. I've had to parse larger strings than this before now. (e.g. daily feed from a telecommunications switch) though I've usually used BCP to do the line-splitting for me. I've been more interested in doing this sort of thing for website files for fast indexing, recently.
Phil,
The graphs are interesting, but run somewhat contrary to other test results in this thread. I would once again invite you to clarify the exact code for the Quirky and While methods, together with the core code for the test rig. It is frustrating not to be able to validate your findings for myself.
As it stands, I am forming the opinion that Flo's CLR-with-chars method or the basic CLR TVF is the best general solution, as it performs best, or second best in all of the tests he ran, including Moby Dick.
A quick post from you containing the exact code for Quirky and While would be great.
Paul
edit: P.S. I had to import 3TB of web log files last week. I have always found Microsoft LogParser to be a good solution - no point in re-inventing the wheel, eh? 😉
April 26, 2009 at 4:10 am
Paul White (4/25/2009)
By the way, reading the whole data column (all rows at once) inside the CLR TVF works very well. Splitting that data also works well. However, writing the results to the database down the SqlPipe is horrible.
*affirmative*
I really don't what they do when they try to get the data back to server.
It's a shame, because the plan was to read the CSV data into a streaming SqlDataReader, writing the results back down the context connection pipe through the streaming CLR TVF interface.
The problem is that one is only allowed to do data operations in the InitMethod part of the CLR function. If you try to create a SqlDataReader in Init and then read it on the streaming side (while being called through IEnumerator) the SqlDataReader is closed and all attempts to access data fail. Annoying.
It would have been great because one could just write:
SELECT * FROM dbo.My_Streaming_TVF ('dbo.DataTable', 'KeyColumnName', 'DataColumnName')
and you would get all the results at once, with just one call through reflection (or whatever it uses to bind to the IEnumerator), and both reading and writing would have been streaming, and therefore very scalable. Oh well, never mind.
Also completely confirm this! Since it is not possible to use input streaming while output streaming you always have to ensure that the count of data is not to much to be handled in one step. It would be great if this becomes possible in future version. SQL Server 2008 also doesn't support...
Greets
Flo
April 26, 2009 at 4:13 am
[p]I've already posted the Quirky and While solutions, but I put the harness in just to make the post more interesting to others.[/p]
[font="Courier New"]ALTER FUNCTION [dbo].[splitStringToTable](@String NVARCHAR(MAX), @Delimiter VARCHAR(255))
/*
This splits a string based on a delimiting substring. It returns a table of the resulting strings, and the locations within the original string.
This technique uses the 'Quirky Update' on a table-update. A table is created with the correct number of delimiters within the string, and is then updated to contain the locations of all the delimiters, and the actual text between the delimiters.
I suspect that there is an improvement to be made somewhere as it should be as fast as the WHILE loop.
*/
RETURNS @strings TABLE
(
line_no INT IDENTITY(1,1),
theIndex INT DEFAULT 0,
previousIndex INT DEFAULT 0,
TheLine VARCHAR(MAX) DEFAULT '')
AS
-- body of the function
BEGIN
DECLARE @TheIndex INT, @Previousindex INT,@LenDelimiter INT
SELECT @TheIndex=-1, @LenDelimiter=LEN(@Delimiter)
--create the string table unfilled but the right length
INSERT INTO @strings(TheIndex)
SELECT 0 FROM numbers
WHERE number <=(LEN(@String)
-LEN(REPLACE(@String,@Delimiter,'')))/LEN(@Delimiter)+1
--and populate the table
UPDATE @strings
SET @PreviousIndex=PreviousIndex=@TheIndex,
@TheIndex=theIndex=CASE WHEN @PreviousIndex<>0
THEN CHARINDEX(@Delimiter,@String,@PreviousIndex+1)ELSE 0 END,
TheLine=CASE WHEN @TheIndex>0 THEN
SUBSTRING(@String,@PreviousIndex+@LenDelimiter,
@TheIndex-@previousindex-@LenDelimiter)
WHEN @PreviousIndex>0 THEN
SUBSTRING(@String,@PreviousIndex+@LenDelimiter,
LEN(@String)-@previousindex)
ELSE '' END
RETURN
END
[/font]
[p] ...and the while loop...[/p]
[font="Courier New"]GO
/*
This splits a string based on a delimiting substring. It returns a table of the resulting strings, and the locations within the original string.
This technique uses the 'While Loop'.
*/
ALTER FUNCTION [dbo].[WhileSplit](@String NVARCHAR(MAX), @Delimiter VARCHAR(255))
RETURNS @strings TABLE
(
line_no INT IDENTITY(1,1),
theIndex INT DEFAULT 0,
previousIndex INT DEFAULT 0,
TheLine VARCHAR(MAX) DEFAULT '')
AS
-- body of the function
BEGIN
DECLARE @pos INT, @next INT, @LenDelimiter INT
SELECT @pos = 1, @lenDelimiter=LEN(@Delimiter),
@next = CHARINDEX(@Delimiter, @String, @pos)
WHILE (@next > 0)
BEGIN
INSERT INTO @Strings (PreviousIndex,TheIndex,TheLine)
SELECT @pos, @next, SUBSTRING(@String, @pos, @next - @pos)
SELECT @pos = @next + @lenDelimiter,
@next = CHARINDEX(@Delimiter, @String, @pos)
END
INSERT INTO @Strings (PreviousIndex,TheIndex,TheLine)
SELECT @pos, @next, SUBSTRING(@String, @pos, LEN(@STRING)-@POS+1)
RETURN
END
[/font]
[p] and the current test harness (this is something I keep fiddling with[/p]
[font="Courier New"]--create a log table to record timings and results
DECLARE @log TABLE
(
Log_ID INT IDENTITY(1, 1),
Event VARCHAR(40),
param1 INT, param2 INT,
InsertionDate DATETIME DEFAULT GETDATE()
)
--various variables
DECLARE @LotsOfText VARCHAR(MAX),
@TestText VARCHAR(MAX),
@Test VARCHAR(80),
@ii INT,
@lentext INT,
@lenString INT,
@linesFound INT
--flag the start of the test run in the log
INSERT INTO @log (event,param1,param2) SELECT 'Start Test run', NULL, NULL
--have we a data table? if not then create it
IF NOT EXISTS (SELECT 1
FROM information_schema.tables
WHERE table_name='SampleText')
CREATE TABLE SampleText(SampleText_ID INT IDENTITY(1,1),
TheText VARCHAR(MAX))
--have we data in the table. If not then read it in.
IF NOT EXISTS (SELECT 1 FROM sampleText)
BEGIN
SELECT @LotsOfText = BulkColumn
-- 2005/8 only for this code)
FROM OPENROWSET(BULK 'c:\workbench\moby-dick.txt',
SINGLE_BLOB) AS x
INSERT INTO SampleText(TheText) SELECT @lotsOfText
END
SELECT @LotsOfText=TheText FROM SampleText WHERE SampleText_ID=1
SELECT @ii=10 --100%, 1=10%. Start with all, then decrease.
SELECT @LenString=LEN(@LotsOfText)--for calulating chop of data.
WHILE @ii>0
BEGIN
--start by creating the size of string for the test
SELECT @TestText =
CASE @ii WHEN 10 THEN @LotsOfText
ELSE SUBSTRING( @LotsOfText,1, CAST((@lenString*(@ii/10.00)) AS INT))
END
SELECT @LenText=LEN(@TestText)
DECLARE @bucket TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*), @Test='Tally'
FROM dbo.SplitTally(@TestText)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket (line) SELECT line
FROM dbo.SplitTally(@TestText)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
DECLARE @bucket1 TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*), @Test='Regex Split'
FROM dbo.RegExSplit (dbo.regexescape('
'),@TestText,1)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket1 (line) SELECT MATCH
FROM dbo.RegExSplit (dbo.regexescape('
'),@TestText,1)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
DECLARE @bucket2 TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*) , @Test='Quirky Update'
FROM dbo.splitStringToTable(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket2 (line) SELECT Theline
FROM dbo.splitStringToTable(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
DECLARE @bucket3 TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*) , @Test='CLR Split'
FROM dbo.DanielJ(@TestText)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket3 (line) SELECT match
FROM dbo.DanielJ(@TestText)
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
DECLARE @bucket4 TABLE
(
line VARCHAR(8000)
)
SELECT @LinesFound=COUNT(*) , @Test='WHILE Split'
FROM WhileSplit(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket4 (line) SELECT theLine
FROM WhileSplit(@TestText,'
') INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
SELECT @LinesFound=COUNT(*) , @Test='Double-barrel tally'
FROM DoubleBarreledTally(@TestText,'
')
INSERT INTO @log (event,param1,Param2) --report time
SELECT @Test, @LenText, @LinesFound
INSERT INTO @bucket4 (line) SELECT theLine
FROM DoubleBarreledTally(@TestText,'
')
innsert INTO @log (event,param1,Param2) --report time
SELECT @Test+ ' insert', @LenText, @LinesFound
SELECT @ii=@ii-1
END
--Now see how long it took...
SELECT event,
[length of string] = param1,
[No. Lines] =param2,
[Time (Ms)] = DATEDIFF(ms,
(SELECT [InsertionDate]
FROM @log f
WHERE f . Log_ID = g . Log_ID - 1
),
[InsertionDate]
)
FROM @log g
WHERE Log_ID >1
GO
[/font]
Best wishes,
Phil Factor
April 26, 2009 at 4:17 am
Hi Phil
This results would be great for cursors! As Paul already wrote could you point me to the "Quirky Update" and "While Loop"? Then I can include it into the other tests.
Thanks a lot!
Flo
April 26, 2009 at 4:20 am
Paul White (4/25/2009)
- no point in re-inventing the wheel, eh? 😉
One of the hobbies of my chief. We have an own string, DateTime, List, DB-Connection, even an own File... :hehe:
April 26, 2009 at 4:23 am
Phil Factor (4/26/2009)
I've already posted the Quirky and While solutions, but I put the harness in just to make the post more interesting to others.
Thanks! I will include it into my other test when I'm back!
... now it's time for barbecue 🙂
Greets
Flo
April 26, 2009 at 5:29 am
A quick post before bed concerning the TVFs posted by Phil.
Unless I am missing something obvious, this solution is heavily dependent on the TVF only being called once.
This is achieved in the script by reading the whole of Moby Dick into a NVARCHAR(MAX) variable.
Changing it's datatypes to VARCHAR(8000) on the way in and VARCHAR(100) out, and running it against the first 200 lines of Jeff's test table:
ALTER FUNCTION [dbo].[WhileSplit](@String VARCHAR(8000), @Delimiter VARCHAR(255))
*snip*
TheLine VARCHAR(100) DEFAULT '')
*snip*
GO
SELECTWS.line_no, WS.theLine
INTOdbo.Result
FROM(SELECT TOP 200 * FROM dbo.JBMTest) AS J
CROSS
APPLYdbo.WhileSplit (J.CSV, ',') AS WS
Table 'Worktable'. Scan count 1, logical reads 363973, physical reads 566, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#0BC6C43E'. Scan count 200, logical reads 800, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'JBMTest'. Scan count 1, logical reads 203, physical reads 152, read-ahead reads 479, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(160200 row(s) affected)
SQL Server Execution Times:
CPU time = 141484 ms, elapsed time = 189769 ms.
I ran it on the first 200 rows only because of time constraints. As it is, it took over three minutes...
The actual and estimated execution plans are attached. The actual plan doesn't show the inner execution of the TVF so that's why the estimated one is there.
Cheers,
Paul
April 26, 2009 at 5:31 am
Florian Reischl (4/26/2009)
Paul White (4/25/2009)
- no point in re-inventing the wheel, eh? 😉One of the hobbies of my chief. We have an own string, DateTime, List, DB-Connection, even an own File... :hehe:
I have no words to adequately convey my feelings on that :crazy:
:sigh:
You have my sympathy!
Paul
April 26, 2009 at 7:24 am
Hey Paul, I followed your link about the new flag. My son has lived in Dunedin for years now, but never mentioned this to me. As an American, I can cast no vote, but personally I favor the Silver Fern. 🙂
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 26, 2009 at 9:49 am
Hi Phil
I just ran exactly your script on my system. I just removed the Tally version because this is definitely the wrong decision for this requirement.
The chart is attached. Hope it will also be shown here...
I don't know why, but there seems to be a huge difference in the execution between our systems.
Edit: Please forgive the lousy formatting. I'm no Excel guru...
Greets
Flo
April 26, 2009 at 11:50 am
Wow... Just looked back at some of the graphs.... I've gotta find out which While Loop and Tally table methods Phil used to come up with his graphs because those numbers just seem really odd to me. Also, if I recall correctly, when Phil and I did some testing on Moby Dick a while back, the while loop won on his machine while the Tally table one on mine.
Heh... The "split" function is certainly turning out to be a huge "it depends". Depends on the machine, depends on delimiter density, depends on delimiter randomness between rows, etc, etc, etc. Flo, I don't envy you in the task of writing and article about all of this.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 26, 2009 at 3:03 pm
Bob Hovious (4/26/2009)
Hey Paul, I followed your link about the new flag. My son has lived in Dunedin for years now, but never mentioned this to me. As an American, I can cast no vote, but personally I favor the Silver Fern. 🙂
Yeah, me too - I had it as my avatar for a while, but figured I should spread the love a little.
The Silver Fern would make one cool flag - the only slight downside would be the licensing costs to the All Blacks...
I think the 'new flag' guys welcome support from whatever quarter, so thanks for that.
@Flo
Regex is pretty awesome for long strings eh?
April 26, 2009 at 3:38 pm
[font="Verdana"]I like the southern cross, and I like the silver fern and I like the kiwi. They're all our national symbols. I don't see why the flag need change (unless we become a republic, which at the moment doesn't seem to be a serious thing.)
Now, back to the excellent code!
I did have one thought, although I don't see it as practical in this case. When I was at University we were doing performance analysis on sorting algorithms (which I believe is a classic). We noticed that for small input, even a bubble sort would beat a quick sort, but a bubble sort soon fell out of fashion as the input size increased. However, a quick sort works by chunking the input, and as the chunk sizes got smaller, you could swap sort algorithms to get the best performance.
Would it be possible to use a hybrid solution? I don't think so, because it appears that the result can't be "chunked" (or partitioned) in any way. It also seems that the two CLR solutions are still performing well.
[/font]
April 26, 2009 at 4:17 pm
Hmmm.. I'm really scratching my head now wondering about the differences between the two test runs, Flo. At least we can agree that the Tally doesn't cut it when the delimiters are widely spaced. With my system, both the regex solutions (Yours and mine) produced identical dire performance, so bad I took them out of the test because they made the test runs take far too long.
If we can find out what makes the CLR solutions run so fast on your tests, we ought to bottle it and sell it to everyone. There is a lot of work to do.
In answer to paul's comments about the cross apply, It doesn't surprise me that it the CROSS APPLY runs like a fly in syrup. I have Quirky and WHILE loop solutions to Flo's original problem, but I got a bit distracted when I got interested in the relative performance of various TVFs on large strings. I'll dust them out.
Best wishes,
Phil Factor
April 27, 2009 at 5:53 am
@Phil, Flo:
Probably the easiest way would be to compare your respective .sqlplan actual execution plans for the queries in question, together with the usual CPU and IO information...?
Don't get me wrong - I like the existing flag, it's just a bit similar to another I could mention, and doesn't have anything very NZ about it apart from the Southern Cross. A design that incorporates a little more of NZ without throwing away the good things on the current flag would be perfect. I just think the present balance is a little off.
Paul
P.S. One obvious disadvantage of the CLR solutions which I forgot before is that T-SQL is more easily re-used. Copying a procedure or function definition in T-SQL is easier and faster than creating an assembly and a T-SQL stub. This is the main reason I still use a ROW_NUMBER() generated numbers table for quick one-off string splits.
Viewing 15 posts - 226 through 240 (of 522 total)
You must be logged in to reply to this topic. Login to reply