April 25, 2009 at 9:07 am
Florian Reischl (4/25/2009)
*ARGH!*
Yeah, I know how you feel...!
By the way, reading the whole data column (all rows at once) inside the CLR TVF works very well. Splitting that data also works well. However, writing the results to the database down the SqlPipe is horrible. It ends up being about as slow as the UNION ALL tally on the unique data set.
It's a shame, because the plan was to read the CSV data into a streaming SqlDataReader, writing the results back down the context connection pipe through the streaming CLR TVF interface.
The problem is that one is only allowed to do data operations in the InitMethod part of the CLR function. If you try to create a SqlDataReader in Init and then read it on the streaming side (while being called through IEnumerator) the SqlDataReader is closed and all attempts to access data fail. Annoying.
It would have been great because one could just write:
SELECT * FROM dbo.My_Streaming_TVF ('dbo.DataTable', 'KeyColumnName', 'DataColumnName')
and you would get all the results at once, with just one call through reflection (or whatever it uses to bind to the IEnumerator), and both reading and writing would have been streaming, and therefore very scalable. Oh well, never mind.
Another (non-CLR) solution I tried was to bulk-copy the data table out to a file in character format, using no field terminator, and a comma as the record terminator. Bulk copying the file back in using comma as the field terminator *and* record terminator results in the file being split quite nicely. It's actually quite fast, and should scale extremely well - it's just a shame I have to use character format to use the comma-trick: native bcp would be much much faster. I will post the code to do this, just for your amusement.
One last thing. Using SELECT INTO to create the result table is a bit unfair on any method (e.g. a stored procedure) which cannot be used that way. SELECT INTO can be bulk logged if the conditions are right, whereas INSERT INTO never is, at least on 2K5.
Cheers!
Paul
edit: TCF != TVF :c)
April 25, 2009 at 9:12 am
Derek Dongray (4/24/2009)
Bob Hovious (4/23/2009)
With respect to testing, and I apologize for not having done more, it seems there are at least three or four variables about the test data to consider:1. The number of rows in the table
2. The average size of the data in the column to be parsed
3. The average number of delimited strings to be parsed per row
4. The average size of the strings being parsed (really the inverse of #3 above).
I never thought about this before, but would the optimizer choose a different execution plan based on the characteristics (RAM, number of processors) of the server (or instance) it was running on?
I'd always been aware that all the above could affect the choice of execution plan, but usually found in practice that one plan was a clear winner in all situations so there was never any choice.
This is the first example I've found where SQLserver changes the plan based simply on the number of rows input.
... and the number of delimiters
... and the number of characters between the delimiters
... and the randomness of the position of delimiters from line to line
... and whether or not the string goes "off page" with VARCHAR(MAX) or not
--Jeff Moden
Change is inevitable... Change for the better is not.
April 25, 2009 at 9:17 am
Florian Reischl (4/25/2009)
@Bob, Jeff and the rest of the tally mastersThis becomes a bit esoteric but as Bob said "if one tally is fast, two are faster"... If my UNION ALL dual-carburetor tally is fast and Bob's double-barreled tally solution is faster, a double-barreled dual-carburetor tally is even faster . The source becomes quiet huge for a requirement to split strings, but is 10% faster and it handles the first and last element:
In my very brief testing (I'll eventually have time to publish it here) against 799 random numbers (0 to 1 billion) per row for only 1,000 rows, I actually had to abort the run for that method because it was so slow.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 25, 2009 at 9:21 am
Florian Reischl (4/25/2009)
The ArticleSorry that the article is still pending, but it is quiet hard to write an article since all test results still changing... I didn't forget, I just have to finish all tests.
Take your time, Flo... testing to make sure things are repeatable is key for this type of article. And, with so many permutations of data that affect the outcome of runs, it's gonna take some time.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 25, 2009 at 10:10 am
Flo,
As threatened^H^H^H^H promised, here is a quick example of the BCP method (in SQLCMD mode in SSMS):
Doing this in tempdb obviates the need to set recovery appropriately to enable fast bulk copy.
-- Create output table to hold the split rows
-- Column must be NULL because of leading and trailing commas
if object_id(N'tempdb.dbo.Rows', N'u') is not null drop table tempdb.dbo.Rows;
create table tempdb.dbo.Rows (Data varchar(8000) null);
-- Bulk copy out to temp file (< 400ms for me)
!!bcp tempdb.dbo.JBMTest out %TEMP%\JBMTest.bcp -c -t "" -r "," -S (local) -T
-- Bulk copy in splitting at the commas (< 4,000ms here)
!!bcp tempdb.dbo.Rows in %TEMP%\JBMTest.bcp -r "," -c -S (local) -T -h "TABLOCK"
-- Tidy up
!!del %TEMP%\JBMTest.bcp
Just tried it with Moby Dick:
197376 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 1187 Average : (166281.38 rows per sec.)
Cheers,
Paul
edit: added Moby Dick results
April 25, 2009 at 10:28 am
Paul White (4/25/2009)
Hey Flo!These new tally solutions are only fast if all the rows in dbo.JBMTest are identical.
That's exactly what my testing was showing me. Randomness of position of the delimiters from row to row makes or breaks the double barreled Tally methods and several others.
As I said, I've not had a huge amount of time to do any of the testing, but I did make a "heart breaker" set of data to test against. Here's the code to build the test table...
[font="Courier New"]/**************************************************************************************************
Purpose:
Generate highly randomized CSV rows of random numbers for CSV Splitter testing. It' virtually
impossible that this code will create any identical rows with this code. It's not "Moby Dick",
but it does produce CSV's that are more likely to be seen in the real world.
I wrote this so that it's compatible with both SQL Server 2000 and 2005 so virtually anyone
can play along. The only thing that needs a manual change to run in 2k is the object of the TOP
statement in the actual table creation code below.
Notes:
1. The code is setup for VARCHAR(8000). To test for VARCHAR(MAX), do a Search'n'Replace to
change all VARCHAR(8000) to VARCHAR(MAX).
2. If you get the following error, you've selected too many elements for the size numbers and
have most likely exceeded the VARCHAR(8000). You simply need to reduce either the range of
numbers or the number of elements, or both. Of course, that probably won't happen with
VARCHAR(MAX) if you've changed to that.
Msg 8152, Level 16, State 14, Line 32
String or binary data would be truncated.
The statement has been terminated.
3. The settings in the code are currently set to make 800 elements with numbers ranging from
1 to 1,000,000,000 (1 to 10 characters) that pretty much fill up VARCHAR(8000).
4. Look for "Change the values here" in the code below.
-- Jeff Moden
**************************************************************************************************/
--===== Change to a safe place to "play"
USE TempDB
GO
--=================================================================================================
-- Conditionally drop the special objects that we'll need to create
-- (just to simplify reruns)
--=================================================================================================
IF OBJECT_ID('TempDB.dbo.RandomPositiveInt') IS NOT NULL
DROP VIEW dbo.RandomPositiveInt
IF OBJECT_ID('TempDB.dbo.RandomIntCSV') IS NOT NULL
DROP FUNCTION dbo.RandomIntCSV
GO
--=================================================================================================
--===== Create a view to return random positive integers starting at 0.
-- This is necessary because you cannot yet use NEWID() in a UDF even in 2k5.
--================================================================================================='
CREATE VIEW dbo.RandomPositiveInt AS
SELECT ABS(CHECKSUM(NEWID())) AS RandomPositiveInt
GO
--=================================================================================================
--===== Create a function to make predictable ranges of CSV'd numbers
--=================================================================================================
CREATE FUNCTION dbo.RandomIntCSV(@NumberOfElements INT, @Range INT, @Offset INT)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @ReturnCSV VARCHAR(8000)
SELECT @ReturnCSV = ISNULL(@ReturnCSV+',','')
+ CAST(rpi.RandomPositiveInt%@Range+@Offset AS VARCHAR(10))
FROM dbo.Tally t
CROSS JOIN dbo.RandomPositiveInt rpi
WHERE t.N <= @NumberOfElements
RETURN @ReturnCSV
END
GO
--=================================================================================================
-- This section of code creates the test table based on
--=================================================================================================
--===== Declare some control variables that we can change the values of to make the generation
-- of different size and width tables easy
DECLARE @NumRows INT, --Number of rows to create
@NumElements INT, --Per row
@ElementRange INT, --Number of numbers in the range
@ElementOfffSet INT, --First number to appear in the range
@AddDelimiters CHAR(1) --If = 'Y', adds leading/trailing delimiters
--===== Change the values here
SELECT @NumRows = 1000,
@NumElements = 800,
@ElementRange = 1000000000,
@ElementOffSet = 1,
@AddDelimiters = 'Y'
--===== Conditionally drop the test table so we can rebuild it on reruns
IF OBJECT_ID('dbo.JBMTest') IS NOT NULL
DROP TABLE dbo.JBMTest
--===== Create and populate the test table on the fly
SELECT TOP (@NumRows) -- You will need to hard code the value here in 2k.
IDENTITY(INT,1,1) AS RowNum,
dbo.RandomIntCSV(@NumElements,@ElementRange,@ElementOffSet) AS CSV
INTO dbo.JBMTest
FROM dbo.Tally
--===== Conditionally add leading and trailing delimiters
IF @AddDelimiters = 'Y'
UPDATE dbo.JBMTest
SET CSV = ','+CSV+','
--===== Add a Primary Key
ALTER TABLE dbo.JBMTest
ADD CONSTRAINT PK_JBMTest_RowNum
PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100
--===== Display the stats for the new test table just for convenience in testing
SELECT COUNT(*) AS MeasuredRowCount,
@NumElements AS DesignedNumElements,
CASE
WHEN @AddDelimiters = 'Y'
THEN MAX(LEN(CSV) - LEN(REPLACE(CSV,',','')) - 1)
ELSE MAX(LEN(CSV) - LEN(REPLACE(CSV,',','')) + 1)
END AS MeasuredNumElements,
@ElementOffSet AS DesignedElementOffset,
@ElementRange AS DesignedElementRange,
@AddDelimiters AS DelimitersAdded,
MIN(LEN(CSV)) AS MinCsvLength,
AVG(LEN(CSV)) AS AvgCsvLength,
MAX(LEN(CSV)) AS MaxCsvLength
FROM dbo.JBMTest
GO
[/font]
And, Yes, I agree with Paul... if this stuff is in a file, using BCP to do the splits is one of the easiest and fastest ways.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 25, 2009 at 10:31 am
Paul White (4/25/2009)
One last thing. Using SELECT INTO to create the result table is a bit unfair on any method (e.g. a stored procedure) which cannot be used that way. SELECT INTO can be bulk logged if the conditions are right, whereas INSERT INTO never is, at least on 2K5.
Still, I believe you'll find that SELECT INTO is still faster than INSERT INTO even on a FULL recovery database.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 25, 2009 at 10:40 am
Jeff Moden (4/25/2009)
That's exactly what my testing was showing me. Randomness of position of the delimiters from row to row makes or breaks the double barreled Tally methods and several others.
The point I was making was that the UNION ALL method and it's derivatives split one row and then copy the results for all subsequent rows. The relatively good performance relies on the rows being *exact duplicates*. On more realistic data all these methods appear slower than the original tally solution.
The tally methods seem to dislike rows with relatively few delimiters and/or long strings. This may be because these methods do a substring at every character position - with few delimeters, much of this effort is wasted. With long strings, the substrings take longer.
Jeff Moden (4/25/2009)
...if this stuff is in a file, using BCP to do the splits is one of the easiest and fastest ways.
For some large tasks, it may be worth doing the bcp out and in trick.
Paul
April 25, 2009 at 10:44 am
Jeff Moden (4/25/2009)
Paul White (4/25/2009)
One last thing. Using SELECT INTO to create the result table is a bit unfair on any method (e.g. a stored procedure) which cannot be used that way. SELECT INTO can be bulk logged if the conditions are right, whereas INSERT INTO never is, at least on 2K5.Still, I believe you'll find that SELECT INTO is still faster than INSERT INTO even on a FULL recovery database.
I'm not sure that is true for a small number of rows. Maybe Flo could run 201 tests...? :laugh:
I was thinking of tempdb really because that is where a lot of tests get run.
TempDB always has Simple recovery which would give SELECT INTO the maximum advantage.
April 25, 2009 at 11:29 am
Paul White (4/25/2009)
The point I was making was that the UNION ALL method and it's derivatives split one row and then copy the results for all subsequent rows. The relatively good performance relies on the rows being *exact duplicates*. On more realistic data all these methods appear slower than the original tally solution.
Agreed for the most part. As with you, my all-to-brief testing showed that when every row has delimiters in the same spot or very few delimiters over wide rows, the UNION ALL method runs quite fast. Randomize the postion of the delimiters from row-to-row and it slows down quite a bit. I'm still doing some occasional testing but, right now, the Union All method still appears to be faster than a flat Tally solution by a factor of about 700% (7 times faster) on the worst cast of totally random delimiters like what my test code makes.
For some large tasks, it may be worth doing the bcp out and in trick.
Absolutely agreed on this. The only step you've left out is hunting down and force feeding a whole truck load of pork chops to the moron (or, "moroff") that originally populated the column as a CSV. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
April 25, 2009 at 11:46 am
Paul White (4/25/2009)
Maybe Flo could run 201 tests...? :laugh:
Pew... I give up with all those static test environments. I currently designed a generic test environment to cover different test at once. Please let me know if there is anything wrong. Yes, it uses a WHILE-loop but only to ensure that any type of strings can simply be created.
@Jeff
Sorry, your posted test environment is really good but I need something more flexible and a possibility to run different tests at once.
[font="Courier New"]SET NOCOUNT ON
GO
-- Just create dummy table for SSMS-2k8 intelli sense
IF (OBJECT_ID('dbo.Strings') IS NOT NULL)
DROP TABLE dbo.Strings
CREATE TABLE dbo.Strings (Id INT NOT NULL, Csv VARCHAR(8000))
GO
----===============================================
---- Create a configuration table for all the CSV tests
IF (OBJECT_ID('dbo.StringsConfig') IS NOT NULL)
DROP TABLE dbo.StringsConfig
CREATE TABLE dbo.StringsConfig
(
Id INT NOT NULL IDENTITY,
ItemLength INT,
ItemCount INT,
ItemRows INT,
ItemType NVARCHAR(128)
)
GO
----===============================================
---- Fill any type of configurations which shall be tested
INSERT INTO dbo.StringsConfig
SELECT 10, 10, 10, N'VARCHAR(111)'
UNION ALL SELECT 10, 10, 1000, N'VARCHAR(111)'
--UNION ALL SELECT 10, 10, 10000, N'VARCHAR(111)'
--UNION ALL SELECT 10, 10, 20000, N'VARCHAR(111)'
--UNION ALL SELECT 10, 10, 100000, N'VARCHAR(111)'
UNION ALL SELECT 10, 100, 10, N'VARCHAR(1101)'
--UNION ALL SELECT 10, 100, 1000, N'VARCHAR(1101)'
--UNION ALL SELECT 10, 100, 10000, N'VARCHAR(1101)'
--UNION ALL SELECT 10, 100, 20000, N'VARCHAR(1101)'
UNION ALL SELECT 10, 1000, 10, N'VARCHAR(MAX)'
--UNION ALL SELECT 10, 1000, 100, N'VARCHAR(MAX)'
--UNION ALL SELECT 10, 1000, 1000, N'VARCHAR(MAX)'
--UNION ALL SELECT 10, 1000, 5000, N'VARCHAR(MAX)'
UNION ALL SELECT 100, 10, 10, N'VARCHAR(1011)'
--UNION ALL SELECT 100, 10, 100, N'VARCHAR(1011)'
--UNION ALL SELECT 100, 10, 1000, N'VARCHAR(1011)'
--UNION ALL SELECT 100, 10, 10000, N'VARCHAR(1011)'
--UNION ALL SELECT 100, 10, 20000, N'VARCHAR(1011)'
GO
----===============================================
---- Delete previous test results from PerfMon table
DELETE FROM PerfMon
GO
----===============================================
--- This is the batch which contains all tests
--- It will be executed several times by: GO xxx
DECLARE @ConfigId INT
DECLARE @ItemLength INT
DECLARE @ItemCount INT
DECLARE @ItemRows INT
DECLARE @ItemType NVARCHAR(128)
DECLARE @sql NVARCHAR(MAX)
-- Get the next configuration for strings table
SELECT TOP(1)
@ConfigId = Id,
@ItemLength = ItemLength,
@ItemCount = ItemCount,
@ItemRows = ItemRows,
@ItemType = ItemType
FROM dbo.StringsConfig
ORDER BY Id
-- Delete current configuration
DELETE FROM StringsConfig WHERE Id = @ConfigId
IF (@@ROWCOUNT = 0)
RETURN
----===============================================
---- Client output for current configuration
PRINT ('')
PRINT ('')
PRINT ('----===================================================')
PRINT ('---- Next Run')
PRINT ('---- ' +
'Item Length: ' + CONVERT(VARCHAR(10), @ItemLength) +
' | Count Per Row: ' + CONVERT(VARCHAR(10), @ItemCount) +
' | Row Count: ' + CONVERT(VARCHAR(10), @ItemRows) +
' | Item Type: ' + @ItemType)
----===============================================
---- Create a new dbo.Strings table with current configuration
-- Drop existing
EXECUTE sp_executesql N'IF (OBJECT_ID(''dbo.Strings'') IS NOT NULL) DROP TABLE dbo.Strings'
-- Create new dbo.Strings table
SELECT @sql = 'CREATE TABLE dbo.Strings (Id INT NOT NULL, Csv ' + @ItemType + ')'
EXECUTE sp_executesql @sql
----===============================================
--- Dynamic SQL to fill the current dbo.Strings table with test data
SELECT @sql = '
INSERT INTO dbo.Strings (Id, Csv)
SELECT TOP(@ItemRows)
N, ''''
FROM dbo.Tally
DECLARE @ReplicationCount INT
SELECT @ReplicationCount = CASE WHEN @ItemLength <= 36 THEN 1 ELSE @ItemLength / 36 END
WHILE (@ItemCount > 0)
BEGIN
UPDATE dbo.Strings SET Csv = Csv + '','' + CONVERT(' + @ItemType + ', LEFT(REPLICATE(NEWID(), @ReplicationCount), @ItemLength))
SELECT @ItemCount = @ItemCount - 1
END
UPDATE dbo.Strings SET Csv = Csv + '',''
ALTER TABLE dbo.Strings
ADD CONSTRAINT PK_Stringgs
PRIMARY KEY CLUSTERED (Id)
'
EXECUTE sp_executesql
@sql,
N'@ItemCount INT, @ItemLength INT, @ItemRows INT',
@ItemCount = @ItemCount,
@ItemLength = @ItemLength,
@ItemRows = @ItemRows
/* CHECK TEST DATA
SELECT TOP(100) LEN(Csv), * FROM dbo.Strings
*/
----===============================================
---- Discard all possibly existing, definitely wrong plans
DBCC FREEPROCCACHE WITH NO_INFOMSGS
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS
----================================================================
----
---- PLACE FOR ANY TESTS ON CSV ALIKE DATA
----
----================================================================
GO 10 -- MUST BE AT LEAST THE COUNT OF ITEMS IN StringsConfig
[/font]
Greets
Flo
April 25, 2009 at 11:55 am
Florian Reischl (4/25/2009)
@JeffSorry, your posted test environment is really good but I need something more flexible and a possibility to run different tests at once.
Heh... no need to apologize, Flo... at least we're making some progress towards something more real-world than totally static, homogenous, and single line testing (although single line testing is appropriate for GUI application of split functions).
--Jeff Moden
Change is inevitable... Change for the better is not.
April 25, 2009 at 1:35 pm
[p]I'm afraid I've been still plugging away with splitting chunks of Moby Dick into lines. This is to try and find which function returns a table of lines from Moby Dick the fastest.[/p]
[p]Why Moby Dick? Because it is real text with lots of variation in the number of lines and all sorts of unpredictabilities. There is lots of it too. It is easy to get hold of.[/p]
[p]Why find the best function to do this? Because the key to the original task is surely to cross apply with a function, if we are to get a realistic comparison between techniques. [/p]
[p]Why persevere? Because it is revealing a lot about the comparative performance of the various techniques under the sort of heavy load you get from -say- importing a logfile from IIS, or a datafeed from a piece of test equipment.[/p]
Testrun on a slow server (The Dog-2005) (horizontal line is size of string)- see the key on the lower graph. bloody thing seems to have vanished.
Testrun on fast server (2005)
Best wishes,
Phil Factor
April 25, 2009 at 2:22 pm
So are we just saying it depends?
Seriously, if this can be boiled down to performance characteristics tied to a matrix of variables, it would be a valuable resource to help select which technique(s) to use. Developers could look at the characteristics, including volume , of the data to be parsed and pick an appropriate technique. Phil, I've never had to parse anything like Moby Dick, but I would not bet the ranch that I will never be asked to.
Jeff: How's the book coming along. ?
Flo: I love the RE-double. I have a feeling that learning to nest Cross Apply queries is going to prove valuable in other areas, as is knowing when to TOP().
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 25, 2009 at 3:40 pm
Phil, I've never had to parse anything like Moby Dick, but I would not bet the ranch that I will never be asked to.
Most data feeds are a bit more regular, I'll admit, but then it is great for testing text parsing routines. I've had to parse larger strings than this before now. (e.g. daily feed from a telecommunications switch) though I've usually used BCP to do the line-splitting for me. I've been more interested in doing this sort of thing for website files for fast indexing, recently.
Best wishes,
Phil Factor
Viewing 15 posts - 211 through 225 (of 522 total)
You must be logged in to reply to this topic. Login to reply