May 4, 2009 at 3:51 am
I did some testing myself using the test suite Jeff posted a few posts back. Thanks, it ran fine the first time... and on my machine (a pretty fast one) the timings indicate my I/O-less function is just as fast as the cross join method Jeff made.
In fact all methods are pretty close together; what it now comes down to is resource use. We should determine whether the table-less versions cap out on memory at some point. Again, I have plenty of memory, so it is hard for me to test as I can't restrict the amount of memory SQL Server can use on my setup; other people are working on it too. And we need to put each method into a set of different solutions to see how they affect query plans and overall efficiency. I think this is where the real differences will show up!
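For anyone who does have a box they can reconfigure, capping memory for a test like this can be done with the standard max server memory setting; the value below is only illustrative and should be reset afterwards.
exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'max server memory (MB)', 1024; -- illustrative cap, remember to reset it
reconfigure;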
Can someone else take a look at it as well?
And I am also interested to see how this function works on lesser equipment....calling Jeff here ;)...
May 4, 2009 at 8:29 am
Here is an update to my tally function; it now has slightly less overhead because two cross joins were eliminated. I did this by moving from a 16-row (4-bit) constant table to a 64-row (6-bit) one.
The result is a slight speed bump when using large numbers and less clutter when examining execution plans in code that uses the function!
-- Tally table function (24 bit, large enough for general purpose use)
--
create function dbo.tfnTally24bit( @max int ) returns table
as
return
(
with
nQ( N ) as
(
select 0 union all select 0 union all select 0 union all select 0 union all -- 4
select 0 union all select 0 union all select 0 union all select 0 union all -- 8
select 0 union all select 0 union all select 0 union all select 0 union all -- 12
select 0 union all select 0 union all select 0 union all select 0 union all -- 16
select 0 union all select 0 union all select 0 union all select 0 union all -- 20
select 0 union all select 0 union all select 0 union all select 0 union all -- 24
select 0 union all select 0 union all select 0 union all select 0 union all -- 28
select 0 union all select 0 union all select 0 union all select 0 union all -- 32
select 0 union all select 0 union all select 0 union all select 0 union all -- 36
select 0 union all select 0 union all select 0 union all select 0 union all -- 40
select 0 union all select 0 union all select 0 union all select 0 union all -- 44
select 0 union all select 0 union all select 0 union all select 0 union all -- 48
select 0 union all select 0 union all select 0 union all select 0 union all -- 52
select 0 union all select 0 union all select 0 union all select 0 union all -- 56
select 0 union all select 0 union all select 0 union all select 0 union all -- 60
select 0 union all select 0 union all select 0 union all select 0 -- 64
)
select top ( isnull( @max, 0 ) )
row_number() over ( order by anchor.constant ) as n
from
( select 0 ) as anchor( constant )
cross join nQ as n1 -- 64 ( 6 bit)
cross join nQ as n2 -- 4096 ( 12 bit)
cross join nQ as n3 -- 262144 ( 18 bit)
cross join nQ as n4 -- 16777216 ( 24 bit)
)
;
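A quick usage sketch (the 100 is just an example): the function returns the numbers 1 through the value you pass in.
select n
from dbo.tfnTally24bit( 100 );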
Enjoy
May 5, 2009 at 9:20 pm
Me again....
I modified the classic Tally solution for string splitting as found on the initial pages and made speed improvements.
This solution is unlikely to suffer from the predictable-data problem that the really fast solutions turned out to fall victim to.
This code is based on that of Jeff, which was then improved by Barry, and now it's my turn... grin.
I will explain in a bit, after you have inspected the code for yourself :).
create function dbo.fnSplitClassicTweak
(
@parameter varchar(Max)
, @Separator Varchar(64)
, @sectorsize int = 8 -- have this size closely match the expected separator-less blocks in the input
)
returns
@Items TABLE
(
ID INT identity(1,1) primary key clustered -- the element number
, item VARCHAR(8000) not null -- the split-out string element
, OffSet int not null -- the original offset (not entirely accurate if LEN(@Separator) > 1 because of the Replace())
)
as
begin
-- our separator character (convenient, doesn't affect performance)
declare @Sep char(1);
select @Sep = char( 10 );
--NOTE: we make the @Sep character LF so that we will automatically
-- parse out rogue LF-only line breaks.
--===== Add start and end separators to the parameter so we can handle
-- all the elements the same way.
-- Also replace the separator expression with our single separator
-- character so every separator has a width of 1.
select @Parameter = @Sep + Replace( @Parameter, @Separator, @Sep ) + @Sep;
insert into @Items( Offset, item )
select
(charpos.N + sector.N) + 1
, substring( @Parameter, (charpos.N + sector.N) + 1, charindex( @Sep, @Parameter, (charpos.N + sector.N) + 1) - (charpos.N + sector.N) - 1 )
from
(
select
@sectorsize * (s.n - 1) as N
from
dbo.tfnTally24bit( (datalength( @parameter ) + @sectorsize - 1) / @sectorsize ) as s
where
-- Notice how we identify sectors of interest
charindex( @Sep, substring( @parameter, 1 + (@sectorsize * (s.n - 1)), @sectorsize ) ) != 0
) as sector
cross join dbo.tfnTally24bit( @sectorsize ) as charpos
where
-- Notice how we find the separators within a sector of interest
(charpos.N + sector.N) < datalength( @Parameter ) and substring( @Parameter, (charpos.N + sector.N), 1 ) = @Sep
order by 1
;
return;
end
go
This is what I call an engineering solution... it addresses some of the drawbacks and adds a few minor tweaks, no radical changes.
Improvements
1. The tally function from my previous post (no I/O, and in all tests I have done so far it performs the best).
2. The result table now has an identity column, which makes it possible to dispose of the row_number() over construction, as an order by 1 on the character offset generates the IDs.
3. Added a clustered primary key in the result table.
4. There is now a configurable mechanism that splits the input string into evenly sized pieces (I call them sectors). You can configure the sector size so it best matches the typical distance between separators in your data.
Point 4 is by far the most significant improvement, because no longer is every character inspected by a costly substring operation. First all sectors are scanned using charindex; if a match is found, that sector becomes part of the derived table containing all sectors to be inspected in the classic way. So if your sector size is small enough that plenty of sectors contain no separator, yet still larger than 3 to 4 characters, you should save on string manipulation costs and thus gain speed.
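A quick usage sketch (the sample string and sector size are just an example, not from my tests):
select ID, item, OffSet
from dbo.fnSplitClassicTweak( 'alpha,beta,gamma,delta', ',', 8 );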
I hope to have given a new impulse and some new ideas for you guys to incorporate into your own solutions, be it string splitting or otherwise 🙂
Looking forward to some replies; it's getting lonely here, and this was a hard, sleepless night of work!
Cheers!
Special note to Jeff: I saved your tally.......it beats the cursor based function 🙂
May 5, 2009 at 9:32 pm
peter (5/4/2009)
Here is an update to my tally function, it has now slightly less overhead by eliminating two cross joins.
As you keenly observed before, all of these are now in the area of greased lightning and it comes down to which kind of resources you want to expend. Here's the same million row test on my "slow" box with your latest function...
[font="Courier New"]====================================================================================================
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
===== Matt Miller's Method =====
SQL Server Execution Times:
CPU time = 906 ms, elapsed time = 979 ms.
====================================================================================================
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
===== Itzek's Method =====
SQL Server Execution Times:
CPU time = 844 ms, elapsed time = 853 ms.
====================================================================================================
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
===== Jeff Moden's Method
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table 'syscolrdb'. Scan count 2, logical reads 98, physical reads 0, read-ahead reads 115, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 1, logical reads 7, physical reads 1, read-ahead reads 16, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 719 ms, elapsed time = 832 ms.
====================================================================================================
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
===== RBarryYoung's Method
Table 'syscolrdb'. Scan count 2, logical reads 22, physical reads 2, read-ahead reads 97, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 2, logical reads 10, physical reads 1, read-ahead reads 47, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 734 ms, elapsed time = 808 ms.
====================================================================================================
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
===== Combined Method
Table 'syscolrdb'. Scan count 2, logical reads 12, physical reads 0, read-ahead reads 92, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 2, logical reads 14, physical reads 1, read-ahead reads 16, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 703 ms, elapsed time = 865 ms.
====================================================================================================
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
[highlight="YELLOW"]===== Peter's Method
SQL Server Execution Times:
CPU time = 719 ms, elapsed time = 726 ms.[/highlight][/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
May 5, 2009 at 9:41 pm
Thanks for the information Jeff...care to take a quick look at what your string splitter evolved into as well?
PS.
I still see room for improvement in the classic method... my strategy is to eliminate as many substring operations as possible, as that is the main killer!
In fact I already have a faster version ready, but am still trying to squeeze more out of it 🙂
First a slight tally update:
-- Tally table function (24 bit, large enough for general purpose use)
--
alter function dbo.tfnTally24bit( @max int ) returns table
as
return
(
with
nQ( N ) as
(
select 0 union all select 0 union all select 0 union all select 0 union all -- 4
select 0 union all select 0 union all select 0 union all select 0 union all -- 8
select 0 union all select 0 union all select 0 union all select 0 union all -- 12
select 0 union all select 0 union all select 0 union all select 0 union all -- 16
select 0 union all select 0 union all select 0 union all select 0 union all -- 20
select 0 union all select 0 union all select 0 union all select 0 union all -- 24
select 0 union all select 0 union all select 0 union all select 0 union all -- 28
select 0 union all select 0 union all select 0 union all select 0 union all -- 32
select 0 union all select 0 union all select 0 union all select 0 union all -- 36
select 0 union all select 0 union all select 0 union all select 0 union all -- 40
select 0 union all select 0 union all select 0 union all select 0 union all -- 44
select 0 union all select 0 union all select 0 union all select 0 union all -- 48
select 0 union all select 0 union all select 0 union all select 0 union all -- 52
select 0 union all select 0 union all select 0 union all select 0 union all -- 56
select 0 union all select 0 union all select 0 union all select 0 union all -- 60
select 0 union all select 0 union all select 0 union all select 0 -- 64
)
select top ( isnull( @max, 0 ) )
cast( row_number() over ( order by anchor.constant ) as int ) as n
, cast( row_number() over ( order by anchor.constant ) as tinyint ) as n8
, cast( row_number() over ( order by anchor.constant ) as smallint ) as n16
, row_number() over ( order by anchor.constant ) as n64
from
( select 0 ) as anchor( constant )
cross join nQ as n1 -- 64 ( 6 bit)
cross join nQ as n2 -- 4096 ( 12 bit)
cross join nQ as n3 -- 262144 ( 18 bit)
cross join nQ as n4 -- 16777216 ( 24 bit)
)
;
The cast to a 32 bit integer is slightly faster when done in the tally function instead of relying on numerous implicit casts in the consuming queries!
I also added a few columns of other integer sizes; only the ones you actually select will be generated. So far I have detected no overhead from this, and it makes selecting a certain integer type more convenient.
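A quick sketch of what I mean (the row count is just an example): reference only the column whose type you need and, at least in my tests so far, the other expressions add no overhead.
-- only the smallint column is referenced, so the consuming query needs no implicit casts
select t.n16
from dbo.tfnTally24bit( 10000 ) as t;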
May 9, 2009 at 6:03 pm
So, I'm interested to hear the conclusion of this matter. Peter seems to have picked up on a way to extract more out of the tally table method. Is this enough? And what about the scalability issues that existed with the CLR; have those been solved?
I'd like to hear Flo's final analysis on this or someone else who has been working this problem through.
I've got my popcorn ready.
May 9, 2009 at 7:57 pm
peter (5/5/2009)
Thanks for the information Jeff...care to take a quick look at what your string splitter evolved into as well?
Sorry Peter... I'm out of coffee and I'm not sure of what you speak. What is it that you'd like me to take a look at? Since this thread has gotten quite long, would you mind posting the splitter code you want me to look at?
Or, are you talking about the new Tally generator you just posted?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2009 at 8:06 pm
Sorry, Peter... I got it. You were talking about the code in the previous post.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2009 at 8:32 pm
Peter,
How do I determine what the optimal sector size in your splitter code should be?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2009 at 9:31 pm
Phil Factor (5/1/2009)
Jeff,...but their developers seem unable to fix things so they work like they used to.
Ouch.
They're doing all they can to put in a better solution ASAP. The problem was that it became impossible to improve the browser-side 'prettifier' without a lot of effort (Javascript regex solutions can only do so much) and they are planning on using a vastly better server-side solution. They're working on it urgently at this very moment.
You can always use my Prettifier in the meantime!
Any word on this, Phil? Copying code from the code windows makes some really ugly stuff with extra line breaks, no indentation, and all those nasty things I posted on the test thread for the code windows.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2009 at 10:16 pm
Peter,
Sorry man. I tested against 1000 lines of 800 comma-separated random numbers each in a VARCHAR(8000) environment... your code did the split into a table in a minute and 50 seconds. Many of the other solutions come in at less than 10 seconds for the same thing.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2009 at 10:26 pm
Heh... oh my... :hehe: I believe I've found the secret as to why Cross Apply makes the Tally table solutions run so fast. It forces an unnatural ORDER BY, and the same thing can be accomplished in straight code either by using the correct ORDER BY in ROW_NUMBER or by a real ORDER BY in SQL Server 2000 instances.
Without the correct ORDER BY, either explicitly or by CROSS APPLY, it takes 1:37 to split 1000 rows of 800 highly randomized numbers into a result table.
With the correct ORDER BY, explicitly by either ROW_NUMBER or a real ORDER BY, or implicitly by Cross Apply, the same thing runs in about 10 seconds.
SQL Server makes some really bad execution plan decisions on Tally Table based splitters. In the example above, left to its own devices, it created some hidden RBAR to the tune of about 7.6 million internal rows. Adding one of the ORDER BY methods to force the order by the row number of the row in the source table and THEN by the Tally Table number removed all that hidden RBAR.
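For reference, here's a sketch (not the exact code I tested) of the Cross Apply shape I'm talking about; it assumes a Tally table and a source table with a RowNum key and a CSV column like the test table I build in the next post.
[font="Courier New"]DECLARE @delimiter CHAR(1)
SELECT @delimiter = ','
SELECT s.RowNum,
       ca.Item
  FROM dbo.JBMTest s
 CROSS APPLY
       (
        SELECT SUBSTRING(s.CSV, t.N +1, CHARINDEX(@delimiter, s.CSV, t.N +1) -t.N -1) AS Item
          FROM dbo.Tally t
         WHERE t.N < LEN(s.CSV)
           AND SUBSTRING(s.CSV, t.N, 1) = @delimiter
       ) ca[/font]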
I'll see if I can put a nice little test together so that folks can run it.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2009 at 10:53 pm
Ok... first of all, the following code assumes that you have a Tally table with a count of at least 8000. I'm currently using a Tally table with 1 million rows in it. For those who want to play along, here's the code for that (everything is built in TempDB so no harm comes to any existing code)...
[font="Courier New"]--===== Do this in a nice safe place
USE TempDB
--===== Create and populate the Tally table on the fly
SELECT TOP 1000000
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC[/font]
Here's the code I use to build the test cases. Read the comments in the header for how to use it. Currently, it's set to create 1000 rows of 800 randomized numbers for each row.
[font="Courier New"]--drop table jbmtest
/**************************************************************************************************
Purpose:
Generate highly randomized CSV rows of random numbers for CSV Splitter testing. It's virtually
impossible that this code will create any identical rows. It's not "Moby Dick",
but it does produce CSVs that are more likely to be seen in the real world.
I wrote this so that it's compatible with both SQL Server 2000 and 2005 so virtually anyone
can play along. The only thing that needs a manual change to run in 2k is the object of the TOP
statement in the actual table creation code below.
Notes:
1. The code is setup for VARCHAR(8000). To test for VARCHAR(MAX), do a Search'n'Replace to
change all VARCHAR(8000) to VARCHAR(MAX).
2. If you get the following error, you've selected too many elements for the size numbers and
have most likely exceeded the VARCHAR(8000). You simply need to reduce either the range of
numbers or the number of elements, or both. Of course, that probably won't happen with
VARCHAR(MAX) if you've changed to that.
Msg 8152, Level 16, State 14, Line 32
String or binary data would be truncated.
The statement has been terminated.
3. The settings in the code are currently set to make 800 elements with numbers ranging from
1 to 1,000,000,000 (1 to 10 characters) that pretty much fill up VARCHAR(8000).
4. Look for "Change the values here" in the code below.
-- Jeff Moden
**************************************************************************************************/
--===== Change to a safe place to "play"
USE TempDB
GO
--=================================================================================================
-- Conditionally drop the special objects that we'll need to create
-- (just to simplify reruns)
--=================================================================================================
IF OBJECT_ID('TempDB.dbo.RandomPositiveInt') IS NOT NULL
DROP VIEW dbo.RandomPositiveInt
IF OBJECT_ID('TempDB.dbo.RandomIntCSV') IS NOT NULL
DROP FUNCTION dbo.RandomIntCSV
GO
--=================================================================================================
--===== Create a view to return random positive integers starting at 0.
-- This is necessary because you cannot yet use NEWID() in a UDF even in 2k5.
--=================================================================================================
CREATE VIEW dbo.RandomPositiveInt AS
SELECT ABS(CHECKSUM(NEWID())) AS RandomPositiveInt
GO
--=================================================================================================
--===== Create a function to make predictable ranges of CSV'd numbers
--=================================================================================================
CREATE FUNCTION dbo.RandomIntCSV(@NumberOfElements INT, @Range INT, @Offset INT)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @ReturnCSV VARCHAR(8000)
SELECT @ReturnCSV = ISNULL(@ReturnCSV+',','')
+ CAST(rpi.RandomPositiveInt%@Range+@Offset AS VARCHAR(10))
FROM dbo.Tally t
CROSS JOIN dbo.RandomPositiveInt rpi
WHERE t.N <= @NumberOfElements
RETURN @ReturnCSV
END
GO
--=================================================================================================
-- This section of code creates the test table based on
--=================================================================================================
--===== Declare some control variables that we can change the values of to make the generation
-- of different size and width tables easy
DECLARE @NumRows INT, --Number of rows to create
@NumElements INT, --Per row
@ElementRange INT, --Number of numbers in the range
@ElementOffSet INT, --First number to appear in the range
@AddDelimiters CHAR(1) --If = 'Y', adds leading/trailing delimiters
--===== Change the values here
SELECT @NumRows = 1000,
@NumElements = 800,
@ElementRange = 1000000000,
@ElementOffSet = 1,
@AddDelimiters = 'Y'
--===== Conditionally drop the test table so we can rebuild it on reruns
IF OBJECT_ID('dbo.JBMTest') IS NOT NULL
DROP TABLE dbo.JBMTest
--===== Create and populate the test table on the fly
SELECT TOP (@NumRows) -- You will need to hard code the value here in 2k.
IDENTITY(INT,1,1) AS RowNum,
dbo.RandomIntCSV(@NumElements,@ElementRange,@ElementOffSet) AS CSV
INTO dbo.JBMTest
FROM dbo.Tally
--===== Conditionally add leading and trailing delimiters
IF @AddDelimiters = 'Y'
UPDATE dbo.JBMTest
SET CSV = ','+CSV+','
--===== Add a Primary Key
ALTER TABLE dbo.JBMTest
ADD CONSTRAINT PK_JBMTest_RowNum
PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100
--===== Display the stats for the new test table just for convenience in testing
SELECT COUNT(*) AS MeasuredRowCount,
@NumElements AS DesignedNumElements,
CASE
WHEN @AddDelimiters = 'Y'
THEN MAX(LEN(CSV) - LEN(REPLACE(CSV,',','')) - 1)
ELSE MAX(LEN(CSV) - LEN(REPLACE(CSV,',','')) + 1)
END AS MeasuredNumElements,
@ElementOffSet AS DesignedElementOffset,
@ElementRange AS DesignedElementRange,
@AddDelimiters AS DelimitersAdded,
MIN(LEN(CSV)) AS MinCsvLength,
AVG(LEN(CSV)) AS AvgCsvLength,
MAX(LEN(CSV)) AS MaxCsvLength
FROM dbo.JBMTest
GO
[/font]
And here's the simple bit of experimenting I did with 2 different forms of explicit order by (we've already seen the Cross Apply run, so not included here)...
[font="Courier New"]--===== This takes a whopping big 1:37 on my box
DROP TABLE Result
DECLARE @delimiter CHAR(1)
SELECT @delimiter = ','
SELECT RowNum,
ROW_NUMBER() OVER (ORDER BY t.N) AS ID,
SUBSTRING(s.CSV, t.N +1, CHARINDEX(@delimiter, s.CSV, t.N +1) -t.N -1) AS Item
INTO dbo.Result
FROM dbo.Tally t
CROSS JOIN jbmtest s
WHERE t.N < LEN(s.CSV)
AND SUBSTRING(s.CSV, t.N, 1) = @delimiter
---------------------------------------------------------------------------------------
--===== This only takes 8 seconds
DROP TABLE Result
DECLARE @delimiter CHAR(1)
SELECT @delimiter = ','
SELECT RowNum,
ROW_NUMBER() OVER (ORDER BY s.RowNum,t.N) AS ID,
SUBSTRING(s.CSV, t.N +1, CHARINDEX(@delimiter, s.CSV, t.N +1) -t.N -1) AS Item
INTO dbo.Result
FROM dbo.Tally t
CROSS JOIN jbmtest s
WHERE t.N < LEN(s.CSV)
AND SUBSTRING(s.CSV, t.N, 1) = @delimiter
---------------------------------------------------------------------------------------
--===== This only takes 8 seconds, as well
DROP TABLE Result
DECLARE @delimiter CHAR(1)
SELECT @delimiter = ','
SELECT RowNum,
SUBSTRING(s.CSV, t.N +1, CHARINDEX(@delimiter, s.CSV, t.N +1) -t.N -1) AS Item
INTO dbo.Result
FROM dbo.Tally t
CROSS JOIN jbmtest s
WHERE t.N < LEN(s.CSV)
AND SUBSTRING(s.CSV, t.N, 1) = @delimiter
ORDER BY s.RowNum,t.n[/font]
By the way, I've done some testing against this same bit of highly randomized data and it seems that type of "real life" data really blows some of the other solutions out of the water. I won't post that testing here because I don't want to steal any of the thunder on Flo's article.
Heh... Long live the Tally table. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2009 at 11:55 pm
Hey Jeff,
Some initial comments:
The version that takes forever includes an index spool in the plan. The modifications you made to the other tally approaches remove that operator. The fastest versions run in a little over ten seconds on my laptop.
The CSV length of ~8000 characters precludes using the fastest CLR implementation (it has a limit of NVARCHAR(4000)).
The NVARCHAR(MAX) version of Adam's memory-friendly routine - which has to convert to and from Unicode as well as take the performance hit imposed by the MAX datatype - runs in just under seven seconds.
Long live the CLR solutions! 😀
Paul
May 10, 2009 at 8:20 am
I agree... the CLR solutions, especially that one, look pretty good.
--Jeff Moden
Change is inevitable... Change for the better is not.