May 25, 2011 at 9:01 am
Thanks for the great article and the code that goes with it.
Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)? I modified the provided function to take a VARCHAR(MAX) parameter and it seems to work. To account for the increase in size, my modified version uses an existing Tally table rather than the CTE, even though the tested CTE version outperforms the table-based version. That seemed simpler than adding more layers to the compounding CTEs, and my benchmarking was based on generating 10K rows rather than 100K or 1,000K rows.
Do you see any reason why a VARCHAR(MAX) would not work? I'm still testing it, but your thoughts would be appreciated.
May 25, 2011 at 10:02 am
fahey.jonathan (5/25/2011)
Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)?
Performance. It goes wacko as soon as you start sending in > 8000 characters to the function. (However, in my limited testing, the varchar(max) seems to work fine with <= 8000 characters being sent to it.)
A DelimitedSplitMax version is in the works, but it's not ready for prime time yet.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 25, 2011 at 10:24 am
fahey.jonathan (5/25/2011)
Thanks for the great article and the code that goes with it. Is there a reason why you chose a VARCHAR(8000) instead of a VARCHAR(MAX)? I modified the provided function to take a VARCHAR(MAX) parameter and it seems to work. To account for the increase in size, my modified version uses an existing Tally table rather than the CTE, even though the tested CTE version outperforms the table-based version. That seemed simpler than adding more layers to the compounding CTEs, and my benchmarking was based on generating 10K rows rather than 100K or 1,000K rows.
Do you see any reason why a VARCHAR(MAX) would not work? I'm still testing it, but your thoughts would be appreciated.
Yes. VARCHAR(MAX) doesn't like to be joined to. In the testing that I did, just changing the input to VARCHAR(MAX) and still using it for something less than VARCHAR(8000) immediately caused a 2:1 slowdown. Like Wayne said, a VARCHAR(MAX) splitter is in the works using a similar method but it's just not ready yet. Of course, now that you've said what you've said, some retesting for VARCHAR(MAX) appears to be in order across different machines.
If you can use a CLR like the one in the test attachment to the article, that would be the best thing to do for splitting. Understood if you cannot.
--Jeff Moden
Change is inevitable... Change for the better is not.
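For anyone who wants to see the effect for themselves, here is a minimal timing sketch. It assumes a hypothetical dbo.DelimitedSplitMax that is identical to dbo.DelimitedSplit8K except for a VARCHAR(MAX) parameter; the name and the REPLICATE counts are illustrative only:

SET STATISTICS TIME ON;

-- Same logical workload through the VARCHAR(8000) version (7,996 characters, just under the limit)
DECLARE @s VARCHAR(8000) = REPLICATE('abc,', 1999);
SELECT COUNT(*) FROM dbo.DelimitedSplit8K(@s, ',');

-- Same workload through the hypothetical VARCHAR(MAX) variant
DECLARE @m VARCHAR(MAX) = REPLICATE(CAST('abc,' AS VARCHAR(MAX)), 1999);
SELECT COUNT(*) FROM dbo.DelimitedSplitMax(@m, ',');

SET STATISTICS TIME OFF;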
May 25, 2011 at 11:53 am
I've tested it using 36014 characters (list of email addresses) and the response time is under 1 second:
-----------------------
2011-05-25 12:46:54.640
2011-05-25 12:46:55.253
For the few times that I will be using this, that speed is acceptable. I wonder if using a table-based Tally table makes any difference to the speed given the larger number of records that need to be generated using the CTE version. I did not try using the CTE version scaled up to (MAX) size.
I'm eager to see your revised version when it is complete.
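For reference, the gap between those two timestamps works out to 613 ms, which a quick DATEDIFF confirms (the literals below are just the values printed above):

SELECT DATEDIFF(ms, '2011-05-25 12:46:54.640', '2011-05-25 12:46:55.253') AS ElapsedMs; -- returns 613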
May 25, 2011 at 12:11 pm
Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.
I have no idea whether it can be implemented or how, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.
Mark
May 25, 2011 at 12:35 pm
mark hutchinson (5/25/2011)
@Jeff Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.
I have no idea whether it can be implemented or how, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.
Mark
Good idea, but you just can't split it every 8000 characters. You have to split that varchar(max) into chunks <= 8000 chars, AND split at a delimiter. Any ideas?
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 25, 2011 at 12:56 pm
It depends on the ability to reliably split the MAX string at a delimiter into chunks smaller than 8000 characters. Otherwise, the ends of the parsed string sets would need to be concatenated.
If I had the delimiter positions, I might be able to split first on the greatest delimiter position less than 8000, and then on the greatest delimiter position whose difference from the prior split position was less than 8000.
I have no idea how to write the T-SQL nor even if it can be written. It might perform like a dog. I just don't know.
Mark
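For what it's worth, here is one way Mark's chunking idea might look in T-SQL. This is only a sketch of the chunking step, not the code Jeff and Wayne have in the works; it assumes every element is shorter than 8000 characters, and the variable and table names are made up for illustration:

DECLARE @pString VARCHAR(MAX) = 'a,b,c', -- sample input
        @pDelimiter CHAR(1) = ',';
DECLARE @Chunks TABLE (ChunkNumber INT IDENTITY(1,1), Chunk VARCHAR(8000));
DECLARE @Pos BIGINT = 1,
        @Len BIGINT = DATALENGTH(@pString),
        @Cut BIGINT;
WHILE @Pos <= @Len
BEGIN
    IF @Pos + 7999 >= @Len
        SET @Cut = 8000; -- final chunk: just take whatever is left
    ELSE
        -- Cut at the LAST delimiter inside the next 8000 characters so no element is torn in half
        -- (REVERSE/CHARINDEX finds the last delimiter, even one sitting exactly at position 8000).
        SET @Cut = 8000 - CHARINDEX(@pDelimiter, REVERSE(SUBSTRING(@pString, @Pos, 8000)));
    INSERT INTO @Chunks (Chunk) SELECT SUBSTRING(@pString, @Pos, @Cut);
    SET @Pos = @Pos + @Cut + 1; -- the +1 skips the delimiter between chunks
END;
-- Each chunk can now be fed to the VARCHAR(8000) splitter and the results renumbered.
SELECT ChunkNumber, Chunk FROM @Chunks;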
May 25, 2011 at 1:49 pm
mark hutchinson (5/25/2011)
@Jeff Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.
I have no idea whether it can be implemented or how, but it was an out-of-the-box idea that just struck me. Thought I would share the idea in case I get struck by a bus.
Mark
That's the basis of the code I wrote. It runs very, very fast. Wayne and I have done some back and forth "what ifs"... I just haven't had the time to bring it up to releasable code, yet. By "Releasable Code", I mean code that's been fully documented and tested for functionality, performance, and scalability on at least a half dozen different types of machines across at least 2 revs of SQL. I've just gotta get some time to make the right kind of test harness, test data, and package to send to a very willing and helpful group of external testers. Considering the number of fires I currently have sticks in, it's not going to happen in the next week or two. 😉
Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:
--Jeff Moden
Change is inevitable... Change for the better is not.
May 25, 2011 at 1:58 pm
Jeff Moden (5/25/2011)
Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:
Yes, I know. But, when you're selling software solutions, you can't mandate their use to your customers.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 25, 2011 at 2:02 pm
mark hutchinson (5/25/2011)
@Wayne It depends on the ability to reliably split the MAX string at a delimiter into chunks smaller than 8000 characters. Otherwise, the ends of the parsed string sets would need to be concatenated.
If I had the delimiter positions, I might be able to split first on the greatest delimiter position less than 8000, and then on the greatest delimiter position whose difference from the prior split position was less than 8000.
I have no idea how to write the T-SQL nor even if it can be written. It might perform like a dog. I just don't know.
Mark
And what happens if the delimiter is at position 8000?
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 25, 2011 at 8:07 pm
WayneS (5/25/2011)
Jeff Moden (5/25/2011)
Besides... everyone knows to use a CLR for these types of things, right? :-P:-P:-P:hehe:
Yes, I know. But, when you're selling software solutions, you can't mandate their use to your customers.
Heh... absolutely true. That's why I had all the smiley faces on my last post. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
June 21, 2011 at 1:32 pm
How about this to get rid of the CHARINDEX() string operation:
CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT 0 UNION ALL
SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT t.N+1
FROM cteTally t
WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
),
cteStartEnd(Nb, Ne) AS ( -- Associate starting and ending positions
SELECT N1, ISNULL((SELECT MIN(N1) FROM cteStart ce WHERE ce.N1 > cs.N1), 8002) -- 8002, not 8001, so the final element of a full 8000-character string keeps its last character
FROM cteStart cs
)
--===== Do the actual split.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY se.Nb),
Item = SUBSTRING(@pString,se.Nb,se.Ne - se.Nb - 1)
FROM cteStartEnd se
;
...how does that affect performance?
edited to add: Actually, performance goes down a bit because the correlated subquery that builds cteStartEnd has no index to work against. This version does exhibit performance gains in my very modest testing:
CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
-- A multi-statement TVF this time: an inline TVF cannot contain DECLARE/INSERT statements.
RETURNS @Result TABLE (ItemNumber BIGINT, Item VARCHAR(8000))
WITH SCHEMABINDING AS
BEGIN
DECLARE @t_delim TABLE(offset smallint PRIMARY KEY); -- for the index
-- Get the offsets into the in-memory indexed table
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT 0 UNION ALL
SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
INSERT INTO @t_delim(offset) --==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT t.N+1
FROM cteTally t
WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
;
WITH
cteStartEnd(Nb, Ne) AS ( -- Associate starting and ending positions
SELECT cs.offset, ISNULL((SELECT MIN(ce.offset) FROM @t_delim ce WHERE ce.offset > cs.offset), 8002) -- 8002 for the same end-of-string reason as above
FROM @t_delim cs
)
--===== Do the actual split.
INSERT INTO @Result (ItemNumber, Item)
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY se.Nb),
Item = SUBSTRING(@pString,se.Nb,se.Ne - se.Nb - 1)
FROM cteStartEnd se
;
RETURN;
END;
(Note: I'm actually implementing my version as a stored procedure, so the exact code above hasn't been tested by me...)
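For anyone following along, here is a quick smoke test of either version above (the input values are arbitrary; the expected output follows from the start/end logic):

SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('ab,cd,,ef', ',');
-- Expected: (1,'ab'), (2,'cd'), (3,''), (4,'ef')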
August 8, 2011 at 2:11 pm
mark hutchinson (5/25/2011)
@Jeff Here's an idea to kick around for varchar(max) data...use a tally table (or cte equivalent) to split the very large string into strings less than 8000 characters and then rejoin those parsed strings.
Mark, we used Jeff's function and expanded it out to varchar(max) with a fair amount of linearity, though I'm looking forward to seeing Jeff's version with MAX handled.
String Length : Elements : CPU Time in ms
74 : 24 : 2
148 : 48 : 2
296 : 96 : 2
592 : 192 : 3
1184 : 384 : 32
2368 : 768 : 7
4736 : 1536 : 13
9472 : 3072 : 56
18944 : 6144 : 49
37888 : 12288 : 96
75776 : 24576 : 193
151552 : 49152 : 385
303104 : 98304 : 5006
606208 : 196608 : 10085
1212416 : 393216 : 19989
UDF
/*
See the following sometime for improvements to this: http://www.sqlservercentral.com/articles/Tally+Table/72993/
Jeff Moden 2011-05-02
*/
CREATE FUNCTION udf_StrList2Table (
@List NVARCHAR(MAX) = N'',
@Delimiter NCHAR(1) = N','
)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
/*
The following inline delimited splitter was written by Jeff Moden.
http://www.sqlservercentral.com/articles/Tally+Table/72993/
*/
WITH E1(N) AS
( --=== Create Ten 1's
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 --10
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000
E8(N) AS
(
--==== Limits the number of rows right up front for both a performance gain and prevention of accidental "overruns"
SELECT TOP (DATALENGTH(ISNULL(@List,1))) 1 FROM E4 a, E4 b -- 100,000,000
),
cteTally(N) AS
( --==== This provides the "zero base"
SELECT 0 UNION ALL
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E8
),
cteStart(N1) AS
( --==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT t.N+1 FROM cteTally t WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0)
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT
ROW_NUMBER() OVER(ORDER BY s.N1) as ListPos,
SUBSTRING(@List,s.N1,ISNULL(NULLIF(CHARINDEX(@Delimiter,@List,s.N1),0)-s.N1,DATALENGTH(ISNULL(@List,1)))) as Value
FROM cteStart s
GO
Testing
DECLARE
@List NVARCHAR(MAX),
@Delimiter NCHAR(1) = N',',
@ListCnt int,
@StartTime datetime2,
@EndTime datetime2,
@ReplicationSeed int = 2,
@ReplicationMultiplier int = 2
DECLARE @Results TABLE (ListLength int, ElementCount int, MSTime int)
WHILE @ReplicationSeed <= 50000
BEGIN
SELECT @List = REPLICATE(CAST(N'ab,a,aeae,3,3,a3,23,4,asa,,434,q4,345' as nvarchar(max)), @ReplicationSeed)
SELECT @StartTime = SYSDATETIME()
SELECT @ListCnt = COUNT(*) FROM udf_StrList2Table(@List, @Delimiter)
SELECT @EndTime = SYSDATETIME()
INSERT INTO @Results (ListLength, ElementCount, MSTime)
SELECT LEN(@List), LEN(@List) - LEN(REPLACE(@List, ',', '')), DATEDIFF(MS, @StartTime, @EndTime)
SELECT
@ReplicationSeed = @ReplicationSeed * @ReplicationMultiplier
END
SELECT * FROM @Results
/* Anything is possible but is it worth it? */
August 12, 2011 at 3:01 am
Gatekeeper,
Be careful testing sample data this way, where you have the constants in the same batch as the code that you are benchmarking. The REPLICATE function does not "hide" the final string to be benchmarked with. There are reasons Jeff's benchmarking involves a table of random strings; one of them is to avoid tainted results by giving the optimizer chances to skip logic it would normally have to perform at runtime.
I am not saying your test code necessarily falls victim to this, but make sure it does not!
And thank you for your contribution. I will have to test how that E8 addition affects performance on input that never needs those high-end numbers, and if it doesn't affect it, I am going to incorporate it into the fastest version thus far and use it in my own projects.
Cheers!
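To make the caveat concrete, here is a minimal sketch of the table-driven approach (the temp table and column names are made up, and it assumes udf_StrList2Table landed in dbo; the variable-assignment trick to keep display overhead out of the timings is borrowed from Jeff's test harnesses):

-- Build test rows whose contents the optimizer cannot know at compile time
IF OBJECT_ID('tempdb..#TestStrings') IS NOT NULL DROP TABLE #TestStrings;
SELECT TOP (1000)
       CsvString = CAST(REPLICATE('x,', ABS(CHECKSUM(NEWID())) % 500 + 1) AS VARCHAR(8000))
  INTO #TestStrings
  FROM sys.all_columns ac1 CROSS JOIN sys.all_columns ac2;

-- Time the split across all rows, dumping output into a variable instead of the client
DECLARE @Bitbucket NVARCHAR(4000);
SET STATISTICS TIME ON;
SELECT @Bitbucket = split.Value
  FROM #TestStrings ts
 CROSS APPLY dbo.udf_StrList2Table(ts.CsvString, N',') split;
SET STATISTICS TIME OFF;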