April 14, 2014 at 11:27 pm
Jeff Moden (4/14/2014)
I agree. A CLR would be the best way and you made it easy enough for everyone to use. Thanks. The "Split" function in the original article was much faster than any of the T-SQL methods.
I don't know for sure that the "max" version with multi-character delimiters I posted will beat Miller's method, but it's there if (s)he wants to try it.
The only thing I don't like about the one that Adam posted is that it ignores leading/trailing and consecutive delimiters. For the kind of work I do, those are important especially for "ragged right" splits.
I think we had this conversation a long time ago, which led me to revise it. The version I presented works as you would wish, I think.
The link to Adam's article is broken. Here it is again...
Works for me, and I haven't edited it? No worries either way.
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 14, 2014 at 11:33 pm
FYI: Ran the function through the test script with the same result aggregation as I used for the article.
😎
SplitterName TOTAL_D AVG_D
---------------------------- ------------ -----------
DelimitedSplit8K_T1_LEAD_V2 18302.00000 373.510204
DelimitedSplit8K_T1_LEAD 18587.00000 379.326530
BetterSplit 20307.00000 414.428571
April 15, 2014 at 4:13 am
Paul White (4/14/2014)
At the risk of making a predictable contribution at this point, based on Miller's stated requirements for long Unicode strings and multi-character delimiters:
...
This version uses .NET to supply the sub-strings, an approach that, so far, I have seen all splitters take.
A while back I wrote one that just delivers the locations of the sub-strings back to SQL Server. The idea is that the .NET-to-SQL interface is leaner, and that it allows SQL Server to be smarter in its memory usage, for example by working with partial strings internally instead of fresh copies, which would consume hardly any additional memory. The resulting datatype and collation would also be in line with the original input, something no other version I have seen does. It also means no conversions are needed from the static Unicode result that .NET delivers when parsing varchar input (saving both memory and time).
I never got to run benchmark tests on it, as it is part of a larger, more complete library which I haven't fully designed and finished. Too bad I ran out of time and had to focus on other things, but it will be done at some point. I think the standard way of splitting has room for improvement, efficiency-wise.
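The position-only idea is easy to prototype outside SQL Server. Here is a minimal Python sketch (a hypothetical illustration, not the library the poster describes) that yields (start, length) pairs instead of materialising the substrings:

```python
def split_positions(s, delim):
    """Yield 0-based (start, length) pairs for each element of s,
    without creating the substrings themselves."""
    start = 0
    dlen = len(delim)
    while True:
        pos = s.find(delim, start)
        if pos == -1:
            # Final element runs to the end of the string.
            yield (start, len(s) - start)
            return
        yield (start, pos - start)
        start = pos + dlen
```

A consumer slices lazily with `s[start:start + length]`, so no new string exists until it is actually needed; this mirrors the claimed benefit that the original datatype and collation carry through unchanged.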
April 15, 2014 at 8:17 pm
It is unfortunate that there are some seemingly inexplicable limitations on what you can do with T-SQL.
For example, I wondered what would happen if we were able to remove all usage of SUBSTRING and simply convert the delimited string into scripted data with a single execution of the REPLACE function, then execute the dynamic SQL to generate the resultant table. This works great and is incredibly fast, but for several reasons it will not work as a function, only as a stored procedure.
Below is a procedure that does the split with no tally and no substring, no charindex.
CREATE PROCEDURE [dbo].[TallyNo]
    (@L NVARCHAR(MAX), @D NVARCHAR(100))
AS
DECLARE @E NVARCHAR(MAX); -- NVARCHAR to match the NVARCHAR(MAX) input
SET @E = 'SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY (SELECT 1)), Item '
       + 'FROM (VALUES(''' + REPLACE(@L, @D, '''),(''') + ''')) V (Item)';
EXEC (@E);
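The trick is easiest to see by building the same statement outside T-SQL. A Python sketch (hypothetical helper name) of what the single REPLACE produces; note that this version also escapes embedded quotes first, which the procedure above does not do, so the original is open to SQL injection:

```python
def build_split_sql(delimited, delim):
    # One pass: escape single quotes, then turn every delimiter into
    # "'),('" so the whole string becomes a VALUES list that the
    # dynamic SQL can select from.
    body = delimited.replace("'", "''").replace(delim, "'),('")
    return ("SELECT ItemNumber=ROW_NUMBER()OVER(ORDER BY(SELECT 1)),Item "
            f"FROM(VALUES('{body}'))V(Item)")
```

One caveat worth noting: ROW_NUMBER() over a constant ORDER BY against a VALUES constructor happens to number rows in list order in practice, but that ordering is not formally guaranteed.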
April 26, 2014 at 12:19 pm
Paul White (4/14/2014)
Jeff Moden (4/14/2014)
I agree. A CLR would be the best way and you made it easy enough for everyone to use. Thanks. The "Split" function in the original article was much faster than any of the T-SQL methods.
I don't know for sure that the "max" version with multi-character delimiters I posted will beat Miller's method, but it's there if (s)he wants to try it.
The only thing I don't like about the one that Adam posted is that it ignores leading/trailing and consecutive delimiters. For the kind of work I do, those are important especially for "ragged right" splits.
I think we had this conversation a long time ago, which led me to revise it. The version I presented works as you would wish, I think.
The link to Adam's article is broken. Here it is again...
Works for me, and I haven't edited it? No worries either way.
Apologies for the late reply.
We did indeed, and thank you very much, Paul, not only for tweaking the CLR but for offering it up to begin with and for making it easy to instantiate. I don't even know how to spell C# and couldn't have done it on my own. You really came through on that and made an essential comparison possible. Thank you very much, again!
--Jeff Moden
Change is inevitable... Change for the better is not.
April 29, 2014 at 5:54 pm
SQL 2014 is kinda cool...
I just tried out using a memory optimized table for the Tally / Numbers table in DelimitedSplit8K and two things stand out.
1. It seems to be much quicker than SQL 2012 generally.
2. The memory optimised Tally splitter wipes the floor with the stacked CTE / IBG-style computed tally, apparently.
For those interested, attached results from my desktop PC.
DelimitedSplit8K_M_O is this:
CREATE FUNCTION [dbo].[DelimitedSplit8K_M_O]
--===== Define I/O parameters
(@pString VARCHAR(8000) , @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== SQL 2014 Memory Optimised "Tally Table" produces values from 1 up to 1,000,000...
-- enough to cover VARCHAR(8000)
WITH
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM dbo.moTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter AND N<=(ISNULL(DATALENGTH(@pString),0))
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter ,@pString ,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
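For readers following the logic without a SQL 2014 instance to hand, the set-based approach above can be sketched in Python (a rough analogue, not a translation; 1-based positions to mirror T-SQL):

```python
def tally_split(s, delim):
    """Analogue of the tally-table splitter: start positions are 1 plus the
    position after every delimiter (cteStart); the element length comes from
    the next delimiter, or the end of the string (cteLen)."""
    starts = [1] + [i + 2 for i, ch in enumerate(s) if ch == delim]
    items = []
    for n, start in enumerate(starts, 1):
        nxt = s.find(delim, start - 1)          # CHARINDEX from the start position
        length = (nxt + 1 - start) if nxt != -1 else len(s) - start + 1
        items.append((n, s[start - 1:start - 1 + length]))
    return items
```

Like DelimitedSplit8K, this preserves empty elements produced by leading, trailing, and consecutive delimiters.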
The MO Tally is this:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[moTally]
(
[N] [int] NOT NULL,
CONSTRAINT [pk_moTally] PRIMARY KEY NONCLUSTERED
(
[N] ASC
)
)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA )
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
April 29, 2014 at 8:14 pm
You could also play around with a native-compiled procedure, something like:
-- Memory optimized table type
CREATE TYPE dbo.SplitReturn AS TABLE
(
ItemNumber integer IDENTITY PRIMARY KEY NONCLUSTERED,
Item varchar(8000) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);
GO
CREATE PROCEDURE dbo.Split
@pString varchar(8000),
@pDelimiter char(1)
WITH
NATIVE_COMPILATION,
SCHEMABINDING,
EXECUTE AS OWNER
AS
BEGIN ATOMIC
WITH
(
TRANSACTION ISOLATION LEVEL = SNAPSHOT,
LANGUAGE = 'us_english'
)
DECLARE @return AS dbo.SplitReturn;
DECLARE @pos AS integer = 1;
DECLARE @Last AS integer = 1;
WHILE @pos <= LEN(@pString)
BEGIN
IF SUBSTRING(@pString, @pos, 1) = @pDelimiter
BEGIN
IF @pos = 1
INSERT @return (Item)
VALUES (SUBSTRING(@pString, @Last, @pos - @Last + 1));
ELSE
INSERT @return (Item)
VALUES (SUBSTRING(@pString, @Last, @pos - @Last));
SET @Last = @pos + 1;
END;
SET @pos += 1;
END;
INSERT @return (Item)
VALUES (SUBSTRING(@pString, @Last, @pos - @Last + 1));
SELECT
R.ItemNumber,
R.Item
FROM @return AS R
ORDER BY
R.ItemNumber;
END;
GO
-- Example usage
EXECUTE dbo.Split
@pString = 'A**B*C*',
@pDelimiter = '*';
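The loop's behaviour can also be sketched procedurally in Python. This simplified analogue emits an empty first element for a leading delimiter (the posted procedure's @pos = 1 branch handles that case slightly differently):

```python
def scan_split(s, delim):
    """Procedural analogue of the natively compiled loop: walk the string
    once, emitting an element at each delimiter and one final element."""
    items, last = [], 0
    for pos, ch in enumerate(s):
        if ch == delim:
            items.append(s[last:pos])
            last = pos + 1
    items.append(s[last:])      # final element after the last delimiter
    return items
```

As with the procedure, one final element is always emitted after the last delimiter, so trailing delimiters yield a trailing empty item.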
Not very useful given that it's limited to splitting one string per procedure call, I admit 🙂
It's a shame functions don't support native compilation, and crazy that basics like CHARINDEX aren't supported for native, but maybe we'll see these things in a future version.
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 29, 2014 at 9:08 pm
I was really looking forward to that type of functionality. It IS a step in the right direction, though. Thanks, Paul.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 30, 2014 at 4:34 am
FYI: ran the test on the CLR Split, mm's memory optimized version and the lead window function version. Hardly any benefit of using the window function, 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.
😎
SplitterName TOTAL_D AVG_D
-------------------------- -------- ---------
DelimitedSplit8K_M_O_LEAD 2.96400 0.060489
DelimitedSplit8K_M_O 2.99700 0.061163
Split 4.06600 0.082979
April 30, 2014 at 5:26 am
Eirikur Eiriksson (4/30/2014)
FYI: ran the test on the CLR Split, mm's memory optimized version and the lead window function version. Hardly any benefit of using the window function, 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.
The memory optimized version of the tally splitter is certainly an improvement, and certainly very competitive with the CLR splitter especially on small strings, where it can be faster. My tests show it is still 50% slower for long strings with many elements.
Bear in mind that the memory optimized improvement requires SQL Server 2014 Enterprise Edition. Also, as currently written, it does not handle Unicode, and is limited to 8000 non-Unicode characters.
None of these limitations apply to the CLR splitter; it works on all versions and editions of SQL Server from 2005 onward, and handles Unicode and max-length strings all in one.
😎
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 30, 2014 at 5:55 am
Paul White (4/30/2014)
Eirikur Eiriksson (4/30/2014)
FYI: ran the test on the CLR Split, mm's memory optimized version and the lead window function version. Hardly any benefit of using the window function, 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.
The memory optimized version of the tally splitter is certainly an improvement, and certainly very competitive with the CLR splitter especially on small strings, where it can be faster. My tests show it is still 50% slower for long strings with many elements.
Bear in mind that the memory optimized improvement requires SQL Server 2014 Enterprise Edition. Also, as currently written, it does not handle Unicode, and is limited to 8000 non-Unicode characters.
None of these limitations apply to the CLR splitter; it works on all versions and editions of SQL Server from 2005 onward, and handles Unicode and max-length strings all in one.
😎
No doubt about the CLR splitter being generally better; my point is that when using the window function on the normal DelimitedSplit8K splitter, it is up to 50% faster than the CHARINDEX one. With the memory optimized tally table, the difference only starts to show on longer strings; in fact, the window function version was faster than the CLR all the way through (see the attachment).
😎
April 30, 2014 at 6:39 am
Eirikur Eiriksson (4/30/2014)
No doubt about the CLR splitter being generally better, my point is that when using the window function on the normal Delimiter8K splitter, it is up to 50% faster than the Charindex one.
Forgive me, but I didn't get any sense of that point from the post I replied to. Never mind.
With the memory optimized tally table, the difference only starts to show on longer strings, in fact the window function version was faster than the CLR all the way through (see the attachment).
The LEAD version is indeed impressive (albeit with limitations).
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 30, 2014 at 7:07 am
Eirikur Eiriksson (4/30/2014)
FYI: ran the test on the CLR Split, mm's memory optimized version and the lead window function version. Hardly any benefit of using the window function, 0.001 on average and 0.033 in total runtime. Still 25% faster than the CLR.
😎
SplitterName TOTAL_D AVG_D
-------------------------- -------- ---------
DelimitedSplit8K_M_O_LEAD 2.96400 0.060489
DelimitedSplit8K_M_O 2.99700 0.061163
Split 4.06600 0.082979
Well, that saved me the effort - that was going to be my next test 😀
Paul - absolutely agree that the CLR implementation is the nuts; what we are seeking is incremental improvements to the non-CLR version (despite its limitations in this form) for those systems that will not or cannot use CLR.
MM
May 9, 2014 at 1:59 pm
Have you tried switching the memory optimized tally table durability to SCHEMA_ONLY? I read that a table performs best in this mode, and it's pretty easy to add a startup script for a database to populate the tally table with rows.
May 9, 2014 at 6:06 pm
mburbea (5/9/2014)
Have you tried switching the memory optimized tally table durability to just schema? I read that a table will perform best in this mode and its pretty easy to add a start up script for a db to populate the tally table with rows.
Yes, the performance was almost identical to the millisecond, which seems unlikely but that's what happened.
MM
Viewing 15 posts - 586 through 600 (of 990 total)