March 15, 2012 at 7:12 am
Jeff Moden (3/15/2012)
peter-757102 (3/15/2012)
Jeff, did you receive any news regarding the performance characteristics of the code (or variations) on SQL Server 2012? And on the bright side, in a few months I will be working in a .NET-driven development environment, so it will be time for me to do some experiments with using it in SQL Server :).
Haven't tried it there yet, but I don't see why any of this would change.
So far as the .NET-driven development environment goes, take a look at the CLR splitter that Paul White wrote for the comparison testing in this article.
It certainly could be quite different if any of the following internals has been modified in SQL Server 2012 versus 2008:
* string handling
* memory management
* query optimizer changes (CTE/operator stuff)
Also, the difference in performance between varchar(max) and varchar types might be gone, as there is no good reason for a large difference there from a purely technical perspective. As for .NET coding, I will certainly look at Paul White's splitter, to begin with as a good source to learn from. I haven't done much .NET, nor anything recent.
By the way, did I miss your long string splitter version somehow, or was that work never finished?
March 15, 2012 at 7:31 am
Actually, there is a good reason for the MAX datatypes to have a performance problem with such splitters. They're simply handled differently, even when they're "in row", and because of that, they don't cotton to being joined to.
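If you want to see that for yourself, here's a minimal sketch (assuming a standard dbo.Tally table with at least 8,000 rows, like the one built later in this thread); the same tally join over identical data typically runs noticeably slower when the string is typed as VARCHAR(MAX):
DECLARE @s8k VARCHAR(8000) = REPLICATE('x,', 4000);
DECLARE @smax VARCHAR(MAX) = @s8k; --Same bytes, different datatype
SET STATISTICS TIME ON;
SELECT COUNT(*) FROM dbo.Tally t --Join against the non-MAX type
 WHERE t.N <= LEN(@s8k) AND SUBSTRING(@s8k, t.N, 1) = ',';
SELECT COUNT(*) FROM dbo.Tally t --Same join against the MAX type
 WHERE t.N <= LEN(@smax) AND SUBSTRING(@smax, t.N, 1) = ',';
SET STATISTICS TIME OFF;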
I've not gotten around to writing an article for the "long splitter" so, no, you didn't miss anything.
And I suspect that no matter what happens to the internals, the cteTally method in the article will continue to be faster than the other types of splitters, simply because all the methods would be affected by the things you suggest could change. Of course, I could be wrong, as well. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
March 15, 2012 at 7:56 am
I am not debating whether it will be the faster method or not; its simple, well-thought-out construction will make sure it is. But it could potentially scale much better to larger strings if certain internals have been improved.
March 15, 2012 at 5:21 pm
That would certainly be nice.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 15, 2012 at 7:43 pm
I am running your splitter tests on a SQL Server running in a VM. It's taking forever!
I let it go for 17 hours earlier this week, and finally tried canceling the query. But it wouldn't stop. So I killed the process. It was in KILLED/ROLLBACK state for 2 more days and, furthermore, was blocking itself (!). I had to restart the server.
I tried again, starting the splitter tests 5 hours ago. What do you know--it's blocking itself again:
spid 56 blocked by spid 56, waittime 17840231, waittype LCK_M_X, waitresource KEY: 2:562949956108288, cpu 20858, physical_io 1585, open tran 2, status suspended, cmd SELECT INTO. DBCC INPUTBUFFER reveals the last statement is the one starting:
SELECT Info = 'Please make sure you''re in the grid mode when running these tests.'
Any thoughts?
Given the crazy waittime, and low cpu and io, it seems it got blocked really early and then sat there forever. I bet if I kill it nothing will happen, and if I wait till tomorrow nothing will happen. Sigh.
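For anyone who hits the same thing, here's a diagnostic sketch (nothing more than that): when a spid shows as blocked by itself, it's usually individual tasks of the same session waiting on each other, and sys.dm_os_waiting_tasks breaks the waits down per task:
SELECT wt.session_id, wt.exec_context_id, wt.wait_type,
       wt.wait_duration_ms, wt.blocking_session_id,
       wt.blocking_exec_context_id, wt.resource_description
FROM sys.dm_os_waiting_tasks wt
WHERE wt.session_id = 56; --The spid reported as blocking itself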
March 16, 2012 at 7:14 am
Apologies... I have no clue what the problem could be. We have a VM environment at work and it screams.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2012 at 9:06 am
kinda like those Oldsmobile ads -- "this isn't your father's testing environment."
maybe best stated by Obi-Wan Kenobi, "These aren't the test results you're looking for."
March 16, 2012 at 9:22 am
These types of problems with VMs are typically not code related. A lot of folks set up VMs to move resources to and from SQL Server machines and, when such moves occur, some really bad things can happen. I don't know if that's Erik's case, but I thought it was worth mentioning.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 19, 2012 at 11:40 am
Second try: failed. 3 days and 20-some-odd hours later, exact same results as before, only with wait time 334,605,998. 🙂
I think I will not be running this again on my VM SQL instance. That's okay, there are plenty of other places I can try it...
The entire SQL Server could not be used for any other purpose for 2 days. I have something logging to a table every minute and it could not operate.
March 22, 2012 at 8:59 pm
ErikEckhardt (3/19/2012)
Second try: failed. 3 days and 20-some-odd hours later, exact same results as before, only with wait time 334,605,998. 🙂 I think I will not be running this again on my VM SQL instance. That's okay, there are plenty of other places I can try it...
The entire SQL Server could not be used for any other purpose for 2 days. I have something logging to a table every minute and it could not operate.
That's really odd. No problems here with the script. Were you having high CPU or I/O?
/* Anything is possible but is it worth it? */
May 10, 2012 at 1:57 pm
Great article, Jeff. I used the splitter today to break apart filenames - amazing performance for our test runs. Can't wait to let it loose across the live data on Monday - 120,000 filenames. Thanks!
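Something along these lines, for the curious (the table and column names here are made up for illustration; dbo.DelimitedSplit8K is the splitter from the article):
SELECT f.FileName, s.ItemNumber, s.ItemValue
FROM dbo.FileList f --Hypothetical table of filenames
CROSS APPLY dbo.DelimitedSplit8K(f.FileName, '\') s; --One row per path segment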
"Be brave. Take risks. Nothing can substitute experience."
May 10, 2012 at 5:41 pm
Very cool. Thanks for the success story, Robin. Glad it helped.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 22, 2012 at 12:46 am
Hi,
I have used and referred to this splitter many times, but recently I needed to implement it in a real-time environment, so I had to do some more research on it. I started with some minor improvements and asked for suggestions on the sub-site. Many thanks to Oleg, who provided another pointer with a very fruitful discussion and an interesting 1133-element, 10000-row performance test (I am not going into the details and performance comparisons made). I tweaked the physical structure of the tally table a little bit according to the requirement by adding a pre-calculated column N1 that holds the value N+1, but there was only a minor gain, as it just reduces some IO. The most important change was converting the tally table's column data type from BIGINT to INT. I have come up with 2 versions so far.
Now, most importantly, the performance test. I followed the article's performance test and saw a major improvement; it beats all the versions. The margin was more significant as the number of elements increased or the size of the string grew. When I used this version of the splitter for 1133 elements and 10000 rows, the test completed in approximately 19 seconds, which seems very promising as these are still iTVFs.
USE [tempdb]
GO
SET NOCOUNT ON;
GO
IF OBJECT_ID('dbo.CsvTest') IS NOT NULL
DROP TABLE CsvTest
GO
DECLARE @NumberOfElements INT,
@NumberOfRows INT
/*======== PARAMETER VALUES ==============================*/
SELECT @NumberOfElements= 1133
, @NumberOfRows= 10000
/*========================================================*/
SELECT TOP (@NumberOfRows) --Controls the number of rows in the test table
ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
(
SELECT CAST(STUFF( --=== STUFF gets rid of the leading comma
( --=== This builds CSV row with a leading comma
SELECT TOP (@NumberOfElements) --Controls the number of CSV elements in each row
','+CAST(ABS(CHECKSUM(NEWID()))%100000 AS VARCHAR(10))
FROM sys.All_Columns t3 --Classic cross join pseudo-cursor
CROSS JOIN sys.All_Columns t4 --can produce row sets of up to 121 million.
WHERE t1.Object_ID <> t3.Object_ID --Without this line, all rows would be the same
FOR XML PATH('')
)
,1,1,'') AS VARCHAR(8000))
) AS Csv
INTO CsvTest
FROM sys.All_Columns t1 --Classic cross join pseudo-cursor
CROSS JOIN sys.All_Columns t2 --can produce row sets of up to 16 million rows
GO
ALTER TABLE CsvTest
ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100;
GO
IF OBJECT_ID('dbo.Tally','U') IS NOT NULL
DROP TABLE dbo.Tally;
--===== Create and populate the Tally table on the fly.
SELECT TOP 11000
IDENTITY(INT,1,1) AS N, --PREVIOUSLY WAS BIGINT
-- The ISNULL function makes the column NOT NULL
ISNULL(CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT) + 1, 0) AS N1
INTO dbo.Tally
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
GO
--===== Add a CLUSTERED Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
;
--===== Update Stats
UPDATE STATISTICS dbo.Tally WITH FULLSCAN
;
GO
--=====================================================================================================================
-- Conditionally drop and recreate the Split8KTally_With_Column_N1 splitter function.
--=====================================================================================================================
IF OBJECT_ID(N'dbo.Split8KTally_With_Column_N1', N'IF') IS NOT NULL
DROP FUNCTION dbo.Split8KTally_With_Column_N1;
GO
--===== Tally Table (Split8KTally_With_Column_N1 ITVF) ===============================================================================
/*===== THIS FUNCTION USES COLUMN "N1" OF TALLY TABLE HAVING PRE-CALCULATED VALUES FOR N+1 ===========*/
CREATE FUNCTION dbo.Split8KTally_With_Column_N1 (
@pString VARCHAR(8000), @pDelimiter VARCHAR(1)
)
RETURNS TABLE
AS
RETURN
/*======= FIND THE FIRST DELIMITED ITEM EXPLICITLY, AS WE SKIPPED THE FIRST ITEM TO AVOID "OR" CONDITION IN THE WHERE CLAUSE=====*/
SELECT ItemNumber = CAST(0 AS BIGINT) --STARTING WITH ZERO FOR SORT AND TO AVOID ROW_NUMBER + 1 LATER
, ItemValue = SUBSTRING(@pString, 1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,
@pString COLLATE Latin1_General_BIN), 0)
- 1, 8000 ))
UNION ALL -- CONCATENATE THE OUTPUTS
/*======= FIND REST OF THE DELIMITED ITEMS USING TALLY TABLE ======================================*/
SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY [N] )
, ItemValue = SUBSTRING(@pString, [N1],
ISNULL(NULLIF(CHARINDEX(@pDelimiter ,
@pString COLLATE Latin1_General_BIN, [N1]), 0)
- [N1], 8000) )
FROM [dbo].[Tally] WITH ( NOLOCK ) --NOLOCK HINT IS NOT NECESSARY
WHERE [N] BETWEEN 1 AND LEN(@pString)
AND SUBSTRING(@pString, [N], 1) = @pDelimiter
GO
--=====================================================================================================================
-- Conditionally drop and recreate the Split8KTally_With_N_PLUS_1 splitter function.
--=====================================================================================================================
IF OBJECT_ID('dbo.Split8KTally_With_N_PLUS_1','IF') IS NOT NULL
DROP FUNCTION [dbo].Split8KTally_With_N_PLUS_1
GO
--===== Tally Table (Split8KTally_With_N_PLUS_1 ITVF) ===============================================================================
/*===== THIS FUNCTION USES RUNTIME CALCULATION "N + 1" INSTEAD OF USING THE PRE-CALCULATED COLUMN "N1" OF THE TALLY TABLE===========*/
CREATE FUNCTION dbo.Split8KTally_With_N_PLUS_1 (
@pString VARCHAR(8000), @pDelimiter VARCHAR(1)
)
RETURNS TABLE
AS
RETURN
/*======= FIND THE FIRST DELIMITED ITEM EXPLICITLY, AS WE SKIPPED THE FIRST ITEM TO AVOID "OR" CONDITION IN THE WHERE CLAUSE=====*/
SELECT ItemNumber = CAST(0 AS BIGINT) --STARTING WITH ZERO FOR SORT AND TO AVOID ROW_NUMBER + 1 LATER
, ItemValue = SUBSTRING(@pString, 1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,
@pString COLLATE Latin1_General_BIN), 0)
- 1, 8000 ))
UNION ALL -- CONCATENATE THE OUTPUTS
/*======= FIND REST OF THE DELIMITED ITEMS USING TALLY TABLE ======================================*/
SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY [N] )
, ItemValue = SUBSTRING(@pString, (N+1),
ISNULL(NULLIF(CHARINDEX(@pDelimiter ,
@pString COLLATE Latin1_General_BIN, (N+1)), 0)
- (N+1), 8000) )
FROM [dbo].[Tally] WITH ( NOLOCK )--NOLOCK HINT IS NOT NECESSARY
WHERE [N] BETWEEN 1 AND LEN(@pString)
AND SUBSTRING(@pString, [N], 1) = @pDelimiter
GO
--Tally Test
PRINT '/*====== dbo.Split8KTally_With_N_PLUS_1 =================*/'
DBCC FREEPROCCACHE WITH NO_INFOMSGS
SET STATISTICS TIME ON
DECLARE @ItemNumber BIGINT
, @Item VARCHAR(8000) ;
SELECT @ItemNumber = V.ItemNumber
, @Item = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split8KTally_With_N_PLUS_1(D.Csv, ',') V
SET STATISTICS TIME OFF
PRINT '/*========================================================*/'
PRINT CHAR(10) + CHAR(13)
GO
PRINT '/*====== dbo.Split8KTally_With_Column_N1 =================*/'
DBCC FREEPROCCACHE WITH NO_INFOMSGS
SET STATISTICS TIME ON
DECLARE @ItemNumber BIGINT
, @Item VARCHAR(8000) ;
SELECT @ItemNumber = V.ItemNumber
, @Item = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split8KTally_With_Column_N1(D.Csv, ',') V
SET STATISTICS TIME OFF
PRINT '/*========================================================*/'
PRINT CHAR(10) + CHAR(13)
GO
Last but not least, thanks to all who contribute voluntarily to help the community; however, I cannot thank one of my favorite mentors, Jeff Moden, enough.
P.S. The naming conventions are not according to the standards, as the purpose is to convey the core of each function 🙂 Moreover, we all know everyone's mileage may differ.
May 22, 2012 at 3:30 am
EDIT:
Testing for the VARCHAR(MAX) version was done on SQL 2005 only.
/EDIT
Since we may need the VARCHAR(MAX) version in the future as well, I did some testing on the other solutions, as the tally table solution does not behave well with wider strings. Two of them performed really well:
1. Inline recursive CTE solution (Split_RCTE, tweaked a little bit for better performance)
2. TVF XML solution by Oleg (Split_XML_Solution_By_Oleg)
Surprisingly, the RCTE was the winner on some occasions, and by a good margin. With the same 10000 rows and 1133 elements of 125-150 characters each, the RCTE beat the XML solution marginally, but when the number of elements was decreased to 500, the RCTE was almost twice as fast as XML. Still, I would not count out either solution, as each could fit depending on the requirement and environment.
As for the RCTE solution, I must say it is more resource intensive. The index spool and sort operators in the execution plan clearly indicate that it hits memory, tempdb, and the processors hard. That also suggests that with more CPUs, more memory, and a well-provisioned tempdb, this solution can perform better relative to the others. Having said that, I may still tilt towards the XML solution.
To me, the reason the RCTE performs so much better than the tally table solution (both use CHARINDEX and SUBSTRING) is that the tally table solution compares every character with the delimiter, and due to the out-of-row handling, this does not scale well, whereas the RCTE solution does it only the required number of times. Moreover, up to VARCHAR(8000), the SUBSTRING(@pString, N, 1) = @pDelimiter test is handled as a predicate in addition to the seek predicate, but for VARCHAR(MAX) it is divided into two steps, i.e., a seek predicate followed by a separate Filter operator, which decreases performance quite a bit.
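A quick way to see this for yourself (just a sketch; any of the splitter calls above will do) is to capture the plan and check where the delimiter test lands:
SET SHOWPLAN_XML ON;
GO
SELECT * FROM dbo.Split8KTally_With_Column_N1('a,b,c,d', ',');
GO
SET SHOWPLAN_XML OFF;
GO
For the VARCHAR(8000) splitters, the SUBSTRING test shows up as a residual Predicate on the clustered index seek of the Tally table; for the VARCHAR(MAX) version below, it appears as a separate Filter operator above the seek.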
By the way, this is what my tally table VARCHAR(MAX) version looks like:
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
GO
--Create Split Tally Function VARCHAR(MAX) Version
CREATE FUNCTION dbo.Split8KTallyM (
@pString VARCHAR(MAX), @pDelimiter VARCHAR(1)
)
RETURNS @Results TABLE (ItemNumber BIGINT, ItemValue VARCHAR(MAX))
AS
BEGIN
SET @pString = @pDelimiter --To avoid the OR condition in the WHERE clause for the first item
+ @pString
+ @pDelimiter --To avoid the ISNULL/NULLIF
INSERT INTO @Results
(
[ItemNumber]
,[ItemValue]
)
SELECT ItemNumber = ROW_NUMBER() OVER ( ORDER BY N )
, Item = SUBSTRING(@pString, N1,
CAST(CHARINDEX(@pDelimiter ,
@pString COLLATE Latin1_General_BIN, N1) AS INT)
- N1)
FROM [dbo].[Tally] WITH ( NOLOCK ) --NOLOCK HINT IS NOT NECESSARY
WHERE N BETWEEN 1 AND
CONVERT(INT,LEN(@pString)) -- CAST TO THE COLUMN DATATYPE
- 1 -- TO EXCLUDE LAST DELIMITER
AND SUBSTRING(@pString, N,1) = @pDelimiter
RETURN;
END
GO
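A quick smoke test of the MAX version (just a sketch to verify the output shape):
SELECT ItemNumber, ItemValue
FROM dbo.Split8KTallyM(CAST('a,b,c' AS VARCHAR(MAX)), ',');
--Expected: (1,'a'), (2,'b'), (3,'c')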
Now the performance test for the RCTE and XML solutions (I did not include the tally table solution, as it is quite slow):
USE [tempdb]
GO
SET NOCOUNT ON;
GO
IF OBJECT_ID(N'dbo.iFunction', N'V') IS NOT NULL
DROP VIEW iFunction
GO
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
GO
CREATE VIEW dbo.iFunction AS
/**********************************************************************************************************************
Purpose:
This view is callable from UDF's which allows us to indirectly get a NEWID() within a function where we can't do such
a thing directly in the function. This view also solves the same problem for GETDATE().
Usage:
SELECT MyNewID FROM dbo.iFunction; --Returns a GUID
SELECT MyDate FROM dbo.iFunction; --Returns a Date
Revision History:
Rev 00 - 06 Jun 2004 - Jeff Moden - Initial creation
Rev 01 - 06 Mar 2011 - Jeff Moden - Formalize code. No logic changes.
**********************************************************************************************************************/
SELECT MyNewID = NEWID(),
MyDate = GETDATE();
GO
IF OBJECT_ID(N'dbo.CsvTest', N'U') IS NOT NULL
DROP TABLE CsvTest
GO
DECLARE @MaxElementWidth INT,
@MinElementWidth INT,
@NumberOfElements INT,
@NumberOfRows INT
/*======== PARAMETER VALUES ==============================*/
SELECT @MaxElementWidth= 150
, @MinElementWidth= 125
, @NumberOfElements= 500
, @NumberOfRows= 10000
/*========================================================*/
SELECT TOP (@NumberOfRows) --Controls the number of rows in the test table
ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
CSV =
(--==== This creates each CSV
SELECT CAST(
STUFF( --=== STUFF gets rid of the leading comma
( --=== This builds CSV row with a leading comma
SELECT TOP (@NumberOfElements) --Controls the number of CSV elements in each row
','
+ LEFT(--==== Builds random length variable within element width constraints
LEFT(REPLICATE('1234567890',CEILING(@MaxElementWidth/10.0)), @MaxElementWidth),
ABS(CHECKSUM((SELECT MyNewID FROM dbo.iFunction)))
% (@MaxElementWidth - @MinElementWidth + 1) + @MinElementWidth
)
FROM sys.All_Columns ac3 --Classic cross join pseudo-cursor
CROSS JOIN sys.All_Columns ac4 --can produce row sets of up to 16 million.
WHERE ac3.Object_ID <> ac1.Object_ID --Without this line, all rows would be the same.
FOR XML PATH('')
)
,1,1,'')
AS VARCHAR(MAX))
)
INTO CsvTest
FROM sys.All_Columns ac1 --Classic cross join pseudo-cursor
CROSS JOIN sys.All_Columns ac2 --can produce row sets of up to 16 million rows
GO
PRINT '/*====== dbo.CSVTest Population completed ================*/'
PRINT CHAR(10) + CHAR(13)
GO
ALTER TABLE CsvTest
ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100;
GO
IF OBJECT_ID(N'dbo.Split_RCTE', N'IF') IS NOT NULL
DROP FUNCTION dbo.Split_RCTE
GO
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
GO
--Create Split_RCTE function VARCHAR(MAX) version
CREATE FUNCTION dbo.Split_RCTE
(
@pString VARCHAR(MAX)
,@pDelimiter VARCHAR(1)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH cteSplit
AS ( SELECT StartPosition = 0
, EndPosition = CONVERT(INT, CHARINDEX(@pDelimiter,
@pString COLLATE Latin1_General_BIN))
UNION ALL
SELECT StartPosition = EndPosition + 1
, EndPosition = CONVERT(INT, CHARINDEX(@pDelimiter,
@pString COLLATE Latin1_General_BIN,
EndPosition + 1))
FROM cteSplit
WHERE EndPosition > 0
)
SELECT [ItemNumber] = ROW_NUMBER() OVER ( ORDER BY StartPosition )
, SUBSTRING(@pString, StartPosition,
CASE EndPosition
WHEN 0 THEN CONVERT(INT, LEN(@pString)) + 1
ELSE EndPosition - StartPosition
END) ItemValue
FROM cteSplit
GO
IF OBJECT_ID(N'dbo.Split_XML_Solution_By_Oleg', N'TF') IS NOT NULL
DROP FUNCTION dbo.Split_XML_Solution_By_Oleg
GO
--Create Split_XML function VARCHAR(MAX) version
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
GO
CREATE FUNCTION dbo.Split_XML_Solution_By_Oleg
(
@Parameter VARCHAR(MAX)
,@Delimiter VARCHAR(1)
)
RETURNS @Result TABLE
(
ItemNumber INT
,ItemValue VARCHAR(MAX)
)
AS
BEGIN
DECLARE @XML XML ;
SET @Parameter = ( SELECT @Parameter
FOR XML PATH('')
) ;
SELECT @XML = '<r>' + REPLACE(@Parameter,@Delimiter, '</r><r>') + '</r>' ;
INSERT INTO @Result
(
ItemNumber
,ItemValue
)
SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL) ) AS ItemNumber
, Item.value('text()[1]', 'VARCHAR(MAX)') AS ItemValue
FROM @XML.nodes('//r') R ( Item ) ;
RETURN ;
END ;
GO
PRINT '/*====== dbo.Split_RCTE ==================================*/'
DBCC FREEPROCCACHE WITH NO_INFOMSGS
SET STATISTICS TIME ON
DECLARE @ItemNumber BIGINT
, @Item VARCHAR(MAX) ;
SELECT @ItemNumber = V.ItemNumber
, @Item = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split_RCTE(D.Csv, ',') V
OPTION ( MAXRECURSION 0 )
SET STATISTICS TIME OFF
PRINT '/*========================================================*/'
PRINT CHAR(10) + CHAR(13)
GO
PRINT '/*====== dbo.Split_XML ===================================*/'
DBCC FREEPROCCACHE WITH NO_INFOMSGS
SET STATISTICS TIME ON
DECLARE @ItemNumber BIGINT
, @Item VARCHAR(MAX) ;
SELECT @ItemNumber = V.ItemNumber
, @Item = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split_XML_Solution_By_Oleg(D.Csv, ',') V
SET STATISTICS TIME OFF
PRINT '/*========================================================*/'
PRINT CHAR(10) + CHAR(13)
GO
As always, everyone's mileage may differ.
May 22, 2012 at 3:40 am
Too bad (for me) this new round comes at a bad time, as I will be too busy this week to participate in this round of challenges.
As for bad performance with some implementations, please try to benchmark with OPTION (MAXDOP 1) too, as there exists a bug in SQL Server 2008 and later that can (unpredictably) cause extreme slowdowns when parallel plans are used. We would not want to see results tainted by this, of course.
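For example, the RCTE harness above can be re-run with the hint added to the existing OPTION clause:
DECLARE @ItemNumber BIGINT, @Item VARCHAR(MAX);
SELECT @ItemNumber = V.ItemNumber, @Item = V.ItemValue
FROM dbo.CsvTest D
CROSS APPLY dbo.Split_RCTE(D.Csv, ',') V
OPTION (MAXRECURSION 0, MAXDOP 1); --Same test, parallelism disabled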