While SQL Server 2016 hasn’t been released as of the time of this writing, it has entered into a Release Candidate state, currently RC0. One of the new features introduced in RC0 is the addition of the STRING_SPLIT function. This has resulted in the internet being abuzz with excitement about this function. This article will examine this function, including how it works, how it compares to other string splitting functions, and how it compares in performance to these other functions.
The purpose of the STRING_SPLIT function is to take a string that has been created by concatenating various elements together, using a delimiter to separate the elements, back into the individual elements. The syntax for this function is quite simple: STRING_SPLIT(string, separator), where string is the string to split and separator is the string to separate the various elements of the concatenated string. string can be any of the base character data types (char, nchar, varchar, nvarchar), and separator can be any of nchar(1), char(1), nvarchar(1) or varchar(1).
This function is implemented as a table-valued function (TVF), therefore it returns a table result set. This result set consists of one column, named value. If both string and separator are char/varchar, then the resulting data type of the value column is a varchar; if either string or separator is of a Unicode data type (nchar / nvarchar), then the resulting data type of the value column is nvarchar.
The first thing that I want to examine on this new function is whether the function operates as described. First up, let’s try splitting a string with a separator greater than one character. This query:
SELECT * INTO #temp FROM STRING_SPLIT(N'Hello, World, It''s, Me!', ', ');
Returns the error message:
Msg 214, Level 16, State 11, Line 3
Procedure expects parameter 'separator' of type 'nchar(1)/nvarchar(1)'.
Removing the space at the end of the separator allows this query to run. This does prove that the documentation is correct. Which, to me, is slightly disappointing – it would be nice if it could handle a multi-character delimiter.
Looking closer at the error message, it specifically says nchar/nvarchar, but doesn’t mention char/varchar. This makes me wonder about the data type of the result set. Let’s test that out next. In this example, I’ll send the results to a temp table, and then query the sys.columns DMV to see what data type was created:
SELECT * INTO #temp FROM STRING_SPLIT(N'Hello, World, It''s, Me!', ','); SELECT sc.name AS ColumnName, st.name AS DataTypeName FROM tempdb.sys.columns sc JOIN tempdb.sys.types st ON sc.system_type_id = st.system_type_id AND sc.user_type_id = st.user_type_id WHERE object_id = OBJECT_ID('tempdb.dbo.#temp');
Which returns this result set:
ColumnName DataTypeName
value varchar
Changing either of the parameters to be Unicode changes the resultant data type to be nvarchar, This confirms the first couple of details about how the function works are true as documented.
Functionality Testing
The next item to verify is the results – does the function properly handle delimited strings, even when thrown obstacles that will trip up some of the other string splitter routines? The following code, from Jeff Moden, tests that the splitter handles these conditions.
SELECT dt.*, '"' + ss.value + '"' AS ReturnValue FROM (VALUES (N'' ,N'Empty String'), (N' ' ,N'Blank Element'), (NULL ,N'Null Element'), (N'Single Element' ,N'Single Element'), (N',Leading/Trailing,' ,N'Leading/Trailing Delimiter #1' ), (N',Leading,Trailing,' ,N'Leading/Trailing Delimiter #2' ), (N'Adjacent,,Delimiters' ,N'Adjacent Delimiters #1' ), (N'Adjacent,,,Delimiters',N'Adjacent Delimiters #2' ), (N',' ,N'Delimiters Only #1' ), (N',,' ,N'Delimiters Only #2' ), ('Mr. & Mrs.' ,N'All characters' ), (N'"Embedded, Delimiter"',N'Embedded Delimiter') )dt (String,TestName) CROSS APPLY STRING_SPLIT(dt.String, ',') ss;
When this code is run, it returns the following result set:
String | TestName | ReturnValue |
Empty String | "" | |
Blank Element | " " | |
Single Element | Single Element | "Single Element" |
,Leading/Trailing, | Leading/Trailing Delimiter #1 | "" |
,Leading/Trailing, | Leading/Trailing Delimiter #1 | "Leading/Trailing" |
,Leading/Trailing, | Leading/Trailing Delimiter #1 | "" |
,Leading,Trailing, | Leading/Trailing Delimiter #2 | "" |
,Leading,Trailing, | Leading/Trailing Delimiter #2 | "Leading" |
,Leading,Trailing, | Leading/Trailing Delimiter #2 | "Trailing" |
,Leading,Trailing, | Leading/Trailing Delimiter #2 | "" |
Adjacent,,Delimiters | Adjacent Delimiters #1 | "Adjacent" |
Adjacent,,Delimiters | Adjacent Delimiters #1 | "" |
Adjacent,,Delimiters | Adjacent Delimiters #1 | "Delimiters" |
Adjacent,,,Delimiters | Adjacent Delimiters #2 | "Adjacent" |
Adjacent,,,Delimiters | Adjacent Delimiters #2 | "" |
Adjacent,,,Delimiters | Adjacent Delimiters #2 | "" |
Adjacent,,,Delimiters | Adjacent Delimiters #2 | "Delimiters" |
, | Delimiters Only #1 | "" |
, | Delimiters Only #1 | "" |
, | Delimiters Only #2 | "" |
, | Delimiters Only #2 | "" |
, | Delimiters Only #2 | "" |
Mr. & Mrs. | All characters | "Mr. & Mrs." |
"Embedded, Delimiter" | Embedded Delimiter | ""Embedded" |
"Embedded, Delimiter" | Embedded Delimiter | " Delimiter"" |
The query passes each test condition into the STRING_SPLIT function, and returns the results from the splitting. Examining the results, we can see that it handled everything properly.
Hang on a minute! Something is missing… the NULL input does not have a corresponding row in the result set! This is consistent with the way that SQL handles a NULL in the OPENXML and OPENJSON table-valued functions, as explained in this connect item that I opened up about this: https://connect.microsoft.com/SQLServer/feedback/details/2454210/string-split-function-returns-no-result-when-a-null-value-is-passed-in. It looks like this is by design, and it won’t be changed. That being said, all is not lost. If you need to return a NULL with a NULL input, just use the OUTER APPLY operator instead of the CROSS APPLY operator.
At this point, we have verified that the STRING_SPLIT function works as expected (well, unless you expect a NULL input to have an output).
Perhaps you may have noticed that I just mentioned another new function that was introduced in SQL Server 2016: OPENJSON. Yes, SQL Server 2016 introduces JSON support. This function takes a JSON string and converts it to a table. But why am I mentioning this here? Well, because a JSON string is comma-delimited, and it splits those elements into a table. Sounds just like what we are working with here.
The OPENJSON function accepts a string of a JSON expression, and an optional path into the string. We won’t be using the path parameter. For a simple example, let's try out the following statement:
SELECT * FROM OPENJSON('[1,2,3,4,5]');
Notice that a JSON expression has the square brackets, which I’ve added here. This statement returns the following result set:
key | value | type |
0 | 1 | 2 |
1 | 2 | 2 |
2 | 3 | 2 |
3 | 4 | 2 |
4 | 5 | 2 |
The key column is the name of the specified property, or the zero-based index position of the element in the array… something that is missing from the STRING_SPLIT function. The type column returns the JSON data type. For further details, see https://msdn.microsoft.com/en-us/library/dn921885.aspx.
That is how OPENJSON works with numbers… but it works differently for strings. If the following statement is run:
SELECT * FROM OPENJSON('[ABC,DEF,GHI]');
Then this error is returned:
Msg 13609, Level 16, State 4, Line 29
JSON text is not properly formatted. Unexpected character 'A' is found at position 1.
As it turns out, JSON string elements need to be delimited with a double-quote, like so:
SELECT * FROM OPENJSON('["ABC","DEF","GHI"]');
Therefore, in addition to needing to add the square brackets, double-quotes need to be inserted such that every element within the JSON expression is delimited with them. Furthermore, if the element includes a double-quote, it needs to be escaped using the forward-slash character "\".
The following code runs the above test against the OPENJSON function:
SELECT dt.*, ss4.value FROM (VALUES (N'' ,N'Empty String') ,(N' ' ,N'Blank Element') ,(NULL ,N'Null Element') ,(N'Single Element' ,N'Single Element') ,(N',Leading/Trailing,' ,N'Leading/Trailing Delimiter #1' ) ,(N',Leading,Trailing,' ,N'Leading/Trailing Delimiter #2' ) ,(N'Adjacent,,Delimiters' ,N'Adjacent Delimiters #1' ) ,(N'Adjacent,,,Delimiters',N'Adjacent Delimiters #2' ) ,(N',' ,N'Delimiters Only #1' ) ,(N',,' ,N'Delimiters Only #2' ) ,('Mr. & Mrs.' ,N'All characters' ) ,(N'"Embedded, Delimiter"',N'Embedded Delimiter') )dt (String,TestName) CROSS APPLY (VALUES ('["' + REPLACE(REPLACE(dt.String, '"', '\"'), ',', '","') + '"]')) ca(JSONSTR) CROSS APPLY OPENJSON(ca.JSONSTR) ss4;
This code returns the same result set as the STRING_SPLIT function. This makes me wonder if behind the scenes SQL Server uses the STRING_SPLIT function in the OPENJSON function – after all, why have two functions that split strings?
Creating Large Scale Test Data
The next thing to test is to compare the performance of these new functions against other known string splitters. For any performance test, you need to have some adequate test data. The following query, acquired from a SQLSaturday presentation from Jeff Moden, creates a test table with 10,000 rows of randomly delimited data.
--===================================================================================================================== -- Create the objects necessary to build the "ProdTable" table for the demo. --===================================================================================================================== --===== If the view that allows the use of NEWID() in a function already exists, drop it to make reruns in SSMS easier IF OBJECT_ID('dbo.iFunction') IS NOT NULL DROP VIEW dbo.iFunction ; --===== Create the view that allows the use of NEWID() in a function. GO CREATE VIEW dbo.iFunction AS /********************************************************************************************************************** Purpose: This view is callable from UDF's which allows us to indirectly get a NEWID() within a function where we can't do such a thing directly in the function. This view also solves the same problem for GETDATE(). Usage: SELECT MyNewID FROM dbo.iFunction; --Returns a GUID SELECT MyDate FROM dbo.iFunction; --Returns a Date Revision History: Rev 00 - 06 Jun 2004 - Jeff Moden - Initial creation Rev 01 - 06 Mar 2011 - Jeff Moden - Formalize code. No logic changes. **********************************************************************************************************************/SELECT MyNewID = NEWID(), MyDate = GETDATE(); GO --===== If the function that programmatically creates suitably variable CSV values already exists, -- drop it to make reruns in SSMS easier. IF OBJECT_ID('dbo.CreateCsv8K') IS NOT NULL DROP FUNCTION dbo.CreateCsv8K ; --===== Create the function that programatically creates suitably variable CSV values. GO CREATE FUNCTION dbo.CreateCsv8K /********************************************************************************************************************** Purpose: Create a CSV table result with a programmable number of rows, elements per row, minimum # of characters per element, and maximum characters per element. The element size is random in nature constrained by the min and max characters per element. Usage: SELECT * FROM dbo.CreateCsv8K(@pNumberOfRows, @pNumberOfElementsPerRow, @pMinElementwidth, @pMaxElementWidth) Dependencies: 1. View: dbo.iFunction (Produces a NEWID() usable from within a UDF) Programmer's Notes: 1. The randomness of the elements prevents the delimiters for showing up in the same position for each row so that SQL Server won't figure that out and cache the information making some splitting techniques seem faster than they really are. 2. No validation or constraints have been place on the input parameters so use with caution. This code can generate a lot of data in a couple of heart beats. Revision History: Rev 00 - 11 May 2007 - Jeff Moden - Initial creation - Only returned one row and wasn't programmable. Rev 01 - 26 Jul 2009 - Jeff Moden - Added programmable variables but would only go to 20 characters wide. Rev 02 - 06 Mar 2011 - Jeff Moden - Converted to iTVF, added minimum element width, and made it so elements can be virtually any size within 8k Bytes. Rev 03 - 14 Sep 2015 - Jeff Moden - Changed content of each element from a simple number to be much more complex and to include letters instead of just digits by using part of a NEWID(). **********************************************************************************************************************/--===== Declare the I/0 ( @pNumberOfRows INT, @pNumberOfElementsPerRow INT, @pMinElementwidth INT, @pMaxElementWidth INT ) RETURNS TABLE AS RETURN --===== This creates and populates a test table on the fly containing a -- sequential column and a randomly generated CSV Parameter column. SELECT TOP (@pNumberOfRows) --Controls the number of rows in the test table ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum, CSV = (--==== This creates each CSV SELECT CAST( STUFF( --=== STUFF get's rid of the leading comma ( --=== This builds CSV row with a leading comma SELECT TOP (@pNumberOfElementsPerRow) --Controls the number of CSV elements in each row ',' + LEFT(--==== Builds random length variable within element width constraints LEFT(REPLICATE((SELECT RIGHT(MyNewID,10) FROM dbo.iFunction) ,CEILING(@pMaxElementWidth/10.0)) ,@pMaxElementWidth), ABS(CHECKSUM((SELECT MyNewID FROM dbo.iFunction))) % (@pMaxElementWidth - @pMinElementwidth + 1) + @pMinElementwidth ) FROM sys.All_Columns ac3 --Classic cross join pseudo-cursor CROSS JOIN sys.All_Columns ac4 --can produce row sets up 16 million. WHERE ac3.Object_ID <> ac1.Object_ID --Without this line, all rows would be the same. FOR XML PATH('') ) ,1,1,'') AS VARCHAR(8000)) ) FROM sys.All_Columns ac1 --Classic cross join pseudo-cursor CROSS JOIN sys.All_Columns ac2 --can produce row sets up 16 million rows ; GO --=========================================================================== -- Create the stored procedure to create and populate the "ProdTable" -- table for the demo. -- Usage: EXEC dbo.CreateProdTable -- @pNumberOfRows INT --,@pNumberOfElementsPerRow INT --,@pMinElementwidth INT --,@pMaxElementWidth INT; --=========================================================================== --===== If the procedure already exists, drop it to make reruns in SSMS easier. IF OBJECT_ID('dbo.CreateProdTable','P') IS NOT NULL DROP PROCEDURE dbo.CreateProdTable ; GO CREATE PROCEDURE dbo.CreateProdTable @pNumberOfRows INT ,@pNumberOfElementsPerRow INT ,@pMinElementwidth INT ,@pMaxElementWidth INT AS --===== Create and populate the "ProdTable" table IF OBJECT_ID('dbo.ProdTable','U') IS NOT NULL DROP TABLE dbo.ProdTable; --===== Create the table CREATE TABLE dbo.ProdTable ( RowNum INT IDENTITY(1,1) ,String NVARCHAR(4000) ) ; /* DECLARE @RowCounter INTEGER = 0; WHILE @RowCounter < @pNumberOfRows BEGIN --*/ --===== Populate the table INSERT dbo.ProdTable (String) SELECT CSV FROM dbo.CreateCsv8K ( --1 @pNumberOfRows ,@pNumberOfElementsPerRow ,@pMinElementwidth ,@pMaxElementWidth ) WHERE RowNum > 0 --MUST be included to keep from creating "grooved" data. OPTION (QUERYTRACEON 9481) ; /* SET @RowCounter = @RowCounter + 1; END; --*/GO EXEC dbo.CreateProdTable 10000, 165, 1, 25; GO PRINT 'All Done' select * from dbo.Prodtable;
Note the “OPTION (QUERYTRACEON 9481)” line in the dbo.CreateProdTable procedure. SQL Server 2014 introduced a new cardinality estimator (CE 120), and it causes this query to return all of the rows with identical data. This trace flag makes the query statement utilize the older cardinality estimator (CE 70), which will create these rows with random data.
Now that we have some test data to work with, it’s time to create the test. One of the more important things to avoid is returning results to the client – this can introduce variables that can’t be controlled. The test code avoids this in two different methods. The first is by sending the results into a variable, which is overwritten with every row. A virtual bit bucket, if you will. The second way is to send the results into a temporary table. This second method to me is more like the real world - are you going to split strings, and just throw away the results? The first way does have the benefit of testing raw speed with minimal overhead.
This test harness will test each of these methods, five times each, storing the timing results into another temporary table. When the test is completed, it will return the results as an average for the function being tested and for which method of handling the results.
Next up: what functions are being tested? The first challenger to the STRING_SPLIT and OPENJSON functions is the DelimitedSplit8K function, written by Jeff Moden in this article http://www.sqlservercentral.com/articles/Tally+Table/72993/. The next challenger is the CLR function that Jeff used in that article. Both of these functions can be downloaded from the “Build all objects and run all tests.zip” file in the resources section of Jeff’s article.
Performance Test 1: Random-width, delimited, random data
The test code that I used is:
IF OBJECT_ID('tempdb.dbo.#Times') IS NOT NULL DROP TABLE #Times; CREATE TABLE #Times (RowID INTEGER, FunctionName sysname, ResultsTo VARCHAR(20), StartTime DATETIME, EndTime DATETIME NULL); DECLARE @RowID INTEGER = 0, @Loop INTEGER = 0, @BitBucket VARCHAR(MAX); WHILE @Loop < 5 BEGIN SET @Loop += 1; IF OBJECT_ID('tempdb.dbo.#test1') IS NOT NULL DROP TABLE #test1; IF OBJECT_ID('tempdb.dbo.#test2') IS NOT NULL DROP TABLE #test2; IF OBJECT_ID('tempdb.dbo.#test3') IS NOT NULL DROP TABLE #test3; IF OBJECT_ID('tempdb.dbo.#test4') IS NOT NULL DROP TABLE #test4; SET @RowID += 1; INSERT INTO #Times (RowID, FunctionName, ResultsTo, StartTime) VALUES (@RowID, 'DelimitedSplit8K', '@BitBucket', GETDATE()); SELECT @BitBucket = ds.Item FROM dbo.ProdTable CROSS APPLY dbo.DelimitedSplit8K(String, ',') ds UPDATE #Times SET EndTime = GETDATE() WHERE RowID = @RowID; SET @RowID += 1; INSERT INTO #Times (RowID, FunctionName, ResultsTo, StartTime) VALUES (@RowID, 'CLR_SPLIT', '@BitBucket', GETDATE()); SELECT @BitBucket = ss.Item FROM dbo.ProdTable CROSS APPLY dbo.Split(String, ',') ss UPDATE #Times SET EndTime = GETDATE() WHERE RowID = @RowID; SET @RowID += 1; INSERT INTO #Times (RowID, FunctionName, ResultsTo, StartTime) VALUES (@RowID, 'STRING_SPLIT', '@BitBucket', GETDATE()); SELECT @BitBucket = ss.value FROM dbo.ProdTable CROSS APPLY STRING_SPLIT(String, ',') ss UPDATE #Times SET EndTime = GETDATE() WHERE RowID = @RowID; SET @RowID += 1; INSERT INTO #Times (RowID, FunctionName, ResultsTo, StartTime) VALUES (@RowID, 'OPENJSON', '@BitBucket', GETDATE()); SELECT @BitBucket = ss.value FROM dbo.ProdTable CROSS APPLY (VALUES (REPLACE(REPLACE(String, '"', '\"'), ',', '","'))) ca(String) CROSS APPLY OPENJSON(ca.String) ss UPDATE #Times SET EndTime = GETDATE() WHERE RowID = @RowID; SET @RowID += 1; INSERT INTO #Times (RowID, FunctionName, ResultsTo, StartTime) VALUES (@RowID, 'DelimitedSplit8K', 'TempTable', GETDATE()); SELECT ds.Item INTO #test1 FROM dbo.ProdTable CROSS APPLY dbo.DelimitedSplit8K(String, ',') ds UPDATE #Times SET EndTime = GETDATE() WHERE RowID = @RowID; SET @RowID += 1; INSERT INTO #Times (RowID, FunctionName, ResultsTo, StartTime) VALUES (@RowID, 'CLR_SPLIT', 'TempTable', GETDATE()); SELECT ss.Item INTO #test2 FROM dbo.ProdTable CROSS APPLY dbo.Split(String, ',') ss UPDATE #Times SET EndTime = GETDATE() WHERE RowID = @RowID; SET @RowID += 1; INSERT INTO #Times (RowID, FunctionName, ResultsTo, StartTime) VALUES (@RowID, 'STRING_SPLIT', 'TempTable', GETDATE()); SELECT ss.value INTO #test3 FROM dbo.ProdTable CROSS APPLY STRING_SPLIT(String, ',') ss UPDATE #Times SET EndTime = GETDATE() WHERE RowID = @RowID; SET @RowID += 1; INSERT INTO #Times (RowID, FunctionName, ResultsTo, StartTime) VALUES (@RowID, 'OPENJSON', 'TempTable', GETDATE()); SELECT ss.value INTO #test4 FROM dbo.ProdTable CROSS APPLY (VALUES (REPLACE(REPLACE(String, '"', '\"'), ',', '","'))) ca(String) CROSS APPLY OPENJSON(ca.String) ss UPDATE #Times SET EndTime = GETDATE() WHERE RowID = @RowID; END; SELECT ResultsTo, AVG(ca.ElapsedMS) AS Average, FunctionName FROM #Times CROSS APPLY (VALUES (DATEDIFF(MILLISECOND, StartTime, EndTime))) ca(ElapsedMS) GROUP BY FunctionName, ResultsTo ORDER BY ResultsTo, AVG(ca.ElapsedMS);
The results of this performance test are:
ResultsTo | Average | FunctionName |
@BitBucket | 4512 | CLR_SPLIT |
@BitBucket | 4575 | STRING_SPLIT |
@BitBucket | 13534 | OPENJSON |
@BitBucket | 52164 | DelimitedSplit8K |
TempTable | 3186 | CLR_SPLIT |
TempTable | 9268 | STRING_SPLIT |
TempTable | 20257 | OPENJSON |
TempTable | 51925 | DelimitedSplit8K |
From these results, it can be seen that the new STRING_SPLIT function compares to the CLR function, however it trails slightly. However, both readily outperformed the OPENJSON function and the T-SQL based DelimitedSplit8K function.
Performance test 2: Random-width, delimited identical data
Do you remember that trace flag used to create random data? I rebuilt the test data without that trace flag (creating 10,000 rows of identical data), and then ran the test again. This produced the following results:
ResultsTo | Average | FunctionName |
@BitBucket | 569 | CLR_SPLIT |
@BitBucket | 4712 | STRING_SPLIT |
@BitBucket | 11427 | DelimitedSplit8K |
@BitBucket | 14512 | OPENJSON |
TempTable | 993 | CLR_SPLIT |
TempTable | 8416 | STRING_SPLIT |
TempTable | 10984 | DelimitedSplit8K |
TempTable | 20624 | OPENJSON |
The point of this is that the new STRING_SPLIT and OPENJSON functions ran for the same approximate time, while both of the others ran faster – and dramatically so.
Performance test 3: Fixed-width, delimited, random-data
When splitting strings for live data, I would normally expect more randomized data. However, I have seen occurrences of delimited fixed-width data (where the delimiters are in the same spot in all of the rows). I modified the test data to contain random, fixed width delimited data, and rerunning the test produces the following results:
ResultsTo | Average | FunctionName |
@BitBucket | 1934 | CLR_SPLIT |
@BitBucket | 3274 | STRING_SPLIT |
@BitBucket | 11560 | OPENJSON |
@BitBucket | 42808 | DelimitedSplit8K |
TempTable | 1484 | CLR_SPLIT |
TempTable | 4930 | STRING_SPLIT |
TempTable | 14234 | OPENJSON |
TempTable | 42580 | DelimitedSplit8K |
Again, it can be seen that the CLR function is still out-performing the new STRING_SPLIT function, though it has slowed somewhat. However, the DS8K function has degraded to nearly the original timings.
Functionalty test, part deux
Before you decide to abandon the new STRING_SPLIT and OPENJSON functions and just use the CLR function, let’s run the functionality test against the CLR splitter. Doing so produces these results:
String | TestName | ReturnValue |
Blank Element | " " | |
Single Element | Single Element | "Single Element" |
,Leading/Trailing, | Leading/Trailing Delimiter #1 | "" |
,Leading/Trailing, | Leading/Trailing Delimiter #1 | "Leading/Trailing" |
,Leading,Trailing, | Leading/Trailing Delimiter #2 | "" |
,Leading,Trailing, | Leading/Trailing Delimiter #2 | "Leading" |
,Leading,Trailing, | Leading/Trailing Delimiter #2 | "Trailing" |
Adjacent,,Delimiters | Adjacent Delimiters #1 | "Adjacent" |
Adjacent,,Delimiters | Adjacent Delimiters #1 | "" |
Adjacent,,Delimiters | Adjacent Delimiters #1 | "Delimiters" |
Adjacent,,,Delimiters | Adjacent Delimiters #2 | "Adjacent" |
Adjacent,,,Delimiters | Adjacent Delimiters #2 | "" |
Adjacent,,,Delimiters | Adjacent Delimiters #2 | "" |
Adjacent,,,Delimiters | Adjacent Delimiters #2 | "Delimiters" |
, | Delimiters Only #1 | "" |
,, | Delimiters Only #2 | "" |
,, | Delimiters Only #2 | "" |
Mr. & Mrs. | All characters | "Mr. & Mrs." |
"Embedded, Delimiter" | Embedded Delimiter | ""Embedded" |
"Embedded, Delimiter" | Embedded Delimiter | " Delimiter"" |
The first thing to note is that there are 6 fewer rows than expected… an indication that there is something wrong. Investigating further, two complete tests are not in the output; these are the NULL element and the Blank Element tests. The final missing results are from the four strings with trailing delimiters – they do not have the final element. This makes me wonder how the function would work with multiple leading / trailing delimiters in a row. A quick test:
SELECT * FROM dbo.Split(N',,,Hello,World,It''s,Me,,,', N',')
Shows that the only element missing is the last trailing delimiter.
Function Summary
Wrapping up, we have the T-SQL based DelimitedSplit8K function. It works with strings, but it is limited to 8000 characters, and it doesn’t handle Unicode data. It does return an element ordinal indicating the sequence of the element in the original string, as well as each delimited item. It also passes all of the tests. Identical data is processed significantly faster. However, it is the slowest of the four functions tested in this article.
The CLR function is overall the fastest. It also handles all char/nchar data, up to the max size limit of 2TB, and returns the element's ordinal position. However, it fails some of the tests. Tests with delimiters in the same position are processed significantly faster, with only a minor difference if they are identical.
The new STRING_SPLIT function also handles all char/nchar data up to the max size limit, and passes most of the tests. The one test that it fails can be compensated for with the OUTER APPLY operator. It is significantly faster than the DelimitedSplit8K function. There is only a slight improvement with delimiters in the same position. However, it does not have the element ordinal position.
All of these functions only work with a separator of a single character.
Finally, the new OPENJSON function also handles all char/nchar data, and passes most of the tests. As with the STRING_SPLIT function, the one test that it fails can be compensated for with the OUTER APPLY operator. Performance wise, it falls between the DelimitedSplit8K function and the STRING_SPLIT function, and if all of the delimiters are in the same position there is a performance improvement. Additionally, it does have the element’s ordinal position, albeit it is zero-based. However, it only works with comma-delimited strings, and if you aren’t using numerics, then you need to ensure that each element is delimited with double-quotes (and escaping existing ones). Furthermore, you need to make this a proper JSON expression by including the square brackets.
Following is a quick comparison chart to show the different features of the functions.
Test | DS8K | CLR | STRING_SPLIT | OPENJSON |
Max Character Data | 8000 | MAX | MAX | MAX |
Max Unicode Data | 4000 (with DSN4K) | MAX | MAX | MAX |
Passes functionality test | Yes | No (missing NULL, blank inputs and trailing delimiters) | No (missing NULL input, can use OUTER APPLY to compensate) | No (missing NULL input, can use OUTER APPLY to compensate) |
Results include element's ordinal position | Yes | Yes | No | Yes (zero-based) |
Delimiter size | 1 | 1 | 1 | 1 |
Can specify delimiter? | Yes | Yes | Yes | No (comma only) |
Special handling of input strings? | No | No | No | Strings must be delimited with double-quotes; embedded double-quotes in string must be escaped. Must make string a valid JSON expression by introducing square brackets. |
Conclusion
Q: What’s the best way to split strings then?
A: It depends!
If you need the element’s ordinal position, then the new STRING_SPLIT function is not a contender. Otherwise, it handles all of the other functionality tests (or it can be handled with the OUTER APPLY operator), and it is significantly faster than the DelimitedSplit8K function. All of the other functions have the ordinal position.
If you need to work with Unicode data, the DelimitedSplit8K function can be adapted to work with it (there is a DelimitedSplitN4K function available for this). The other functions all work with Unicode data.
If you need to work with data of a max data type, then the DelimitedSplit8K function cannot be used (the performance with a max data type is considerably worse). All of the other functions can work with a max data type.
If your strings won’t have trailing delimiters, and they won’t have empty strings, then the CLR function would be an excellent choice. However, can you guarantee this won’t be used somewhere where your data would have this type of strings? Even in the future, where someone might say "Oh, we already have this string splitting routine, I'll just use that" - without testing it thoroughly? I wouldn’t chance it.
If you are working with comma-delimited numbers only, then the OPENJSON function is a strong contender. If working with strings, then you need additional code to handle making the string a proper JSON expression for strings. Either way, it also requires adding the square brackets to make a valid JSON expression. Even with this extra code, it still out-performs the DelimitedSplit8K function. Personally, I feel that this is too many differences for how to use this function for this to be in your string-splitting toolbox. Furthermore, if your data isn’t already comma-delimited, then you will need to convert it prior to being used with this function. With faster methods available, I really can’t recommend this function.
Therefore, my recommendation comes down to whether or not you need the element’s ordinal position – if so, then use the DelimitedSplit8K function; if not, use the STRING_SPLIT function - just be sure to use it with an OUTER APPLY to return a row when a NULL input is passed to it.
It is my opinion that the new STRING_SPLIT function should not be slower than the CLR function. Microsoft should have been able to create a solution that is at least as fast as that.
I’ve brought up a few times about the element’s ordinal position. This would be necessary if the elements need to be ordered in the order that they are listed in the string – which is a frequent occurrence. There is a connect item for adding this feature to the STRING_SPLIT function at https://connect.microsoft.com/SQLServer/Feedback/Details/2433171.
A special thanks to Jeff Moden for allowing me to repost his code here.
References: