Intro
One day while playing with Solomon Rutzky's SQL# String_FixedWidthSplit I thought, why not crerate a purely set-based T-SQL fixed width splitter? Enter NSplit8K and NSplit2B. Both splitters are identical except that the 8K version handles varchar(8000) and the 2B version (2B for two billion*) handles varchar(max). The 8K version is faster but the 2B version handles strings longer than 8,000 characters. For situations where you need to split a CSV and you know that each token (AKA item) is the same size, you don't need a traditional "splitter" for the job, instead you can use a fixed width splitter instead. As you will see from the examples below, you can use these functions for more than just your traditional "string splitting" tasks.
Details
Each splitter takes three parameters:
Usages Examples
Example 1: Split the string, "ab,cd,ef,gg" into tokens
Here's how you would split a string when you know that each token is two characters long and the delimiter is one character long.
SELECT ItemNumber, Token FROM dbo.NSplit8K('ab,cd,ef,gg',2,1);
ItemNumber Token
-------------------- --------------------
1 ab
2 cd
3 ef
4 gg
Example 2: Extract the values "client01", "client02", etc... from string below
This is an example of how you could parse values from an XML fragment where the elements are the same size.
DECLARE @string varchar(100) = '<client01>,<client02>,<client03>,<client04>'; SELECT ItemNumber, Token FROM dbo.NSplit8K(SUBSTRING(@string,2,LEN(@string)-2),8,3);
Results
ItemNumber Token
-------------------- --------------------
1 client01
2 client02
3 client03
4 client04
Example 3: Nsplit into an N-Gram function
Here we're using NSplit to create an N-Gram function, a very powerful text mining tool.
DECLARE @string varchar(8000) = 'abcdefghi', @N int = 1; -- change this to 2 for bigrams, 3 for trigrams, 4 for 4-grams, etc... SELECT TOP (LEN(@string)-(@N-1)) ItemNumber, Token FROM dbo.NSplit8K('abcdefghi',@N,-(@N-1));
Results when running the above code with @N = 1
ItemNumber Token
-------------------- --------------------
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
When @N = 2
ItemNumber Token
-------------------- --------------------
1 ab
2 bc
3 cd
4 de
5 ef
6 fg
7 gh
8 hi
When @N = 3
ItemNumber Token
-------------------- --------------------
1 abc
2 bcd
3 cde
4 def
5 efg
6 fgh
7 ghi
Notice that, in the above N-Gram examples, we're not just returning the N-sized token, we're also returning it's position in the string. Below are the two functions. Happy text manipulating!
NSplit8K - The varchar(8000) version
CREATE FUNCTION dbo.NSplit8K ( @String varchar(8000), @TokenLen bigint, @DelimLen bigint ) /**************************************************************************************** Purpose: Splits an input string (@String) into @TokenLen sized tokens with @DelimLen sized gaps between each token. Intended to be a better performing alternative to the traditional "splitter" for cases where the size of the token (AKA "item") is always the same. Compatibility: SQL Server 2008+, Azure SQL Database, Azure SQL Data Warehouse & Parallel Data Warehouse Syntax: SELECT ItemNumber, Token FROM dbo.NSplit8K(@string, @length, @delimiterLength); Parameters: @String = varchar(8000); the input string to "split" into tokens @TokenLen = bigint; the size of the output token @DelimLen = bigint; the size of the delimiter. Can be set to 0 for when there's no delimiter or set to negetive if you desire overlap Returns: ItemNumber = bigint; represents the order that the token appears in the input string Token = varchar(8000); the @lengh-sized token returned by the function. Developer notes: 1. Requires GetNumsAB: http://www.sqlservercentral.com/scripts/Set+Based/139370/ Alternatively you can incorporate your own tally table logic at your own risk. 2. If the last token is shorter then @TokenLen then it will be truncated. 3. Tends to perform better with a serial execution plan; it may be adventageous to test with OPTION (MAXDOP 1) and compare that performance with a parallel plan. To force a parallel execution plan see make_parallel by Adam Machanic: sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx 4. There are no safeguards built into the function to make sure the tokens are all the same size. The only safeguard is to test and know your data. 5. If any of the parameters are NULL the function will return a a single NULL value. 6. NSplit8K is deterministic. For more about deterministic functions see: https://msdn.microsoft.com/en-us/library/ms178091.aspx Examples: --===== (1) Extract values "ab", "cd", "ef", and "gg"; token-size=2; delimiter = "," SELECT ItemNumber, Token FROM dbo.NSplit8K('ab,cd,ef,gg',2,1); --===== (2) Extract values "xxx", "yyy", "bbb", etc... token-size=3; delimiter = "," DECLARE @string1 varchar(100) = 'xxx,yyy,bbb,123,zzz'; SELECT ItemNumber, Token FROM dbo.NSplit8K(@string1,3,1); --===== (3) extract the values "client01", "client02", etc... DECLARE @string varchar(100) = '<client01>,<client02>,<client03>,<client04>'; SELECT ItemNumber, Token FROM dbo.NSplit8K(SUBSTRING(@string,2,LEN(@string)-2),8,3); --===== (4) If you dont need a delimiter to split the string SELECT ItemNumber, Token FROM dbo.NSplit8K('abcdefghi',3,0); --===== (5) Turn it into an N-Gram function DECLARE @string varchar(8000) = 'abcdefghi', @N int = 1; -- change this to 2 for bigrams, 3 for trigrams, 4 for 4-grams, etc... SELECT TOP (LEN(@string)-(@N-1)) ItemNumber, Token FROM dbo.NSplit8K('abcdefghi',@N,-(@N-1)); --===== (6) Split a string into unigrams then count the vowels SELECT VowelCount = COUNT(*) FROM dbo.NSplit8K('abcdefghi',1,0) WHERE Token LIKE '[aeiou]'; --===== (7) Behavior for NULL inputs SELECT ItemNumber, Token FROM dbo.NSplit8K(NULL,1,0); SELECT ItemNumber, Token FROM dbo.NSplit8K('ABC',NULL,0); SELECT ItemNumber, Token FROM dbo.NSplit8K('ABC',1,NULL); ---------------------------------------------------------------------------------------- Rev 00 - 20151030 - Alan Burstein - Initial Development Rev 01 - 20160407 - Alan Burstein - Rewrote using getnumsAB - Added logic for handling NULL inputs ****************************************************************************************/RETURNS TABLE WITH SCHEMABINDING AS RETURN SELECT ItemNumber = rn, Token = SUBSTRING(@string,n1,@TokenLen) FROM dbo.GetNumsAB(1,CONVERT(bigint,DATALENGTH(@string),0),@TokenLen+@DelimLen,1) WHERE @string IS NOT NULL AND @TokenLen+@DelimLen IS NOT NULL UNION ALL SELECT 1, NULL -- if any of the parameters are NULL then return a single NULL value: WHERE REPLICATE(@string,@TokenLen+@DelimLen) IS NULL;
NSplit2B - The varchar(max) version
CREATE FUNCTION dbo.NSplit2B ( @String varchar(max), @TokenLen bigint, @DelimLen bigint ) /**************************************************************************************** Purpose: Identical to NSplit8K but handles an input string of data type varchar(max). NSplit2B splits an input string (@String) into @TokenLen sized tokens with @DelimLen sized gaps between each token. Intended to be a better performing alternative to the traditional "splitter" for cases where the size of the token (AKA "item") is always the same. Compatibility: SQL Server 2008+, Azure SQL Database, Azure SQL Data Warehouse & Parallel Data Warehouse Syntax: SELECT ItemNumber, Token FROM dbo.NSplit2B(@string, @length, @delimiterLength) Parameters: @String = varchar(max); the input string to "split" into tokens @TokenLen = bigint; the size of the output token @DelimLen = bigint; the size of the delimiter. Can be set to 0 for when there's no delimiter or set to negetive if you desire overlap Returns: ItemNumber = bigint; represents the order that the token appears in the input string Token = varchar(max); the @lengh-sized token returned by the function. Developer notes: 1. Requires GetNumsAB: http://www.sqlservercentral.com/scripts/Set+Based/139370/ Alternatively you can incorporate your own tally table logic at your own risk. 2. If the last token is shorter then @TokenLen then it will be truncated. 3. Tends to perform better with a serial execution plan; it may be adventageous to test with OPTION (MAXDOP 1) and compare that performance with a parallel plan. To force a parallel execution plan see make_parallel by Adam Machanic: sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx 4. There are no safeguards built into the function to make sure the tokens are all the same size. The only safeguard is to test and know your data. 5. If any of the parameters are NULL the function will return a a single NULL value. 6. Will perform slower than NSplit8K due to the fact that it's handling varchar(max) strings. If you're sure all strings contain 8,000 characters or less use NSPlit8K. 7. NSplit2B is deterministic. For more about deterministic functions see: https://msdn.microsoft.com/en-us/library/ms178091.aspx Examples: --===== (1) extract the values "ab", "cd", "ef", and "gg"; token-size=2; delimiter="," SELECT ItemNumber, Token FROM dbo.NSplit2B('ab,cd,ef,gg',2,1); --===== (2) extract the values "xxx", "yyy", "bbb", etc... token-size=3; delimiter="," DECLARE @string1 varchar(100) = 'xxx,yyy,bbb,123,zzz'; SELECT ItemNumber, Token FROM dbo.NSplit2B(@string1,3,1); --===== (3) extract the values "client01", "client02", etc... DECLARE @string varchar(100) = '<client01>,<client02>,<client03>,<client04>'; SELECT ItemNumber, Token FROM dbo.NSplit2B(SUBSTRING(@string,2,LEN(@string)-2),8,3); --===== (4) If you dont need a delimiter to split the string SELECT ItemNumber, Token FROM dbo.NSplit2B('abcdefghi',3,0); --===== (5) Turn it into an N-Gram function DECLARE @string varchar(max) = 'abcdefghi', @N int = 1; -- change this to 2 for bigrams, 3 for trigrams, 4 for 4-grams, etc... SELECT TOP (LEN(@string)-(@N-1)) ItemNumber, Token FROM dbo.NSplit2B('abcdefghi',@N,-(@N-1)); --===== (6) Split a string into unigrams then count the vowels SELECT VowelCount = COUNT(*) FROM dbo.NSplit2B('abcdefghi',1,0) WHERE Token LIKE '[aeiou]'; --===== (7) Behavior for NULL inputs SELECT ItemNumber, Token FROM dbo.NSplit2B(NULL,1,0); SELECT ItemNumber, Token FROM dbo.NSplit2B('ABC',NULL,0); SELECT ItemNumber, Token FROM dbo.NSplit2B('ABC',1,NULL); ---------------------------------------------------------------------------------------- Rev 00 - 20151030 - Alan Burstein - Initial Development Rev 01 - 20160407 - Alan Burstein - Rewrote using getnumsAB - Added logic for handling NULL inputs ****************************************************************************************/RETURNS TABLE WITH SCHEMABINDING AS RETURN SELECT ItemNumber = rn, Token = SUBSTRING(@string,n1,@TokenLen) FROM dbo.GetNumsAB(1,CONVERT(bigint,DATALENGTH(@string),0),@TokenLen+@DelimLen,1) WHERE @string IS NOT NULL AND @TokenLen+@DelimLen IS NOT NULL UNION ALL SELECT 1, NULL -- if any of the parameters are NULL then return a single NULL value WHERE REPLICATE(@string,@TokenLen+@DelimLen) IS NULL;