December 9, 2018 at 9:44 pm
Comments posted to this topic are about the item An Overview of STRING_SPLIT()
December 9, 2018 at 10:46 pm
Nice article Steve
Would be interested in seeing how the new function stacks up against the "custom" split functions that every DBA seems to have a few versions of.
Cheers
Phil
--------------------
Colt 45 - the original point and click interface
December 9, 2018 at 11:36 pm
Thanks for the write-up Steve.
😎
One thing, there are performance references but no mention of the fastest non-CLR string splitting function (on SQL Server 2012 and later).
December 10, 2018 at 7:03 am
Eirikur Eiriksson - Sunday, December 9, 2018 11:36 PM
Thanks for the write-up Steve.
😎
One thing, there are performance references but no mention of the fastest non-CLR string splitting function (on SQL Server 2012 and later).
Added a link
December 10, 2018 at 7:03 am
philcart - Sunday, December 9, 2018 10:46 PM
Nice article Steve
Would be interested in seeing how the new function stacks up against the "custom" split functions that every DBA seems to have a few versions of.
Cheers
Phil
If you look at the reference to Aaron Bertrand's article, he rates performance in different ways. STRING_SPLIT() stacks up nicely.
December 10, 2018 at 8:24 am
Steve Jones - SSC Editor - Monday, December 10, 2018 7:03 AM
philcart - Sunday, December 9, 2018 10:46 PM
Nice article Steve
Would be interested in seeing how the new function stacks up against the "custom" split functions that every DBA seems to have a few versions of.
Cheers
Phil
If you look at the reference to Aaron Bertrand's article, he rates performance in different ways. STRING_SPLIT() stacks up nicely.
If you look at Aaron Bertrand's article, he comes to the conclusion that the XML method for splitting is as good as the DelimitedSplit8K method. Nothing could be further from the truth. The problem is with the "Devils in the Data" (I've gotta finish the articles on that subject for you soon). He creates "grooved" test data that has a cardinality of "1". That seriously skews the test results because the XML method can take advantage of such low cardinalities for a large set. But, if you have a table of, say, 10,000 rows with a cardinality of 10,000, the XML splitter method becomes a huge performance issue.
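To make the "grooved" point concrete, here's a hypothetical sketch (table names and values are my own invention, not Aaron's actual test harness) of low-cardinality versus high-cardinality test data; a fair splitter benchmark needs the latter:

```sql
-- Hypothetical test-data sketch (NOT Aaron Bertrand's actual harness).
-- "Grooved" data: 10,000 rows that all contain the SAME CSV string
-- (cardinality of 1). The XML splitter can effectively reuse its work here.
IF OBJECT_ID('tempdb..#Grooved') IS NOT NULL DROP TABLE #Grooved;
SELECT TOP (10000)
       CSV = CAST('1,2,3,4,5,6,7,8,9,10' AS VARCHAR(8000))
INTO   #Grooved
FROM   sys.all_columns ac1
CROSS JOIN sys.all_columns ac2;

-- High-cardinality data: 10,000 rows where (nearly) every CSV string is
-- different. This is the case Jeff describes, where the XML method tanks.
IF OBJECT_ID('tempdb..#Varied') IS NOT NULL DROP TABLE #Varied;
SELECT TOP (10000)
       CSV = CAST(CONCAT(ABS(CHECKSUM(NEWID())), ',',
                         ABS(CHECKSUM(NEWID())), ',',
                         ABS(CHECKSUM(NEWID()))) AS VARCHAR(8000))
INTO   #Varied
FROM   sys.all_columns ac1
CROSS JOIN sys.all_columns ac2;
```

Running any splitter over both tables and comparing timings exposes the skew described above.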
--Jeff Moden
Change is inevitable... Change for the better is not.
December 10, 2018 at 8:42 am
Thank you for sharing good news.
On another note, I've been using the free KNIME analytics platform for data manipulation, and most of the time it's necessary to add the values as columns rather than rows.
December 10, 2018 at 8:44 am
Jeff Moden - Monday, December 10, 2018 8:24 AM
If you look at Aaron Bertrand's article, he comes to the conclusion that the XML method for splitting is as good as the DelimitedSplit8K method. Nothing could be further from the truth. The problem is with the "Devils in the Data" (I've gotta finish the articles on that subject for you soon). He creates "grooved" test data that has a cardinality of "1". That seriously skews the test results because the XML method can take advantage of such low cardinalities for a large set. But, if you have a table of, say, 10,000 rows with a cardinality of 10,000, the XML splitter method becomes a huge performance issue.
+100
😎
December 10, 2018 at 9:05 am
Steve Jones - SSC Editor - Monday, December 10, 2018 7:03 AM
Eirikur Eiriksson - Sunday, December 9, 2018 11:36 PM
Thanks for the write-up Steve.
😎
One thing, there are performance references but no mention of the fastest non-CLR string splitting function (on SQL Server 2012 and later).
Added a link
Thanks Steve, don't think there are too many (poor buggers) on 2008 or earlier,
😎
December 10, 2018 at 9:09 am
Jeff Moden - Monday, December 10, 2018 8:24 AM
If you look at Aaron Bertrand's article, he comes to the conclusion that the XML method for splitting is as good as the DelimitedSplit8K method. Nothing could be further from the truth. The problem is with the "Devils in the Data" (I've gotta finish the articles on that subject for you soon). He creates "grooved" test data that has a cardinality of "1". That seriously skews the test results because the XML method can take advantage of such low cardinalities for a large set. But, if you have a table of, say, 10,000 rows with a cardinality of 10,000, the XML splitter method becomes a huge performance issue.
AB's article isn't bad but it's incomplete and hasn't been appropriately updated.
😎
One thing AB missed entirely is the efficiency of fixed-width parameter strings, which blows the chops off any other method by a considerable margin (must finish that article)
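As a rough illustration of the fixed-width idea (my own sketch, not from Eirikur's unfinished article): if every element is padded to a known width, each value can be extracted by position alone with SUBSTRING, with no delimiter scan and the original order known for free:

```sql
-- Hypothetical fixed-width parameter string: four elements, 8 characters each.
-- Position alone recovers both the value and the original ordinal.
DECLARE @fixed VARCHAR(8000) = '00000009000000060000000300000001'; -- 9, 6, 3, 1

SELECT  rn    = t.N,
        Value = SUBSTRING(@fixed, (t.N - 1) * 8 + 1, 8)
FROM    (VALUES (1),(2),(3),(4)) t (N); -- a real tally table would scale this
```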
December 10, 2018 at 9:40 am
Eirikur Eiriksson - Monday, December 10, 2018 9:09 AM
AB's article isn't bad but it's incomplete and hasn't been appropriately updated.
😎
One thing AB missed entirely is the efficiency of fixed-width parameter strings, which blows the chops off any other method by a considerable margin (must finish that article)
Sorry, but I have to disagree there. I did try to coach Aaron publicly and privately and, instead, he simply stated he wasn't going to do any additional testing and closed the article to further comment. Anyone reading it can easily come to the same wrong conclusion that he did because of the "Holy Grail" nature of his article (an article that contains code to test functionality and performance, as well as charts that "prove" the findings, which are horribly wrong in that case). People end up drinking from the wrong cup and "dying" when they deploy such poor methods to a production environment. Even worse, his article didn't take into account any special characters, and the XML code he published causes a failure there.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 10, 2018 at 12:15 pm
Eirikur Eiriksson - Monday, December 10, 2018 9:09 AM
Jeff Moden - Monday, December 10, 2018 8:24 AM
If you look at Aaron Bertrand's article, he comes to the conclusion that the XML method for splitting is as good as the DelimitedSplit8K method. Nothing could be further from the truth. The problem is with the "Devils in the Data" (I've gotta finish the articles on that subject for you soon). He creates "grooved" test data that has a cardinality of "1". That seriously skews the test results because the XML method can take advantage of such low cardinalities for a large set. But, if you have a table of, say, 10,000 rows with a cardinality of 10,000, the XML splitter method becomes a huge performance issue.
AB's article isn't bad but it's incomplete and hasn't been appropriately updated.
😎
One thing AB missed entirely is the efficiency of fixed-width parameter strings, which blows the chops off any other method by a considerable margin (must finish that article)
If you consider the conclusions reached because of the incorrect test data used, it's bad.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 10, 2018 at 5:08 pm
Eirikur Eiriksson - Monday, December 10, 2018 9:05 AM
Steve Jones - SSC Editor - Monday, December 10, 2018 7:03 AM
Eirikur Eiriksson - Sunday, December 9, 2018 11:36 PM
Thanks for the write-up Steve.
😎
One thing, there are performance references but no mention of the fastest non-CLR string splitting function (on SQL Server 2012 and later).
Added a link
Thanks Steve, don't think there are too many (poor buggers) on 2008 or earlier,
😎
Our shop is still one of the few "poor buggers" on 2008 R2. (Sigh.) Working on management for a while now...
Just sharing. 😉 This has been a really helpful article and discussion. Thank you all. I'm thinking positively that we will upgrade soon, and I will be able to start using all the new, helpful functionality.
December 10, 2018 at 10:39 pm
Here's an idea that I've been kicking around for a little while now but haven't had the chance to torture test yet...
It's one of those "not guaranteed to work every time... but... seems to work every time" kind of deals.
The idea is simple. Just encapsulate STRING_SPLIT with ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), so that the ROW_NUMBER() is able to register the rows as they come out of STRING_SPLIT, before any other part of the query can affect the ordering.
Like I said, I haven't had a chance to really go nuts trying to break it, but it has stood up to the few things I have tried. I figured I'd throw it out and see if anyone wants to take a crack at proving that an idea that shouldn't work, doesn't work... (because so far it does)
The original idea...
SET QUOTED_IDENTIFIER ON;
GO
SET ANSI_NULLS ON;
GO
CREATE OR ALTER FUNCTION dbo.SPLIT_STRING_RNv1
/* ===============================================================================
12/10/2018 JL, Created: Adds a ROW_NUMBER() to the built in STRING_SPLIT function.
IMPORTANT! From the STRING_SPLIT BOL: "The order is not guaranteed to match the
order of the substrings in the input string."
Just because this trick works "most of the time", it is not guaranteed to
work every time. DO NOT use it in a business-critical situation where the
original order is critical.
=============================================================================== */
--===== Define I/O parameters
(
@string NVARCHAR(MAX),
@separator NCHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
SELECT
rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
ss.[Value]
FROM
STRING_SPLIT(@string, @separator) ss;
GO
This one was more of an odd-nut idea that popped into my head after recalling that using TOP w/ ORDER BY forces an internal sort to handle the TOP(N) criteria, even when N is large enough to handle any result set. I have absolutely no idea if it's doing anything at all... I haven't been able to get the 1st one to screw up to see if this one does any better...
SET QUOTED_IDENTIFIER ON;
GO
SET ANSI_NULLS ON;
GO
CREATE OR ALTER FUNCTION dbo.SPLIT_STRING_RNv2
/* ===============================================================================
12/10/2018 JL, Created: Adds a ROW_NUMBER() to the built in STRING_SPLIT function.
This is an alternate version that adds a TOP & ORDER BY to the query
in order to see if it behaves differently.
IMPORTANT! From the STRING_SPLIT BOL: "The order is not guaranteed to match the
order of the substrings in the input string."
Just because this trick works "most of the time", it is not guaranteed to
work every time. DO NOT use it in a business-critical situation where the
original order is critical.
=============================================================================== */
--===== Define I/O parameters
(
@string NVARCHAR(MAX),
@separator NCHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
SELECT TOP (2147483647)
rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
ss.[Value]
FROM
STRING_SPLIT(@string, @separator) ss
ORDER BY
rn;
GO
This 3rd option is based on a good idea from Aaron Bertrand. It's guaranteed to work as long as your input string is composed of unique values and you're willing to eat the resulting sort operation.
It works by using CHARINDEX to find each Value's position in the original input string...
If you want to see the conversation that prompted me to actually start testing the idea (or read Aaron's thoughts on the idea [spoiler: not a fan]), here's a link: https://stackoverflow.com/questions/53562331/selecting-data-against-numeric-values-saved-as-comma-separated-string/53563089#53563089
SET QUOTED_IDENTIFIER ON;
GO
SET ANSI_NULLS ON;
GO
CREATE FUNCTION dbo.SPLIT_STRING_RNvAB
/* ===============================================================================
12/10/2018 JL, Created: Adds a ROW_NUMBER() to the built in STRING_SPLIT function.
Based on Aaron Bertrand's idea of using CHARINDEX to find the item's location
in the original string.
Note: Array items must be DISTINCT for this approach to work properly.
=============================================================================== */
--===== Define I/O parameters
(
@string NVARCHAR(MAX),
@separator NCHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
SELECT
rn = ROW_NUMBER() OVER (ORDER BY value_position),
ss.[Value]
FROM
STRING_SPLIT(@string, @separator) ss
CROSS APPLY ( VALUES (CHARINDEX(@separator + ss.Value + @separator, @separator + @string + @separator)) ) vp (value_position);
GO
Here's a simple test script I threw together...
CREATE TABLE #TestData (Num INT NOT NULL PRIMARY KEY); -- Adding the clustered key to impose a "natural sort" that causes the stock
INSERT #TestData (Num) VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9); --STRING_SPLIT to not display in the order of the original string.
DECLARE @string VARCHAR(8000) = '9,6,3,1,4,8,5,6,1,0,2';
-- original STRING_SPLIT function...
SELECT
td.Num,
ss.value
FROM
STRING_SPLIT(@string, ',') ss
JOIN #TestData td
ON CONVERT(INT, ss.Value) = td.Num;
-- ROW_NUMBER() alone
SELECT
td.Num,
ssr.rn,
ssr.Value
FROM
dbo.SPLIT_STRING_RNv1(@string, ',') ssr
JOIN #TestData td
ON CONVERT(INT, ssr.Value) = td.Num;
-- ROW_NUMBER() w/ TOP & ORDER BY
SELECT
td.Num,
ssr.rn,
ssr.Value
FROM
dbo.SPLIT_STRING_RNv2(@string, ',') ssr
JOIN #TestData td
ON CONVERT(INT, ssr.Value) = td.Num;
-- ROW_NUMBER() with order set by CHARINDEX
SELECT
td.Num,
ssr.rn,
ssr.Value
FROM
dbo.SPLIT_STRING_RNvAB(@string, ',') ssr
JOIN #TestData td
ON CONVERT(INT, ssr.Value) = td.Num;
DROP TABLE #TestData;
December 10, 2018 at 11:58 pm
Lots of people have been doing that trick with ROW_NUMBER(). Thinking about the code behind the scenes, it should be guaranteed, but MS refuses to state or even imply that. It's really a shame they didn't take the time to return the element ordinal.
--Jeff Moden
Change is inevitable... Change for the better is not.
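For anyone landing on this thread later: Microsoft eventually did exactly that. On SQL Server 2022 and Azure SQL Database, STRING_SPLIT accepts an optional third argument, enable_ordinal; passing 1 adds an ordinal column with each element's position in the input string, which makes the workarounds above unnecessary on those versions:

```sql
-- Requires SQL Server 2022 (or Azure SQL Database); fails on earlier versions.
SELECT   ordinal, value
FROM     STRING_SPLIT('9,6,3,1,4,8,5,6,1,0,2', ',', 1)
ORDER BY ordinal; -- ordinal is the element's position in the input string
```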