I’ve been developing T-SQL for about 20 years now and almost since day one there has been functionality that , IMO, has been missing. That is grouped string concatenation (ie forming a delimited string ) and string split ( breaking up a delimited string). When I say missing, I really mean some form of succinct functionality that can do it and not some form of hideous kludge that can cause a religious argument between various members of the community about why their method is better than any one else’s
My hopes were not high for SQL Server 2016 fixing this either when this article https://blogs.msdn.microsoft.com/sqlserverstorageengine/2015/09/08/passing-arrays-to-t-sql-procedures-as-json/ was published. I’ve never been convinced by complex (ie XML) datatypes within the relational database, there are better ways if that is what you have to do, but , <sarcasm> yippee </sarcasm>, with JSON baked in here is another method of string splitting rather than doing it properly. Honestly, it really does make me want to cry.
But, with the release of SQL Server RC0, snuck into the release notes was
Woohoo, a STRING_SPLIT function, finally at last, lets break out the bunting.
So, lets have a quick investigation into how to use it and are there any caveats around it.
I wont repeat the official documentation https://msdn.microsoft.com/en-gb/library/mt684588.aspx but to recap, using it is really simple STRING_SPLIT(string,delimiter) where string is the delimited string to split and delimiter to split that up on. Notably, delimiter can only be a single char, that’s not a deal breaker by any means, but could be an annoyance.
Right, I’m now assuming that you are familiar with the official docs, so is there anything else to know about this that function before we go forward and immediately change all our previous kludges with this new function.
Firstly anyone that knows me will know that I’m interested in estimations ( and the wider optimization area) , so the first thing is , how many rows are estimated to be returned from the TVF ?
Here is a script to test this :
Create Table Strings
(
String varchar(max)
)
go
insert into Strings values('one,two,three')
go
Select *
from Strings
Cross Apply string_split(strings.String,',')
The plan is as you would expect from a CLR TVF, in fact I suspect that this is just an CLR TVF implementation but i’m not gonna state that as such.
and the detail for the TVF ?
50, the estimate of the rows outputted is 50. After a number of tests this hasn’t changed. Inline with similar TVF functionality, this isn’t surprising news, but i was curious as to what the number was.
Secondly, as I’m suspecting that this is just CLR, then i’ve found in the past that the optimizer plays ‘safe’ and as the TVF functions are opaque to the optimizer, then Halloween protection is implemented to protect against the possibility that the same data is being read as is being updated.
Does this happen here ? Test code is:
Create Table Strings
(
String varchar(max)
)
go
Create Table Dest
(
String varchar(max)
)
go
insert into Strings values('one,two,three')
go
Insert into Dest(String)
Select Value
from Strings
Cross Apply string_split(strings.String,',')
Notice that i’m inserting to a different table.
And….
Awesome, a table spool, so the same does apply here, we have Halloween protection, on large resultsets that spooling is going to hurt.
If you need to insert the result somewhere, then SELECT INTO wont need this as it knows you are creating a new table so there is no chance of a double update. But, if you SELECT INTO a table and then INSERT those to the table you intended to in the first place, then that is effectively the same thing as a spool so more than likely not worth the effort.
Finally to finish this blog off, here is a query with a parallel plan :
SELECT sod.SalesOrderDetailID,sod.ModifiedDate,p.Style
FROM Sales.SalesOrderDetail sod
INNER JOIN Production.Product p ON sod.ProductID = p.ProductID
--cross apply string_split(style,',')
ORDER BY Style
Here is a query with a serial plan :
SELECT sod.SalesOrderDetailID,sod.ModifiedDate,p.Style
FROM Sales.SalesOrderDetail sod
INNER JOIN Production.Product p ON sod.ProductID = p.ProductID
cross apply string_split(style,',')
ORDER BY Style
Right <sigh> , so it inhibits a parallel plan being built, sheesh. Also FYI, there is no NonParallelPlanReason specified in the plan.
That’ll do for now, but i will be running some performance tests soon to see if it is worth the trouble.
UPDATE : 2016-03-21 For performance tests, Aaron Bertrand has done a comprehensive write up.