September 25, 2014 at 6:50 pm
It is not a bug in SQL Server (the T-SQL code has a bug, of course).
There is always a temptation to read T-SQL as a sequence of steps, particularly where a string of CTEs is present, or there are multiple scopes e.g. a subquery inside an IN list. It is only natural to expect execution to follow the 'clearly intended' sequence of operations, but this is not how a declarative language like T-SQL works. In particular, scalar expressions may be evaluated earlier or later than expected, and they may even be executed more times than expected (if the expression is deterministic).
In the present case, an expression evaluated at a Filter performs an implicit conversion to uniqueidentifier as part of this test:
@tbl.[id] as [t1].[id]=CONVERT_IMPLICIT(uniqueidentifier,substring([@pString],CONVERT_IMPLICIT(int,[Expr1052]+(1),0),CONVERT_IMPLICIT(int,isnull(CONVERT_IMPLICIT(bigint,CASE WHEN charindex([@pDelimiter],[@pString],CONVERT_IMPLICIT(int,[Expr1052]+(1),0))=(0) THEN NULL ELSE charindex([@pDelimiter],[@pString],CONVERT_IMPLICIT(int,[Expr1052]+(1),0)) END,0)-([Expr1052]+(1)),(8000)),0)),0) AND substring([@pString],CONVERT_IMPLICIT(int,[Expr1052],0),(1))=[@pDelimiter]
Simplifying, the Filter predicate is essentially:
[id] = CONVERT(uniqueidentifier, [expression1])
AND [expression2] = [@pDelimiter]
Clearly, the right hand side of the AND is concerned with the current position in the string being a delimiter. There are no guarantees that the left hand side of the AND (with the convert to uniqueidentifier) will be evaluated before the right hand side, but it is quite likely. Where the current position is not a delimiter, the left hand side is unlikely to be convertible to uniqueidentifier.
There are ways to 'improve' the split function (e.g. using CASE to get some guarantees about expression evaluation order) but it seems much simpler to use an intermediate table or TRY_CONVERT, if available.
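For example, something along these lines (a rough sketch only, reusing @tbl, @pString and @pDelimiter from the problem as posted; TRY_CONVERT needs SQL Server 2012 or later):
-- Materialize the split results first, then compare. TRY_CONVERT returns
-- NULL instead of raising an error for anything that is not a valid GUID,
-- so the join cannot fail however the predicate is evaluated.
DECLARE @Split TABLE (Item varchar(8000) NULL);
INSERT @Split (Item)
SELECT Item
FROM dbo.DelimitedSplit8K(@pString, @pDelimiter);
SELECT t.id
FROM @tbl AS t
JOIN @Split AS s
ON t.id = TRY_CONVERT(uniqueidentifier, s.Item);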
There are arguments to be made about optimization not introducing type-conversion errors that logically could not happen in the original T-SQL, but this is not a clear-cut case of that. In any event, that's not how SQL Server works today.
September 26, 2014 at 12:18 am
stefan.mulder (9/24/2014)
Great stuff! But as to Jeff's notes on collation: we do need a function that can handle compound delimiters, e.g. [|;] or [","] as in David Data's case of ["John Smith","23, The High Street","Sometown"].
This is not so difficult. Just change the type of @pDelimiter to varchar(3) (I don't think you'll ever need a delimiter longer than that) and change the definition of cteStart(N1) to:
cteStart(N1) AS (--==== This returns N+DATALENGTH(@pDelimiter)
SELECT 1 UNION ALL
SELECT t.N+DATALENGTH(@pDelimiter)
FROM cteTally t
WHERE SUBSTRING(@pString,t.N,DATALENGTH(@pDelimiter)) = @pDelimiter
),
Now we can solve David Data's case by using [","] as the delimiter and removing the initial and final double quotes from the items returned, since such quotes cannot appear within the quoted strings:
select ItemNumber, REPLACE(Item, '"', '')
from dbo.DelimitedSplit8k('"John Smith","23, The High Street","Sometown"','","')
There's no question that you can modify the code to handle multi-character delimiters. The problem in the example given is that you (and you said this) actually have two sets of delimiters... the double quote at each end of the string and the 3-character intermediate delimiter of ",". The big problem with that is that you'll also need to include the offset for 3-character delimiters in the other substrings so that you're not including the 2nd and 3rd characters of the delimiter in the final returned ITEM elements. That's all going to take extra time and will punish-for-performance those applications that require only single-delimiter splits.
You could write a separate function in T-SQL to do such a thing so as not to unnecessarily slow down single-delimiter splits, but it would be better to write a decent CLR for such things as multi-character delimiters and "CSV Splits", or convince MS that they need to finally include a decent splitter function as a part of T-SQL, without it being in the form of an add-in such as the ACE drivers.
By the same token, you could do an extra step prior to using the splitter: replace the 3-character delimiters with a single "odd" delimiter not likely to be used in the text, strip off the leading and trailing quotes, and then use the single-character splitter to do the job (although I'm still in favor of using a well-written CLR or a future MS-provided built-in function).
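For example (just a sketch, using David Data's string and CHAR(7) as the "odd" delimiter):
-- Strip the leading and trailing double quotes, swap the 3-character
-- delimiter for a single unlikely character, then split as usual.
DECLARE @Raw VARCHAR(8000) = '"John Smith","23, The High Street","Sometown"';
DECLARE @Work VARCHAR(8000);
SELECT @Work = SUBSTRING(@Raw, 2, LEN(@Raw)-2); -- drop the outer quotes
SELECT @Work = REPLACE(@Work, '","', CHAR(7));  -- 3-character delimiter becomes 1 character
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K(@Work, CHAR(7));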
--Jeff Moden
Change is inevitable... Change for the better is not.
September 26, 2014 at 6:39 am
Regarding the conversion error issue that I posted yesterday:
Tom, I tested the query on SQL 2008 R2 SP2 and SQL 2012 SP1, with the same results.
Paul, your explanation, along with Erland's Connect item, makes perfect sense. I would classify it as a "bug" as well, but I am very relieved now that I understand the reason for this behavior. Thank you very much for taking the time and effort to reply and for pointing me to the Connect item.
September 26, 2014 at 8:13 am
Paul White (9/25/2014)
It is not a bug in SQL Server (the T-SQL code has a bug, of course). There is always a temptation to read T-SQL as a sequence of steps, particularly where a string of CTEs is present, or there are multiple scopes e.g. a subquery inside an IN list. It is only natural to expect execution to follow the 'clearly intended' sequence of operations, but this is not how a declarative language like T-SQL works. In particular, scalar expressions may be evaluated earlier or later than expected, and they may even be executed more times than expected (if the expression is deterministic).
The original motivation for declarative languages was to make expressions simpler and clearer by keeping the program text free of irrelevancies like control flow, compiler internals, and optimization; later on, some schools of thought added further restrictions (such as the "no side effects" rule, which effectively bans state-oriented programming). The deification of the optimiser has ensured that T-SQL, already not a declarative language in the later sense, is not declarative in any sense. Besides, any programming language needs comprehensible semantics, whether denotational or operational or both, and because of this approach to the meaning of program text (deciding to declare the absence of any meaning) T-SQL has neither.
As for "sequence of steps", I don't care whether there are steps or everything is jumbled together as long as the system obeys the simple logical rule that application of a function to an argument can not take place in such a manner as to apply the function inetead to something that is not actually the argument (a rule that is imposed even in applicative language variants where laziness or eagerness can be determined by the system at run time, which is about as far from a sequence of steps as you can get); what's happening here is a clear breach of that rule.
There are arguments to be made about optimization not introducing type-conversion errors that logically could not happen in the original T-SQL, but this is not a clear-cut case of that. In any event, that's not how SQL Server works today.
Thanks for the reference to Erland's Connect item :cool:. I've upvoted it :-). It's nice to see that MS intends to address this some time in the future (though four and a half years have already passed since they said that).
The case raised by sfrostx is one where a function (whether explicit CAST, explicit CONVERT, or implicit convert; all three have the same effect) is applied to something other than its argument. An argument that this is not a clear-cut case of being illogical verges on farce. In fact it is exactly the same problem as in Erland's second example in his Connect item, where an expression defining a derived table is used by another expression before it has been evaluated (in Erland's case it's a CTE; in sfrostx's case it's the result of a table-valued function). And while substituting TRY_CAST for CAST is currently a workaround in sfrostx's case, there is no guarantee at all that it will still be a workaround in the next release, or even after the next service pack or the next cumulative update, since the "logic" that justifies feeding something to CAST that should not be fed to it clearly also justifies the same thing for TRY_CAST. Besides, can we be sure that TRY_CAST will solve this for all possible GUID values even in SQL 2014?
Tom
September 26, 2014 at 1:22 pm
I do apologise for bringing "farce" to this discussion :pinch:. I know there are strong views on both sides about what is and isn't legal for the optimizer to do. I hope I added some value in explaining what happens and why.
September 26, 2014 at 6:03 pm
Paul White (9/26/2014)
I do apologise for bringing "farce" to this discussion :pinch:. I know there are strong views on both sides about what is and isn't legal for the optimizer to do. I hope I added some value in explaining what happens and why.
As usual Paul, your comments are valuable and insightful. The fact that we disagree on what is and what isn't declarative and on what is and what isn't reasonable semantics for an SQL dialect doesn't detract at all from my appreciation of that.
Tom
October 7, 2014 at 3:28 pm
Hello Jeff, Paul, and everyone else involved in this discussion.
I hope you are still following up on this. Great article indeed.
I took the liberty of making a minor change to your split function, and I think I managed to achieve a performance improvement proportional to the size of each @pString.
CREATE FUNCTION [dbo].[DelimitedSplit8K]
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b),
E4(N) AS (SELECT 1 FROM E2 a, E2 b),
cteTally(N) AS (
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(
SELECT s.N1,
ISNULL(lead(s.N1) over (order by s.N1)-s.N1-1,8000) -- REPLACED CHARINDEX OVER @PSTRING WITH LEAD
FROM cteStart s
)
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
As we already have a list of starting positions for each element, we can assume that the length of each element will be the starting position of the NEXT element, minus its own starting position, minus the length of the delimiter. We can easily obtain the next element's starting position by using the LEAD function and ordering by the starting position value. This also eliminates the need for the NULLIF function, because when there is no lead value, NULL is returned.
By replacing the CHARINDEX call over @pString with LEAD for the length calculation in the cteLen CTE, we avoid having to scan the whole @pString again (that work was already done once to create the list of starting positions).
I was able to obtain a 22% performance improvement in my environment with this change 😉
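The call pattern is unchanged, for example:
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('abc,de,fgh', ',');
-- returns (1, 'abc'), (2, 'de'), (3, 'fgh')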
October 7, 2014 at 3:49 pm
samirabrahao1 60347 (10/7/2014)
Hello Jeff, Paul, and everyone else involved in this discussion. I hope you are still following up on this. Great article indeed.
[modified function and explanation snipped; see the post above]
I was able to obtain a 22% performance improvement in my environment with this change 😉
Eirikur Eiriksson discusses how to use LEAD to make DelimitedSplit8K faster in this very excellent article:
Reaping the benefits of the Window functions in T-SQL
-- Itzik Ben-Gan 2001
November 20, 2014 at 12:34 am
Thanks for the article!
If I may add another solution for splitting a list of values into a table, here is a short function that does the trick:
CREATE FUNCTION FN_LIST_TO_TABLE (@List varchar(max), @Delimiter varchar(1))
RETURNS TABLE AS RETURN
WITH MyXml (MyX) AS
(
-- Build one XML document from the list. Note: this CONVERT fails if @List
-- contains unescaped XML-special characters (&, <, >).
SELECT CONVERT(xml, '<Row><a>' + REPLACE(@List, @Delimiter, '</a><a>') + '</a></Row>')
)
SELECT t.x.value('.', 'varchar(max)') Item
FROM MyXml a
CROSS APPLY MyX.nodes('/Row/a') t(x)
How to use:
SELECT * FROM FN_LIST_TO_TABLE ('1,2,3,4,5' , ',')
November 20, 2014 at 8:07 am
hkravitz (11/20/2014)
Thanks for the article! If I may add another solution for splitting a list of values into a table, here is a short function that does the trick:
[function code snipped; see the post above]
Thanks for your post. To be sure, that's one of the methods covered in the article and it's comparatively pretty slow because of the REPLACE. If you pass in XML as the parameter instead of using CSV, then shredding XML is very fast but if you have to convert a CSV to XML, then it's pretty slow.
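Just to show the shape of what I mean by passing XML in directly (a quick sketch, not code from the article):
-- If the caller can build the XML to begin with, there's no REPLACE over
-- a CSV at all... shredding the parameter directly is the fast part.
DECLARE @List XML = '<Row><a>1</a><a>2</a><a>3</a><a>4</a><a>5</a></Row>';
SELECT Item = t.x.value('.', 'VARCHAR(8000)')
FROM @List.nodes('/Row/a') AS t(x);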
Here's the chart from the article that shows just how slow the XML-based splitter is compared to other methods... it's about as slow as the two While-loop methods and is actually worse at the lower end of that chart.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 20, 2014 at 9:35 am
I would suggest using a different element name than "a", since that is used for links and anchors.
November 20, 2014 at 7:56 pm
Another unfortunate aspect of the XML splitter (aside from barfing on certain characters, and being intellectually offensive ha ha) is that it may perform much, much worse than you would expect on longer strings...
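For example (a quick illustration against the function posted above; one bare ampersand is enough to kill it):
SELECT * FROM FN_LIST_TO_TABLE('Smith & Wesson,Ruger', ',');
-- Fails with an XML parsing error: '&' must appear as '&amp;' before
-- the string can be converted to XML.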
November 23, 2014 at 9:36 am
TomThomson (9/25/2014)
sfrostx (9/25/2014)
Can anyone help me understand why the second and third selects in the following generate an error?
I don't understand it at all. An interesting twist is that whether the failing select produces a row before producing its error message depends on the order in which the rows were inserted into @tbl. The conversion fails for the string beginning 7 and also for the string beginning C when the value it is to produce is part of an IN list, but the same conversion works in that context for the string beginning 3. However, I can't see any way of making the conversion fail for either value except when operating inside an IN list. There's nothing special about the GUIDs; all three are version 4 GUIDs (assuming that MS still uses standard quartet order for the text representation of its GUIDs, which it certainly used to do). The two that don't work are MS GUID version 4 standard. The one that does work is MS GUID version 4 for NCS backward compatibility, and I can't see MS treating that as a particularly privileged sort of GUID. So to me it looks like a bug, but maybe someone who knows more about how MS in general and SQL Server in particular treat GUIDs will come up with a non-bug explanation. If you don't get a better response from someone, maybe you should raise a Connect item for it.
I haven't played with this on any SQL Server version other than 2014 - is that the version you hit the problem on?
edit: I see Paul has a different, probably more useful, answer. As I read his answer, this definitely is a bug. But MS would deny that because the optimiser is a sacred cow. It's certainly interesting that TRY_CONVERT and TRY_CAST work when neither CONVERT nor CAST does, so at least there's a workaround.
Aren't the operations you are arguing about, or stating are a bug, COMMUTATIVE in nature? If so, it is illogical (in every sense of the word) to state that they cannot be reordered as the optimizer sees fit.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 28, 2014 at 3:34 am
TheSQLGuru (11/23/2014)
[earlier exchange snipped; quoted in full above]
Aren't the operations you are arguing about, or stating are a bug, COMMUTATIVE in nature? If so, it is illogical (in every sense of the word) to state that they cannot be reordered as the optimizer sees fit.
One of the two functions in question is CAST, so you seem to be suggesting that (for example) CAST(F(X)) is identical in all respects to F(CAST(X)), where you can see what F is by looking at the code that demonstrates the bizarre behaviour. That's what it means to say that the functions CAST and F are commutative. What's happening here is that the optimizer is presented with CAST(F(X)) for some expression X and function F, and decides to call CAST(X) before it thinks about applying F. But F and CAST are not commutative, so the optimizer should not do that. MS have admitted that in response to Erland's Connect item, but have done nothing about it (years after they said they intended to fix it).
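A toy illustration of the difference (not sfrostx's exact repro, just the shape of it):
-- CAST(F(X)): strip the prefix first, then convert; this succeeds.
DECLARE @s varchar(40) = 'ID:0E984725-C51C-4BF4-9960-E1C80E27ABA0';
SELECT CAST(SUBSTRING(@s, 4, 36) AS uniqueidentifier);
-- The "reordered" evaluation amounts to converting the raw value instead,
-- which fails with a conversion error:
-- SELECT CAST(@s AS uniqueidentifier);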
Tom
January 2, 2015 at 1:26 pm
I need some help if someone is willing. I'm not a professional DBA, but with Google as your friend anything is possible. My issue is that I have a table with, say, 300 rows in it (could be up to 10,000 later on), and my deliverable is taking one column of string data and parsing it into 6 other columns on the same row in the same table. I have worked out the custom split I need using one variable and Jeff's Tally OH! code. My problem is that I haven't figured out the best way (or, for that matter, any way) to feed the splitter one field at a time from the current row and then parse and insert the other 6 columns on that same row; the rough shape I think I'm after is sketched just below, ahead of my current code.
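Roughly, the set-based shape I think I'm after is something like this (an untested sketch; TagReporting and the STD_ column names are from my real table, and the fixed ItemNumber positions are exactly what breaks down when a tag has a variable number of parts):
SELECT tr.TagName,
STD_SiteName = MAX(CASE WHEN s.ItemNumber = 1 THEN s.Item END),
STD_Parameter = MAX(CASE WHEN s.ItemNumber = 2 THEN s.Item END),
STD_AvgPer = MAX(CASE WHEN s.ItemNumber = 5 THEN s.Item END)
FROM dbo.TagReporting tr
CROSS APPLY dbo.DelimitedSplit8K(tr.TagName, '_') s
GROUP BY tr.TagName;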
Current custom parse code:
Declare @pDelimiter char(1),
@pString varchar(8000)
Set @pDelimiter='_'
--Set @pString=(SELECT [TagName] FROM [dbo].[TagReporting])
Set @pString='CS12_PrbTmp_L_DegF_1M'
;
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT
RN=ROW_NUMBER() OVER(ORDER BY N1)
,N1
,L1
--Original columns for tally table. Shows actual split
--ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
--Item = SUBSTRING(@pString, l.N1, l.L1)
INTO #temp1
FROM cteLen l
Declare
@N1SN int
,@L1SN int
,@N1PM int
,@L1PM int
,@N1PQ int
,@L1PQ int
,@N1QL int
,@L1QL int
,@N1UN int
,@L1UN int
,@N1AP int
,@L1AP int
,@RN int
SELECT
@RN=(select max(rn) from #temp1)
SELECT
@N1SN=N1
,@L1SN=L1
FROM #temp1
WHERE rn=1
SELECT
@N1PM=N1
,@L1PM=L1
FROM #temp1
WHERE rn=2
SELECT
@N1UN=N1
,@L1UN=L1
FROM #temp1
WHERE rn=@RN-1
SELECT
@N1AP=N1
,@L1AP=L1
FROM #temp1
WHERE rn=@RN
SELECT
@N1PQ=(select N1 from #temp1 where rn=2)
,@L1PQ=(@N1UN)-(select N1 from #temp1 where rn=2)-1
SELECT
@N1QL=(select N1 from #temp1 where rn=3)
,@L1QL=(@N1UN)-(select N1 from #temp1 where rn=3)-1
SET @N1QL=
CASE
WHEN @RN > 4 AND @L1QL < 7 THEN
@N1QL
ELSE NULL
END
SET @L1QL=
CASE
WHEN @RN > 4 AND @L1QL < 7 THEN
@L1QL
ELSE NULL
END
INSERT INTO #temp2
([STD_SiteName]
,[STD_Parameter]
,[STD_PrmQual]
,[STD_Qualifier]
,[STD_Units]
,[STD_AvgPer]
)
VALUES
( substring(@pString, @N1SN, @L1SN)
,substring(@pString, @N1PM, @L1PM)
,substring(@pString, @N1PQ, @L1PQ)
,substring(@pString, @N1QL, @L1QL)
,substring(@pString, @N1UN, @L1UN)
,substring(@pString, @N1AP, @L1AP)
)
select * from #temp2
drop table #temp1
--drop table #temp2
--create table #temp2
-- (
--[STD_SiteName] [varchar](20) NULL,
--[STD_Parameter] [varchar](16) NULL,
--[STD_Qualifier] [varchar](6) NULL,
----[STD_PQ] AS (([STD_Parameter]+'_')+[STD_Qualifier]) PERSISTED NOT NULL,
--[STD_PrmQual] [varchar](22) NULL,
--[STD_Units] [varchar](12) NULL,
--[STD_AvgPer] [varchar](6) NULL
--)
Another database table with additional data will replace the #temp2 table.
Thanks for any help.
Rich