January 2, 2015 at 1:38 pm
o324712 (1/2/2015)
I need some help if someone is willing. I'm not a professional DBA, but with Google as your friend anything is possible. My issue is that I have a table with, say, 300 rows in it (could be up to 10,000 later on), and my deliverable is taking one column of string data and parsing it into 6 other columns on the same row in the same table. I have worked out the custom split I need using one variable and Jeff's Tally OH! My problem is that I haven't figured out the best way (or, for that matter, any way) to feed the splitter one field at a time on the current row and then parse and insert the other 6 columns on that same row. Thanks for any help.
Rich
Can you give an example using actual or dummy data of what you have and what result you want? I don't think that you even need to use Jeff's split function if I correctly understand what you are looking for.
Jason
January 2, 2015 at 2:28 pm
This is the sample string I am using in the code to develop the custom split: SET @pString = 'CS12_PrbTmp_L_DegF_1M'.
I forgot to mention that I do have a PK column for ID in the final table.
Yeah, I do need his code because of the non-standard string construction; see the STD_PrmQual column in the sample below.
STD_SiteName  STD_Parameter  STD_Qualifier  STD_PrmQual  STD_Units  STD_AvgPer
CS12          PrbTmp         L              PrbTmp_L     DegF       1M
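For anyone picking this up from the archive, here is a minimal sketch of the usual CROSS APPLY cross-tab pattern for this kind of split-to-columns update. The table name dbo.SensorData and the source column SourceString are assumptions for illustration, not the actual schema from the thread; only the target column names and the sample string come from the posts above.
-- Minimal sketch; dbo.SensorData and SourceString are hypothetical names.
-- Pivot the splitter's row-per-item output back into columns with MAX(CASE),
-- then set all six target columns on the same row in one pass.
UPDATE sd
SET STD_SiteName  = ca.Part1,
    STD_Parameter = ca.Part2,
    STD_Qualifier = ca.Part3,
    STD_PrmQual   = ca.Part2 + '_' + ca.Part3,  -- the non-standard combined value
    STD_Units     = ca.Part4,
    STD_AvgPer    = ca.Part5
FROM dbo.SensorData sd
CROSS APPLY
(
    SELECT Part1 = MAX(CASE WHEN split.ItemNumber = 1 THEN split.Item END),
           Part2 = MAX(CASE WHEN split.ItemNumber = 2 THEN split.Item END),
           Part3 = MAX(CASE WHEN split.ItemNumber = 3 THEN split.Item END),
           Part4 = MAX(CASE WHEN split.ItemNumber = 4 THEN split.Item END),
           Part5 = MAX(CASE WHEN split.ItemNumber = 5 THEN split.Item END)
    FROM dbo.DelimitedSplit8K(sd.SourceString, '_') split
) ca;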
January 2, 2015 at 2:31 pm
o324712 (1/2/2015)
This is the sample string I am using in the code to develop the custom split: SET @pString = 'CS12_PrbTmp_L_DegF_1M'. I forgot to mention that I do have a PK column for ID in the final table.
STD_SiteName  STD_Parameter  STD_Qualifier  STD_PrmQual  STD_Units  STD_AvgPer
CS12          PrbTmp         L              PrbTmp_L     DegF       1M
Best suggestion: please start a new thread for your problem instead of using the discussion thread for the article. It will make things easier for you, for those who help, and for those who read it later on.
January 2, 2015 at 2:48 pm
Sorry, I will; I just assumed this would be the correct place.
Thanks,
Rich
January 2, 2015 at 3:14 pm
o324712 (1/2/2015)
Sorry, I will; I just assumed this would be the correct place. Thanks,
Rich
For those looking for this new thread, here it is: http://www.sqlservercentral.com/Forums/Topic1647953-3077-1.aspx
January 21, 2015 at 1:08 pm
I've been working on performance tuning and optimization of a few special-purpose functions derived from the DelimitedSplit8K function, and I wanted to share some of the findings. As these functions use the same string iteration/parsing part as the 2012 LEAD version of DelimitedSplit8K, it is the perfect test bed.
The two variables in this test are the cardinality of the seeding CTE for the inline tally table (hence different numbers of cross joins) and the choice between 1 and NULL as the value of the seed elements.
Although some of this has been covered on this and related threads before, I see no harm in posting this summary, nor do I know of a better place to do so. ;-)
😎
Naming convention:
[Function Name][Number of seed elements][X = NULL elements], e.g. DelimitedSplitL8K90X has 90 seed elements of NULL value and therefore needs only one cross join, as 90^2 = 8100 >= 8000.
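To make the naming concrete, here is a sketch of the shape of such a seeding CTE, shown as the 10-element NULL variant (four instances cross joined, since 10^4 = 10,000 covers the 8,000-character range); the 90-element variant needs only one cross join, and the non-X variants seed with 1 instead of NULL. This is the general pattern being compared, not the exact code of the tested functions.
WITH E1(N) AS
(
    -- 10 seed elements of NULL value (the "X" variants); the plain variants
    -- use SELECT 1 here instead.
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL
    SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL
    SELECT NULL
),
Tally(N) AS
(
    -- 10^4 = 10,000 candidate rows; TOP caps the tally at the 8K string length.
    SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c CROSS JOIN E1 d
)
SELECT N FROM Tally;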
Results (three runs of 10^6 rows; durations in microseconds)
FUNC_NAME DURATION
----------------------- ----------
Dry run 202801
DelimitedSplitL8KX 27783649
DelimitedSplitL8K 27814849
DelimitedSplitL8K8000X 29406051
DelimitedSplitL8K8000 29593252
DelimitedSplitL8K90X 30076853
DelimitedSplitL8K90 30092453
DelimitedSplitL8K20X 30279653
DelimitedSplitL8K20 30435653
DelimitedSplitL8K10X 30685254
DelimitedSplitL8K10 30810054
FUNC_NAME DURATION
----------------------- ----------
Dry run 202800
DelimitedSplitL8KX 27705649
DelimitedSplitL8K 27861649
DelimitedSplitL8K8000X 29406052
DelimitedSplitL8K8000 29452851
DelimitedSplitL8K90 29952053
DelimitedSplitL8K90X 30045653
DelimitedSplitL8K20X 30388853
DelimitedSplitL8K20 30420053
DelimitedSplitL8K10X 30591654
DelimitedSplitL8K10 30638454
FUNC_NAME DURATION
----------------------- ----------
Dry run 202800
DelimitedSplitL8KX 27861649
DelimitedSplitL8K 27955249
DelimitedSplitL8K8000X 29390451
DelimitedSplitL8K8000 29515252
DelimitedSplitL8K90X 29905253
DelimitedSplitL8K90 30123653
DelimitedSplitL8K20 30232853
DelimitedSplitL8K20X 30310853
DelimitedSplitL8K10 30654054
DelimitedSplitL8K10X 30747654
Edit: typo
March 10, 2015 at 12:15 am
Whoohoo! Finally a chance to contribute to this thread. Possibly. Someone please check me on this.
Below is a rather simplistic test harness that assumes we have a fair number of strings that contain no delimiter and so ultimately don't need to be split. Some 10% of the string data is actually NULL.
DECLARE @Delim CHAR(1) = ',';
CREATE TABLE #SampleData
(
s VARCHAR(8000)
);
WITH Tally (n) AS
(
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO #SampleData (s)
SELECT CASE
-- 70% of the strings contain no delimiter (no split required)
WHEN rn BETWEEN 11 AND 80
THEN REPLICATE('abcd', ABS(CHECKSUM(NEWID()))%2000)
-- 20% of the strings contain a delimiter (split required)
WHEN rn BETWEEN 81 AND 100
THEN REPLICATE('abc,', ABS(CHECKSUM(NEWID()))%2000)
-- Remaining 10% of the strings are NULL
END
FROM (SELECT rn=1+ABS(CHECKSUM(NEWID()))%100) a
CROSS JOIN Tally b;
--SELECT *
--FROM #SampleData;
PRINT 'Don''t split strings that don''t need splitting';
SET STATISTICS TIME ON;
SELECT s, ItemNumber, Item
INTO #Test1
FROM #SampleData a
CROSS APPLY
(
SELECT Switch=CASE WHEN s IS NULL OR CHARINDEX(@Delim, s) = 0 THEN 0 ELSE 1 END
) b
CROSS APPLY
(
SELECT ItemNumber=1, Item=s
WHERE Switch = 0
UNION ALL
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K(a.s, @Delim) c
WHERE Switch <> 0
) c;
SET STATISTICS TIME OFF;
PRINT 'Run DelimitedSplit8K on all strings if they need splitting or not';
SET STATISTICS TIME ON;
SELECT s, ItemNumber, Item
INTO #Test2
FROM #SampleData a
CROSS APPLY dbo.DelimitedSplit8K(a.s, @Delim) b;
SET STATISTICS TIME OFF;
GO
DROP TABLE #SampleData;
DROP TABLE #Test1;
DROP TABLE #Test2;
My results (running on SQL 2012) are:
Don't split strings that don't need splitting
SQL Server Execution Times:
CPU time = 6755 ms, elapsed time = 17515 ms.
(200422 row(s) affected)
Run DelimitedSplit8K on all strings if they need splitting or not
SQL Server Execution Times:
CPU time = 9952 ms, elapsed time = 19045 ms.
(200422 row(s) affected)
As you can see, what I did was create a "Switch" that partitions the set into what needs splitting and what does not (the first CROSS APPLY). Then, in the second CROSS APPLY, I partition the set, calling DelimitedSplit8K only where a delimiter is present.
I stumbled across this rather interesting approach when I had a much more compute-intensive, schema-bound iTVF inside an OUTER APPLY. In that case, I was able to reduce elapsed time by more than 90% by short-circuiting the iTVF call using a similarly devised switch and substituting equivalent data where I could.
I thought I'd give it a shot with DelimitedSplit8K because that is known to be a very high-performance splitter (note the version I'm using may not be the latest, and is definitely not Eirikur's SQL 2012 version). I'm not trying to suggest I've got a faster way to split strings. I am suggesting I may have a way to improve a query's performance if it uses an iTVF and the workload can be partitioned into cases where you do and don't need to call it.
Most likely the value is limited where the iTVF is highly optimized, but where it is pretty compute-intensive, this could be quite useful.
And excuse me if someone has already covered this elsewhere in the discussion thread. That's a bit long to review today.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
March 10, 2015 at 12:37 am
Very interesting, Dwain, and it works perfectly in your test harness. Maybe I need more coffee this morning, but when I plug this into my normal harness the results are quite different. Can you see what I'm doing wrong?
😎
USE tempdb;
GO
SET NOCOUNT ON;
DECLARE @Delim CHAR(1) = ',';
CREATE TABLE #SampleData
(
s VARCHAR(8000)
);
WITH Tally (n) AS
(
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO #SampleData (s)
SELECT CASE
-- 70% of the strings contain no delimiter (no split required)
WHEN rn BETWEEN 11 AND 80
THEN REPLICATE('abcd', ABS(CHECKSUM(NEWID()))%2000)
-- 20% of the strings contain a delimiter (split required)
WHEN rn BETWEEN 81 AND 100
THEN REPLICATE('abc,', ABS(CHECKSUM(NEWID()))%2000)
-- Remaining 10% of the strings are NULL
END
FROM (SELECT rn=1+ABS(CHECKSUM(NEWID()))%100) a
CROSS JOIN Tally b;
--SELECT *
--FROM #SampleData;
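-- Harness notes: each timed statement is bracketed by two inserts of the same
-- label into @timer, and DATEDIFF(MICROSECOND, MIN, MAX) per label yields the
-- duration. The @*_BUCKET variables are sinks that consume every row without
-- returning a result set, keeping display I/O out of the timings.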
DECLARE @timer TABLE (T_TEXT VARCHAR(100) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));
DECLARE @INT_BUCKET INT = 0;
DECLARE @CHR_BUCKET VARCHAR(2000) = '';
DECLARE @STR_BUCKET VARCHAR(2000) = '';
INSERT INTO @timer(T_TEXT) VALUES ('Don''t split strings that don''t need splitting')
SELECT
@STR_BUCKET = s
,@INT_BUCKET = ItemNumber
,@CHR_BUCKET = Item
FROM #SampleData a
CROSS APPLY
(
SELECT Switch=CASE WHEN s IS NULL OR CHARINDEX(@Delim, s) = 0 THEN 0 ELSE 1 END
) b
CROSS APPLY
(
SELECT ItemNumber=1, Item=s
WHERE Switch = 0
UNION ALL
SELECT ItemNumber, Item
FROM dbo.DelimitedSplitL8K(a.s, @Delim) c
WHERE Switch <> 0
) c;
INSERT INTO @timer(T_TEXT) VALUES ('Don''t split strings that don''t need splitting')
INSERT INTO @timer(T_TEXT) VALUES ('Run DelimitedSplit8K on all strings if they need splitting or not')
SELECT
@STR_BUCKET = s
,@INT_BUCKET= ItemNumber
,@CHR_BUCKET = Item
FROM #SampleData a
CROSS APPLY dbo.DelimitedSplitL8K(a.s, @Delim) b;
INSERT INTO @timer(T_TEXT) VALUES ('Run DelimitedSplit8K on all strings if they need splitting or not')
SELECT
T.T_TEXT
,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION
FROM @timer T
GROUP BY T.T_TEXT
ORDER BY DURATION ASC
DROP TABLE #SampleData;
Results
T_TEXT DURATION
------------------------------------------------------------------- -----------
Run DelimitedSplit8K on all strings if they need splitting or not 1372079
Don't split strings that don't need splitting 1894108
Server version
Microsoft SQL Server 2014 - 12.0.2000.8 (X64)
Feb 20 2014 20:04:26
Copyright (c) Microsoft Corporation
Developer Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)
March 10, 2015 at 1:01 am
Eirikur,
My results with your harness confirm what you got, after I changed DelimitedSplitL8K back to DelimitedSplit8K.
T_TEXT DURATION
Run DelimitedSplit8K on all strings if they need splitting or not 1919981
Don't split strings that don't need splitting 2139979
Perhaps the difference is caused by shunting results into local variables, as opposed to a temp table like I did. I can't say for sure.
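For the record, here is a minimal self-contained sketch of the two sink styles being compared; the only difference is where the rows go once produced (the three-row VALUES source is just a stand-in for the real workload):
-- Temp-table sink (my harness): rows are materialised, so the write cost of
-- the output rows is part of the measured time.
SELECT n = v.n
INTO #SinkDemo
FROM (VALUES (1),(2),(3)) v(n);

-- Variable sink (Eirikur's harness): every row is still produced and consumed,
-- but nothing is written or returned; only the last value survives.
DECLARE @sink VARCHAR(10);
SELECT @sink = CONVERT(VARCHAR(10), v.n)
FROM (VALUES (1),(2),(3)) v(n);

DROP TABLE #SinkDemo;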
March 10, 2015 at 1:07 am
I guess the real point of my post was not to offer competition to DelimitedSplit8K (or any of its variants).
Rather, it was to see if anybody has ever seen this "short-circuiting of an iTVF" before. It looks interesting, and it certainly provided me a whopping performance benefit in the case where I was able to apply it. Unfortunately, that case is quite complicated to post about, so I thought I'd try it with DelimitedSplit8K as an easy example.
March 10, 2015 at 1:50 am
dwain.c (3/10/2015)
I guess the real point of my post was not to offer competition to DelimitedSplit8K (or any of its variants). Rather, it was to see if anybody has ever seen this "short-circuiting of an iTVF" before. It looks interesting, and it certainly provided me a whopping performance benefit in the case where I was able to apply it.
I can see potential in this short-circuiting method for exactly this reason; very interesting indeed, and quite different from the methods I've used.
😎
March 10, 2015 at 1:56 am
Eirikur Eiriksson (3/10/2015)
dwain.c (3/10/2015)
I guess the real point of my post was not to offer competition to DelimitedSplit8K (or any of its variants). Rather, it was to see if anybody has ever seen this "short-circuiting of an iTVF" before. It looks interesting, and it certainly provided me a whopping performance benefit in the case where I was able to apply it.
I can see potential in this short-circuiting method for exactly this reason; very interesting indeed, and quite different from the methods I've used.
😎
Indeed I agree, very interesting.
I think I even know why it works. It is probably because of the "in-lining" feature of this kind of TVF. I'd bet it doesn't work worth a hoot on a multi-statement TVF.
Edit: Of course, it does mean you need to "know your data" so you can identify cases when it will work. But I've found that to be good advice in many other circumstances as well.
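For anyone unfamiliar with the distinction being drawn here, a sketch of the two flavours (the function names are illustrative only, not anything from the thread):
-- Inline TVF: a single RETURN (SELECT ...). The optimiser expands it into the
-- calling query like a view, which is what lets the Switch predicate prune it.
CREATE FUNCTION dbo.Demo_iTVF (@s VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT Item = @s;
GO
-- Multi-statement TVF: fills a table variable and runs as an opaque unit per
-- call, which is likely why the short-circuit would pay off far less here.
CREATE FUNCTION dbo.Demo_mTVF (@s VARCHAR(8000))
RETURNS @t TABLE (Item VARCHAR(8000))
AS
BEGIN
    INSERT @t (Item) VALUES (@s);
    RETURN;
END;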
March 10, 2015 at 2:49 am
dwain.c (3/10/2015)
I guess the real point of my post was not to offer competition to DelimitedSplit8K (or any of its variants). Rather, it was to see if anybody has ever seen this "short-circuiting of an iTVF" before. It looks interesting, and it certainly provided me a whopping performance benefit in the case where I was able to apply it. Unfortunately, that case is quite complicated to post about, so I thought I'd try it with DelimitedSplit8K as an easy example.
Absolutely. I worked on a title/name-cleansing suite for marketing data a few years ago, and the code I ended up using had several short-circuits in it. Of course, if you can short-circuit (which is streaming your processing into two paths depending on the data) you can also define multiple data-dependent processing paths, which can be a lot more sophisticated than SELECT-CASE-WHERE or UPDATE-CASE-WHERE. It confuses the hell out of the optimiser, but if you have to scan through a single table with maybe a lookup or two, then options are already limited. A sketch of the multi-path idea follows.
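Purely illustrative (dbo.Contacts and FullName are made-up names, and the parsing rules are deliberately simple): three data-dependent paths, each handling only the rows whose shape it understands, recombined with UNION ALL.
WITH Src AS
(
    SELECT c.FullName,
           PathId = CASE WHEN c.FullName LIKE '%,%' THEN 1  -- "Last, First"
                         WHEN c.FullName LIKE '% %' THEN 2  -- "First Last"
                         ELSE 3 END                         -- single token
    FROM dbo.Contacts c
)
SELECT FullName,
       LastName = RTRIM(SUBSTRING(FullName, 1, CHARINDEX(',', FullName) - 1))
FROM Src WHERE PathId = 1
UNION ALL
SELECT FullName,
       LastName = SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1, 8000)
FROM Src WHERE PathId = 2
UNION ALL
SELECT FullName, LastName = FullName
FROM Src WHERE PathId = 3;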
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 10, 2015 at 6:59 am
I played around a little with the application of short-circuiting and ended up with something close to what I normally use for multi-statement table-valued functions: a 15% improvement on this particular set in this harness.
😎
;WITH BASE_DATA AS
(
SELECT
SD.s
,ISNULL(X.CI,0) AS SPLIT_FLAG
FROM #SampleData SD
CROSS APPLY (SELECT SIGN(CHARINDEX(@Delim,SD.s,1))) AS X(CI)
)
,CROSS_BREEDS AS
(
SELECT
X.ItemNumber
,ISNULL(X.Item,BD.SPLIT_FLAG) AS Item
FROM BASE_DATA BD
CROSS APPLY dbo.DelimitedSplit8K(BD.s,@Delim) AS X
WHERE BD.SPLIT_FLAG = 1
UNION ALL
SELECT
1
,BD.s
FROM BASE_DATA BD
WHERE BD.SPLIT_FLAG = 0
)
SELECT
@INT_BUCKET = CB.ItemNumber
,@CHR_BUCKET = CB.Item
FROM CROSS_BREEDS CB
The test harness
USE tempdb;
GO
SET NOCOUNT ON;
CREATE TABLE #SampleData
(
s VARCHAR(8000)
);
WITH Tally (n) AS
(
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO #SampleData (s)
SELECT CASE
-- 70% of the strings contain no delimiter (no split required)
WHEN rn BETWEEN 11 AND 80
THEN REPLICATE('abcd', ABS(CHECKSUM(NEWID()))%2000)
-- 20% of the strings contain a delimiter (split required)
WHEN rn BETWEEN 81 AND 100
THEN REPLICATE('abc,', ABS(CHECKSUM(NEWID()))%2000)
-- Remaining 10% of the strings are NULL
END
FROM (SELECT rn=1+ABS(CHECKSUM(NEWID()))%100) a
CROSS JOIN Tally b;
DECLARE @Delim CHAR(1) = ',';
DECLARE @timer TABLE (T_TEXT VARCHAR(100) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));
DECLARE @INT_BUCKET INT = 0;
DECLARE @CHR_BUCKET VARCHAR(2000) = '';
DECLARE @STR_BUCKET VARCHAR(2000) = '';
INSERT INTO @timer(T_TEXT) VALUES ('Dry run')
SELECT
@STR_BUCKET = SD.s
FROM #SampleData SD
INSERT INTO @timer(T_TEXT) VALUES ('Dry run')
INSERT INTO @timer(T_TEXT) VALUES ('Run DelimitedSplit8K on all strings if they need splitting or not')
SELECT
@STR_BUCKET = s
,@INT_BUCKET= ItemNumber
,@CHR_BUCKET = Item
FROM #SampleData a
CROSS APPLY dbo.DelimitedSplit8K(a.s, @Delim) b;
INSERT INTO @timer(T_TEXT) VALUES ('Run DelimitedSplit8K on all strings if they need splitting or not')
INSERT INTO @timer(T_TEXT) VALUES ('Don''t split strings that don''t need splitting')
SELECT
@STR_BUCKET = s
,@INT_BUCKET = ItemNumber
,@CHR_BUCKET = Item
FROM #SampleData a
CROSS APPLY
(
SELECT Switch=CASE WHEN s IS NULL OR CHARINDEX(@Delim, s) = 0 THEN 0 ELSE 1 END
) b
CROSS APPLY
(
SELECT ItemNumber=1, Item=s
WHERE Switch = 0
UNION ALL
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K(a.s, @Delim) c
WHERE Switch <> 0
) c;
INSERT INTO @timer(T_TEXT) VALUES ('Don''t split strings that don''t need splitting')
INSERT INTO @timer(T_TEXT) VALUES ('EE Don''t split strings that don''t need splitting')
;WITH BASE_DATA AS
(
SELECT
SD.s
,ISNULL(X.CI,0) AS SPLIT_FLAG
FROM #SampleData SD
CROSS APPLY (SELECT SIGN(CHARINDEX(@Delim,SD.s,1))) AS X(CI)
)
,CROSS_BREEDS AS
(
SELECT
X.ItemNumber
,ISNULL(X.Item,BD.SPLIT_FLAG) AS Item
FROM BASE_DATA BD
CROSS APPLY dbo.DelimitedSplit8K(BD.s,@Delim) AS X
WHERE BD.SPLIT_FLAG = 1
UNION ALL
SELECT
1
,BD.s
FROM BASE_DATA BD
WHERE BD.SPLIT_FLAG = 0
)
SELECT
@INT_BUCKET = CB.ItemNumber
,@CHR_BUCKET = CB.Item
FROM CROSS_BREEDS CB
INSERT INTO @timer(T_TEXT) VALUES ('EE Don''t split strings that don''t need splitting')
SELECT
T.T_TEXT
,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION
FROM @timer T
GROUP BY T.T_TEXT
ORDER BY DURATION ASC
DROP TABLE #SampleData;
Results (hot)
T_TEXT DURATION
------------------------------------------------------------------ -----------
Dry run 0
EE Don't split strings that don't need splitting 1236400
Run DelimitedSplit8K on all strings if they need splitting or not 1463200
Don't split strings that don't need splitting 1499600
Server Version
Microsoft SQL Server 2012 - 11.0.5058.0 (X64)
May 14 2014 18:34:29
Copyright (c) Microsoft Corporation
Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)
March 10, 2015 at 6:49 pm
Eirikur,
Thanks for continuing to play around with it. I did too, and I am getting some inconsistent results with DelimitedSplit8K using my original approach. I believe this is due to the already high performance of the iTVF.
I also applied it to a case using a splitter I wrote about some time in the past (see my signature link about splitting strings based on patterns) called PatternSplitCM. There the results were more consistent, because that iTVF is doing more work.
I expect there is a relationship between the performance of the iTVF and the number of degenerate cases in your underlying data: the higher the performance of the iTVF, the more degenerate cases you need in the underlying data for the short-circuit to be effective (roughly, the CHARINDEX test and the extra APPLY must cost less than the splitter calls they avoid). Edit: Also, what works in the bulk case (many rows) may not be effective (and may add way too much overhead) if you're only doing this on a minimal number of rows.
I'm a little surprised that more people haven't come along and said (like ChrisM) "this is old hat." Googling on "short circuit SQL iTVF" certainly brings me no joy.