June 2, 2017 at 9:13 am
Eirikur Eiriksson - Friday, June 2, 2017 8:38 AM
Jacob Wilkins - Friday, June 2, 2017 8:18 AM
Eirikur Eiriksson - Friday, June 2, 2017 2:22 AM
Be careful with that order by clause, it will significantly slow down the query!
😎
Indeed.
I kept it included because, very strictly speaking, there is no guarantee of correctness of the query without it.
To guarantee the query is correct, you have to know that the N values returned for a string of length 3 will be 1, 2, and 3.
Anything else would yield incorrect results.
As you did, I noticed that I seemed to get correct results every time without it, but without the ORDER BY, there's just no guarantee you'll get the right N from the APPLY.
Perhaps the engine will actually return them in the desired order under every possible circumstance, but I'm a paranoid DBA, so I couldn't bring myself to make that assumption 🙂
From a set-based perspective, the order of appearance is irrelevant; the result set will be correct with or without the ORDER BY clause, as the N will always correspond to the position of the character returned. In fact, when using MAXDOP 1, the output always appears in the correct order, although, as you said, there is no guarantee that will always be so.
😎
A greater culprit is not using the TOP operator: especially when handling a greater variety of input lengths, the constant scan operator will feed the join operator the maximum number of rows on every iteration.
It's not the presentation order I'm talking about; it's actually the correctness of the query.
We're returning some values for N from the APPLY, and using them in the SUBSTRING function.
Let's say we're dealing with a category with LEN of 3.
It matters very much whether the three N returned are {1,2,3} or {2,4,5} or something else.
If the three N returned are anything but {1,2,3}, the results returned for that ID and category will not be correct.
To actually guarantee that you return only N from 1 to LEN(category), and subsequently get each character of the category from the SUBSTRING, you need the ORDER BY.
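To make that concrete, here's a quick throwaway illustration (a scratch #demo table rather than the real test data) of how the ORDER BY is what pins the TOP to returning exactly 1 through LEN:
CREATE TABLE #demo (ID int, Category varchar(5));
INSERT INTO #demo (ID, Category) VALUES (1, 'ABC');
SELECT d.ID,
       t.N,
       SUBSTRING(d.Category, t.N, 1) AS Chr
FROM #demo d
CROSS APPLY (SELECT TOP (LEN(d.Category)) v.N
             FROM (VALUES (1),(2),(3),(4),(5)) v (N)
             ORDER BY v.N) t (N);
-- With the ORDER BY, TOP (3) must return {1,2,3}; without it, TOP (3) could
-- legally return any three of the five values (e.g. {2,4,5}), which would
-- substring the wrong character positions.
DROP TABLE #demo;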
Cheers!
June 2, 2017 at 2:16 pm
Eirikur Eiriksson - Friday, June 2, 2017 3:23 AM
Using the LEN function in the WHERE clause is a killer here; it slows the query down several times over, and it is much more efficient to use the TOP operator within the numbers CTE. The reason for this is that with the former, the server generates the full output of the CTE and the filter is applied afterwards, whilst the TOP operator limits the number of rows generated.
😎
Eirikur, you sure hit the nail on the head there. After tearing into it a little more, the LEN in the WHERE clause just kills it.
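Just to spell out the two shapes being compared (a scratch #test2 table and bucket variables standing in for the real harness, and the inline VALUES list rather than a numbers CTE):
CREATE TABLE #test2 (ID int, Category varchar(5));
INSERT INTO #test2 (ID, Category) VALUES (1, '5HW6'), (2, '256');
DECLARE @id int, @n int, @char char(1);
-- 1) LEN() in the WHERE clause: the row source is generated in full, then filtered.
SELECT @id = d.ID,
       @n = t.N,
       @char = SUBSTRING(d.Category, t.N, 1)
FROM #test2 d
CROSS APPLY (SELECT v.N
             FROM (VALUES (1),(2),(3),(4),(5)) v (N)
             WHERE v.N <= LEN(d.Category)) t (N);
-- 2) TOP inside the APPLY: the constant scan only feeds LEN(Category) rows to the join per outer row.
SELECT @id = d.ID,
       @n = t.N,
       @char = SUBSTRING(d.Category, t.N, 1)
FROM #test2 d
CROSS APPLY (SELECT TOP (LEN(d.Category)) v.N
             FROM (VALUES (1),(2),(3),(4),(5)) v (N)) t (N);
DROP TABLE #test2;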
It looks like using the ITVF instead of the table-value constructor in the CROSS APPLY makes a real difference. Here's another variation to run through your mill. My timing is meaningless because we have completely different hardware.
The plan looks horrible because of the function (it always looks that way), but it performs well.
INSERT INTO @times(task,d) VALUES ('TallyN #3', GETDATE());
SELECT @id = ID,
@n = t.N,
@char = SUBSTRING(d.Category, t.N, 1)
FROM #test d
CROSS APPLY (SELECT TOP (LEN(d.Category)) N
FROM util.dbo.TallyN(5)) t (N);
INSERT INTO @times(task,d) VALUES ('TallyN #3', GETDATE());
June 2, 2017 at 5:21 pm
Ok, so I finally got a chance to look at this again.
There are a couple of ways of modifying the strictly-speaking-unsafe queries to make them safe that don't result in too much of a slowdown.
One is to use ROW_NUMBER() to create the values in the APPLY, and the other is just to add a WHERE N <= LEN(Category) to the query in the APPLY.
The latter is slightly but consistently the faster of the two on my machine.
One interesting note: using OUTER APPLY instead of CROSS APPLY is consistently good for an improvement of about 300ms across these similar solutions.
Here are those modified queries, along with a tweaked TallyN #3 (switched to OUTER APPLY for that modest speed bump, with MAXDOP 1 added since it also likes going parallel), in a form to be used with Eirikur's test harness:
INSERT INTO @timer(T_TXT) VALUES ('Jacob #3');
SELECT
@INT_BUCKET = SSC.SSC_ID
,@INT_BUCKET = NM.N
,@CHAR_BUCKET = SUBSTRING(SSC.CATEGORY,NM.N,1)
FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC
OUTER APPLY (SELECT TOP(LEN(SSC.CATEGORY)) N=ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM (VALUES (0),(0),(0),(0),(0)) X(N)) AS NM
OPTION (MAXDOP 1);
INSERT INTO @timer(T_TXT) VALUES ('Jacob #3');
INSERT INTO @timer(T_TXT) VALUES ('Jacob #4');
SELECT @INT_BUCKET = SSC.SSC_ID,
@INT_BUCKET = t.n,
@CHAR_BUCKET = SUBSTRING(SSC.CATEGORY, t.N, 1)
FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC
OUTER APPLY
(
SELECT TOP (LEN(category)) n
FROM (VALUES(1),(2),(3),(4),(5))n(n)
WHERE n<=LEN(category)
)t(n)
OPTION (MAXDOP 1);
INSERT INTO @timer(T_TXT) VALUES ('Jacob #4');
INSERT INTO @timer(T_TXT) VALUES ('TallyN #3');
SELECT @INT_BUCKET = SSC.SSC_ID,
@INT_BUCKET = t.n,
@CHAR_BUCKET = SUBSTRING(SSC.CATEGORY, t.N, 1)
FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC
OUTER APPLY (SELECT TOP (LEN(SSC.Category)) N
FROM TallyN(5)) t (N)
OPTION (MAXDOP 1);
INSERT INTO @timer(T_TXT) VALUES ('TallyN #3');
And some representative results on my system, with Jacob #1 and Jacob #2 removed (#1 was much slower than these, and the #2 Eirikur provided wasn't all that different from Eirikur's own, so removing them helped get in more test runs):
T_TXT                DURATION
-------------------- -----------
DRY RUN                   526000
EE OUTER APPLY           2550000
EE CROSS APPLY           2853000
Jacob #4                 3273000
Jacob #3                 3650000
TallyN #3                8630000
June 2, 2017 at 7:00 pm
I don't know what I'm doing differently from you guys, but my times are a whole lot different.
Run              Time
---------------  ----
EE Outer Apply   1026
Jacob #4         1383
TallyN #3        2356
Jacob #1         4756
And yes, I've already checked my row counts to make sure.
June 3, 2017 at 1:57 am
Ed Wagner - Friday, June 2, 2017 7:00 PM
I don't know what I'm doing differently from you guys, but my times are a whole lot different.
Run              Time
---------------  ----
EE Outer Apply   1026
Jacob #4         1383
TallyN #3        2356
Jacob #1         4756
And yes, I've already checked my row counts to make sure.
Ed, the difference is that you are using milliseconds in the final timer result query, whereas we are using microseconds 😉
😎
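In other words, it's just the datepart in the final DATEDIFF; something like this shows the scale difference (a scratch @t table variable standing in for the harness's @timer):
DECLARE @t TABLE (T_TXT varchar(30), T_TS datetime2(7) DEFAULT (SYSDATETIME()));
INSERT INTO @t (T_TXT) VALUES ('DEMO');
WAITFOR DELAY '00:00:01';
INSERT INTO @t (T_TXT) VALUES ('DEMO');
-- Milliseconds: numbers in the ~1,000 range
SELECT T_TXT, DATEDIFF(MILLISECOND, MIN(T_TS), MAX(T_TS)) AS DURATION_MS FROM @t GROUP BY T_TXT;
-- Microseconds: same measurement, numbers in the ~1,000,000 range
SELECT T_TXT, DATEDIFF(MICROSECOND, MIN(T_TS), MAX(T_TS)) AS DURATION_US FROM @t GROUP BY T_TXT;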
June 3, 2017 at 2:14 am
Ed Wagner - Thursday, June 1, 2017 7:28 PM
I see I forgot to define dbo.TallyN. It's Itzik's zero-read tally ITVF. I use it so much I sometimes think it's a part of SQL Server. It isn't, but it should be.
ALTER FUNCTION dbo.TallyN(@N Bigint) RETURNS TABLE WITH SCHEMABINDING
AS
--Credit: This function was written by Itzik Ben-Gan at http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
RETURN
WITH level0 AS (SELECT 0 AS g UNION ALL SELECT 0), --2
level1 AS (SELECT 0 AS g FROM level0 AS a CROSS JOIN level0 AS b), --2^2 = 4
level2 AS (SELECT 0 AS g FROM level1 AS a CROSS JOIN level1 AS b), --4^2 = 16
level3 AS (SELECT 0 AS g FROM level2 AS a CROSS JOIN level2 AS b), --16^2 = 256
level4 AS (SELECT 0 AS g FROM level3 AS a CROSS JOIN level3 AS b), --256^2 = 65536
level5 AS (SELECT 0 AS g FROM level4 AS a CROSS JOIN level4 AS b), --65536^2 = 4294967296
Tally AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM level5)
SELECT TOP (@N) N
FROM Tally
ORDER BY N;
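A quick usage example, for anyone who wants to try it as-is (on its own and in the character-split pattern from this thread):
SELECT N FROM dbo.TallyN(10);   -- first ten numbers, no reads against any user table
DECLARE @s varchar(10) = '256HW';
SELECT t.N, SUBSTRING(@s, t.N, 1) AS Chr
FROM dbo.TallyN(LEN(@s)) AS t (N);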
If I'm working with varchar(8000) or less, then I normally use a simpler version which outputs a maximum of 10K rows. It tends to run quite a lot faster than the full-blown version.
😎
CREATE OR ALTER FUNCTION dbo.NUMGEN10K
(
@MAX_N INT
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
SELECT TOP(@MAX_N) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
FROM T T1,T T2,T T3,T T4
GO
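And a quick example of it in use, splitting a local variable into characters (same pattern as elsewhere in the thread):
DECLARE @s varchar(8000) = 'ABCDE';
SELECT NG.N, SUBSTRING(@s, NG.N, 1) AS Chr
FROM dbo.NUMGEN10K(LEN(@s)) AS NG;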
June 5, 2017 at 2:38 pm
Eirikur Eiriksson - Saturday, June 3, 2017 1:57 AM
Ed Wagner - Friday, June 2, 2017 7:00 PM
I don't know what I'm doing differently from you guys, but my times are a whole lot different.
Run              Time
---------------  ----
EE Outer Apply   1026
Jacob #4         1383
TallyN #3        2356
Jacob #1         4756
And yes, I've already checked my row counts to make sure.
Ed, the difference is that you are using milliseconds in the final timer result query, whereas we are using microseconds 😉
😎
Yeah, but it's more than that. Even after dividing your times, mine are at least half of yours. Sometimes they're really different and the rank is different. Your EE OUTER APPLY wins in either case, but the times are proportionally different from each other. I've learned something from the whole discussion, so I think that having fun has been worth it - the differences are what bugs me. You know, I'm content to just let it go, but I will play with it a bit more on my machine. 😉
June 6, 2017 at 9:18 am
You have me curious, Ed. 🙂
Having said that, it looks like the main difference in your results is the ratio of Jacob #1 to the others. Everything else looks approximately right (full disclosure: I'm eyeballing that).
Mind sharing the full test script you're using?
Cheers!
June 6, 2017 at 9:22 am
Jacob Wilkins - Tuesday, June 6, 2017 9:18 AM
You have me curious, Ed. 🙂
Having said that, it looks like the main difference in your results is the ratio of Jacob #1 to the others. Everything else looks approximately right (full disclosure: I'm eyeballing that).
Mind sharing the full test script you're using?
Cheers!
Not at all. I don't have it here at work, but it should still be in an SSMS window at home. And yes, it's saved for future playing.
June 21, 2017 at 9:37 am
Just for gits and shiggles, I figured I'd try this on for size, and try another option based on Jeff Moden's DelimitedSplit8K function.
Here's the function with the delimiter removed and everything stripped out that didn't need to be there:
USE [LOCAL_DB]
GO
IF OBJECT_ID(N'dbo.NonDelimitedSplit8K', N'IF') IS NOT NULL
BEGIN
DROP FUNCTION dbo.NonDelimitedSplit8K;
END;
GO
/****** Object: UserDefinedFunction [dbo].[DelimitedSplit8K] Script Date: 6/21/2017 9:50:34 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION dbo.NonDelimitedSplit8K (
@pString varchar(5)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
--5 rows (the other half of the original 10-row CTE is commented out below)
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 --UNION ALL SELECT 1 UNION ALL
--SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E1
)
--===== Do the actual split. With the delimiter removed, each element is just a single character of the input.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N),
Item = SUBSTRING(@pString, l.N, 1)
FROM cteTally l;
GO
Now here's the test rig:
USE LOCAL_DB;
GO
SET NOCOUNT ON;
DBCC FREEPROCCACHE;
/*
IF OBJECT_ID(N'dbo.TEST', N'U') IS NOT NULL
BEGIN
DROP TABLE dbo.TEST;
END;
GO
CREATE TABLE dbo.TEST (
ID int NOT NULL CONSTRAINT PK_TEST_ID PRIMARY KEY CLUSTERED,
CATEGORY varchar(10),
CAT_LEN AS LEN(CATEGORY)
);
*/
DECLARE @SAMPLE_SIZE AS int = 2000000;
DECLARE @MAX_CHARS AS int = 5;
DECLARE @INT_BUCKET AS int = 0;
DECLARE @CHAR_BUCKET AS varchar(10) = '';
DECLARE @BATCH_ID AS int = 1;
DECLARE @timer AS TABLE (
T_TXT varchar(30) NOT NULL,
T_TS datetime2(7) NOT NULL DEFAULT (SYSDATETIME()),
T_BATCH int NOT NULL
);
/*
INSERT INTO TEST (ID, CATEGORY)
VALUES (1,'5HW6'),
(2,'256'),
(3,'57'),
(4,'3'),
(5,'4578'),
(6,'256HW');
SELECT *
FROM dbo.TEST;
*/
/*
WITH T(N) AS (
SELECT X.N
FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N)
),
NUMS(N) AS (
SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9
)
INSERT INTO dbo.TEST (ID, CATEGORY)
SELECT
NM.N,
SUBSTRING(REPLACE(CONVERT(varchar(36), NEWID(), 0), CHAR(45), CHAR(65)), 1, (ABS(CHECKSUM(NEWID())) % @MAX_CHARS) + 1)
FROM NUMS AS NM;
*/
WHILE @BATCH_ID <= 10
BEGIN
INSERT INTO @timer(T_TXT, T_BATCH) VALUES('NON-DELIMITED-SPLIT8K', @BATCH_ID);
SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = S.Item
FROM dbo.TEST AS T
OUTER APPLY dbo.NonDelimitedSplit8K(T.CATEGORY) AS S
INSERT INTO @timer(T_TXT, T_BATCH) VALUES('NON-DELIMITED-SPLIT8K', @BATCH_ID);
INSERT INTO @timer(T_TXT, T_BATCH) VALUES('EE OUTER APPLY', @BATCH_ID);
SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = SUBSTRING(T.CATEGORY, S.N, 1)
FROM dbo.TEST AS T
OUTER APPLY (SELECT TOP (LEN(T.CATEGORY)) X.N FROM (VALUES (1),(2),(3),(4),(5)) AS X(N)) AS S
INSERT INTO @timer(T_TXT, T_BATCH) VALUES('EE OUTER APPLY', @BATCH_ID);
INSERT INTO @timer(T_TXT, T_BATCH) VALUES('EE OUTER APPLY SGM', @BATCH_ID);
SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = SUBSTRING(T.CATEGORY, S.N, 1)
FROM dbo.TEST AS T
OUTER APPLY (SELECT TOP (T.CAT_LEN) X.N FROM (VALUES (1),(2),(3),(4),(5)) AS X(N)) AS S
INSERT INTO @timer(T_TXT, T_BATCH) VALUES('EE OUTER APPLY SGM', @BATCH_ID);
SELECT @BATCH_ID = @BATCH_ID + 1;
END;
SELECT
T.T_BATCH,
T.T_TXT,
DATEDIFF(MICROSECOND, MIN(T.T_TS), MAX(T.T_TS)) AS DURATION
FROM @timer AS T
GROUP BY T_BATCH, T.T_TXT
ORDER BY T_BATCH, DURATION ASC;
--DROP TABLE TEST;
--TRUNCATE TABLE dbo.TEST;
And finally, the results:
T_BATCH T_TXT DURATION
1 NON-DELIMITED-SPLIT8K 15600
1 EE OUTER APPLY 46800
1 EE OUTER APPLY SGM 46800
2 EE OUTER APPLY 46800
2 NON-DELIMITED-SPLIT8K 46800
2 EE OUTER APPLY SGM 62400
3 NON-DELIMITED-SPLIT8K 62400
3 EE OUTER APPLY 63200
3 EE OUTER APPLY SGM 100600
4 EE OUTER APPLY SGM 93600
4 NON-DELIMITED-SPLIT8K 93600
4 EE OUTER APPLY 109200
5 EE OUTER APPLY 109200
5 NON-DELIMITED-SPLIT8K 109200
5 EE OUTER APPLY SGM 124800
6 EE OUTER APPLY 124800
6 NON-DELIMITED-SPLIT8K 124800
6 EE OUTER APPLY SGM 140400
7 NON-DELIMITED-SPLIT8K 140400
7 EE OUTER APPLY SGM 156000
7 EE OUTER APPLY 156000
8 EE OUTER APPLY 171600
8 NON-DELIMITED-SPLIT8K 171600
8 EE OUTER APPLY SGM 187200
9 NON-DELIMITED-SPLIT8K 187200
9 EE OUTER APPLY 202800
9 EE OUTER APPLY SGM 249600
10 EE OUTER APPLY SGM 218400
10 EE OUTER APPLY 218400
10 NON-DELIMITED-SPLIT8K 218400
Just not sure why I get such large differences in the timings.
Steve (aka sgmunson) 🙂 🙂 🙂