October 21, 2016 at 6:24 am
I want to improve the following statement:
select * into Lines_410
from Lines_100 A join Ranges on a.linenumber >= Low and a.linenumber <= High
-- Number of rows
-- Lines_100 300000
-- Ranges 5000
-- Lines_410 90000
-- (The average rangelength is 18, ranges do not overlap so each line can only be in one range).
-- The sizes are from a test dataset, the actual dataset is much larger. (But less practical to test).
--
-- Code to be improved.
--
select * into Lines_410
from Lines_100 A join Ranges on a.linenumber >= Low and a.linenumber <= High
--
-- Table Lines_100
--
CREATE TABLE [dbo].[Lines_100](
[LINENUMBER] [bigint] NULL,
[Other_fields] [varchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
--
-- Table Ranges
--
CREATE TABLE [dbo].[Ranges](
[Low] [bigint] NULL,
[High] [bigint] NULL
) ON [PRIMARY]
-- Spoiler alert !
-- Spoiler alert !
-- Spoiler alert !
-- Spoiler alert !
-- Spoiler alert !
-- Spoiler alert !
-- Used indexes :
CREATE UNIQUE CLUSTERED INDEX IX_Lines_100 ON dbo.Lines_100
(
Linenumber
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE UNIQUE CLUSTERED INDEX IX_Ranges ON dbo.Ranges(
Low
, High
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
-- Runtimes seconds
-- No index. 118.12
-- Index lines_100. 118.176666666667
-- Index lines_100 / Ranges. 225.18
-- Index lines_100 / Ranges. 232.663333333333
-- Index Ranges. 222.606666666667 After the drop of index on lines_100
-- Index Ranges. 220.356666666667 After rebuild of lines_100 (no index).
-- No index. 118.056666666667 Drop index op Ranges.
-- Index lines_100 119.946666666667 Rebuild of the index
Removing the index from the ranges table (or not adding it in the first place) has a surprising influence on the performance.
Performance is 'far' worse with the index, than without the index.
The question is : Why ?
--
-- Stef ten Bras
-- Aka Ben Brugman
--
A similar statement to be improved is:
delete g from Lines_100 G join Ranges on G.linenumber >= van and g.LineNumber <=einde
Thanks, any advise to improve the performance is welcome.
Ben
October 21, 2016 at 7:53 am
The guesses made by the optimiser are whacked somewhat by the ranges. Try forcing a loop join reading first from the ranges table like this:
SELECT *
FROM #Ranges r
INNER loop JOIN #Lines_100 l
ON l.linenumber >= Low
AND l.linenumber <= High
-- (90019 row(s) affected) / 00:00:17
If anyone else wants to play, here's a test setup:
---------------------------------------
-- Table [Lines_100]
---------------------------------------
IF 0=1 BEGIN
DROP TABLE #Lines_100
CREATE TABLE #Lines_100(
[LINENUMBER] [bigint] NULL,
[Other_fields] [varchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
INSERT INTO #Lines_100 (LINENUMBER, Other_fields)
SELECT
LINENUMBER = d.NewNumber,
Other_fields = REPLICATE(CAST(NEWID() AS VARCHAR(36)) + '&',d.NewNumber%100)
FROM (
SELECT TOP(300000) NewNumber FROM (SELECT TOP(350000) NewNumber = ABS(CHECKSUM(NEWID())) FROM sys.columns a CROSS JOIN sys.columns b) d GROUP BY NewNumber
) d
CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #Lines_100 (LINENUMBER)
CREATE UNIQUE NONCLUSTERED INDEX uix_Stuff ON #Lines_100 (LINENUMBER) -- Benchmark, select only LINENUMBER in query
---------------------------------------
-- Table [Ranges]
---------------------------------------
DROP TABLE #Ranges
CREATE TABLE #Ranges(
[Low] [bigint] NULL,
[High] [bigint] NULL
) ON [PRIMARY]
INSERT INTO #Ranges (Low, High)
SELECT TOP(90000) LINENUMBER-1, LINENUMBER+1 FROM #Lines_100 ORDER BY Other_fields
CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #Ranges ([Low], [High])
END
-----------------------------------------------------------------------------------------
SELECT *
FROM #Ranges r
INNER loop JOIN #Lines_100 l
ON l.linenumber >= Low
AND l.linenumber <= High
-- (90019 row(s) affected) / 00:00:17
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 21, 2016 at 9:28 am
The problem with ranges is that it requires two range predicates (>=, >, <=, or <), but SQL indexes only efficiently handle one range predicate. There is a way to convert this to an equality predicate and a range predicate, but it's very mathematical. It will take me awhile to write up this solution. It also requires a computed column on your ranges table. It may be moot if you can't add that computed column.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 21, 2016 at 9:54 am
drew.allen (10/21/2016)
The problem with ranges is that it requires two range predicates (>=, >, <=, or <), but SQL indexes only efficiently handle one range predicate. There is a way to convert this to an equality predicate and a range predicate, but it's very mathematical. It will take me awhile to write up this solution. It also requires a computed column on your ranges table. It may be moot if you can't add that computed column.Drew
Very interested to see this Drew. I usually make do with one column keyed (always check both bounds of the range) and the other INCLUDEd - you get a seek on one and a residual on the other.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 21, 2016 at 12:10 pm
Unfortunately, I'm not sure that it will work in this case, because the article does calculations on constants, but reworking it for this, would require doing those same calculations on a field, and thus, it's no longer SARGable. I'll keep working on it, but I don't see a way around it, yet.
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 21, 2016 at 2:48 pm
Another way to try is to unpack the ranges into their constituent values and then do a join based on equality to those values.
Sometimes that's slower, but at least with Chris' sample data setup, it is a bit faster on my machine than doing the join with order enforced, a bit lighter on CPU, and a lot lighter on reads.
query runs duration cpu reads
---------------------- ---- --------- -------- -------
Test: UNPACKED QUERY 10 8299557 8290538 816890
Test: RANGE JOIN QUERY 10 10561492 10555492 4371355
You pay some penalty for doing the unpacking and sorting the results, but at least on my machine with Chris' data that's outweighed by getting to use a merge join.
Here's the code I ran after using Chris' sample setup:
GO --End Chris' setup batch
EXEC(
'--Test: UNPACKED QUERY
DECLARE @bigint_bucket BIGINT,
@varchar_bucket VARCHAR(MAX);
SELECT @bigint_bucket=Low+RN, @bigint_bucket=Low, @bigint_bucket=High,@bigint_bucket=LINENUMBER,@varchar_bucket=other_fields
FROM #Ranges R
CROSS APPLY (SELECT TOP ((High-Low)+1) RN=ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1
FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n1(n)
CROSS JOIN
(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n2(n)
CROSS JOIN
(VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n3(n)
) tally
INNER JOIN
#Lines_100 L ON L.LINENUMBER=Low+RN;
');
EXEC(
'--Test: RANGE JOIN QUERY
DECLARE @bigint_bucket BIGINT,
@varchar_bucket VARCHAR(MAX);
SELECT @bigint_bucket=LINENUMBER,
@bigint_bucket=low,
@bigint_bucket=High,
@varchar_bucket=Other_fields
FROM #Ranges r
INNER LOOP JOIN
#Lines_100 l
ON l.linenumber >= Low
AND
l.linenumber <= High;
');
GO 10 --Run each query 10 times.
SELECT query=SUBSTRING(qt.text,3,CHARINDEX('QUERY',qt.text)+2),
runs=qs.execution_count,
duration=total_elapsed_time,
cpu=total_worker_time,
reads=total_logical_reads
FROM sys.dm_exec_query_stats qs
CROSS APPLY
sys.dm_exec_sql_text(qs.sql_handle) qt
WHERE qt.text LIKE '--Test%'
ORDER BY total_elapsed_time ASC
;
Cheers!
EDIT: Fixed some of the code formatting. Using a new SSMS and haven't switched the settings around tabs/spaces, so it didn't paste very well.
I should also note that as written the unpacked query will handle ranges of up to 1000. If ranges are larger, the tally section will have to be adjusted. Performance is fairly sensitive to that, which is likely one reason it does relatively well on Chris' sample data with its very small ranges.
I'll probably try in a bit with larger ranges to see where the break even point is. I imagine it'll be pretty low.
EDIT 2: As suspected, the cost of sorting gets prohibitive pretty quickly as you increase the range size. The break even point is at an average range size of 5 on my machine. With the average range size of 18 you mentioned, the only way to really make something like this feasible would be if the Ranges table is not updated frequently.
In that case, you could probably maintain the unpacked list either as an indexed view (although the unpacking would have to be done differently in that case) or as a separate table, and just join the Line table to that.
Even that might not be much faster, though.
October 24, 2016 at 4:04 am
ChrisM@Work (10/21/2016)
The guesses made by the optimiser are whacked somewhat by the ranges.
Yes they are WHacked.
Try forcing a loop join reading first from the ranges table like this:
SELECT *
FROM #Ranges r
INNER loop JOIN #Lines_100 l
ON l.linenumber >= Low
AND l.linenumber <= High
-- (90019 row(s) affected) / 00:00:17
THANKS.
The improvement on the specific statements is Huge.
Also thanks for the example, although it does not actually match the situation, it does illustrate the working of the statement.
(My code performs extremely bad with the example code. I did not finish the test.)
Thanks for this huge improvement and educating me,
Ben
Because of changes in the testset, I can not reproduce the previous testtimes.
But for the currect set the suggested (forum) code (with indexes) outperforms my previous code (no indexes) with about a factor 25.
October 31, 2016 at 2:57 am
ben.brugman (10/24/2016)
ChrisM@Work (10/21/2016)
The guesses made by the optimiser are whacked somewhat by the ranges.Yes they are WHacked.
Try forcing a loop join reading first from the ranges table like this:
SELECT *
FROM #Ranges r
INNER loop JOIN #Lines_100 l
ON l.linenumber >= Low
AND l.linenumber <= High
-- (90019 row(s) affected) / 00:00:17
THANKS.
The improvement on the specific statements is Huge.
Also thanks for the example, although it does not actually match the situation, it does illustrate the working of the statement.
(My code performs extremely bad with the example code. I did not finish the test.)
Thanks for this huge improvement and educating me,
Ben
Because of changes in the testset, I can not reproduce the previous testtimes.
But for the currect set the suggested (forum) code (with indexes) outperforms my previous code (no indexes) with about a factor 25.
Hi Ben, thanks for your generous feedback and humble apologies for not responding sooner - the sky fell down this month.
In your shoes, I'd now look at encouraging the optimiser to use exactly this join but without the hint, which very generally means investigating the two read points to assess what indexes might be optimal then testing them out. The benefit is that if the metrics of the two tables involved change to the point that this plan is no longer ideal, the optimiser won't be constrained by the hint.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply