May 17, 2018 at 5:23 am
I have one "big" table with 300 million rows... CREATE TABLE rstore.OutputDataset(OutputDatasetId INT NOT NULL IDENTITY(1,1)
,OutputDatasetHeaderKey INT NOT NULL
,DateKey INT NOT NULL
,SafeRegionKey SMALLINT NOT NULL
,SafeCountryKey SMALLINT NOT NULL
,Attribute1ValueKey INT NULL
,Attribute2ValueKey INT NULL
,Attribute3ValueKey INT NULL
,Attribute4ValueKey INT NULL
,Attribute5ValueKey INT NULL
,Attribute6ValueKey INT NULL
,Attribute7ValueKey INT NULL
,Attribute8ValueKey INT NULL
,Attribute9ValueKey INT NULL
,Attribute10ValueKey INT NULL
,SalesUnits DECIMAL(19, 4) NOT NULL
,SalesValue DECIMAL(19, 4) NOT NULL
,PriorSalesUnits DECIMAL(19, 4) NOT NULL DEFAULT 0
,PriorSalesValue DECIMAL(19, 4) NOT NULL DEFAULT 0
,DataStatusTypeKey TINYINT NOT NULL
,CONSTRAINT PK_OutputDataset PRIMARY KEY NONCLUSTERED(OutputDatasetId)
,CONSTRAINT FK_OutputDataset_... -- quite a few foreign keys
);
BUT a beautiful clustered index that reduces the number of rows fetched to ~300,000 (there is no nonclustered index... since my clustered index is so big anyway):

CREATE UNIQUE CLUSTERED INDEX UCIX_OutputDataSet ON rstore.OutputDataset(
OutputDatasetHeaderKey
,DateKey, SafeRegionKey, SafeCountryKey
,Attribute1ValueKey ,Attribute2ValueKey ,Attribute3ValueKey ,Attribute4ValueKey ,Attribute5ValueKey ,Attribute6ValueKey ,Attribute7ValueKey ,Attribute8ValueKey ,Attribute9ValueKey ,Attribute10ValueKey
);
ALL our queries specify the column OutputDatasetHeaderKey (which shrinks the number of rows by a lot), usually followed by a DateKey range, and most of the time by the Region Key, etc
The attached query is disappointingly slow (~11 seconds on my test server)
The two temp tables get loaded with about 10,000 rows each, half of them useful.
The main table retrieves about 300,000 rows through the clustered index.
In my particular case, the columns Attribute1ValueKey to Attribute5ValueKey have values, so the first 9 columns of the clustered index are not null.
What I don't understand is that the optimiser does not seem to be able to use the fact that all three clustered indexes have similar columns in the same order.
On the temp tables, the clustered PK is DateKey, SafeRegionKey and Attribute3ValueKey, all of which appear in my main table's clustered index.
Although I don't filter my main table on SafeRegionKey, this column is still NOT NULL and part of the clustered index...
Not quite sure how to improve this query...
As a side note, if I specify a SafeRegionKey to filter on, then everything seems fine... (the last statement is actually dynamic SQL that I generate and execute through sp_executesql)
I can see that after the merge between OutputDataset and #GrossupUnit, ~16 million rows have been read... but I have no idea how to stop it :crazy:
Any suggestions?
Thanks
Eric
May 17, 2018 at 5:45 am
I very much doubt you need all of those columns in your Clustered Index. Have a look at this article from Redgate: https://www.red-gate.com/simple-talk/sql/learn-sql-server/effective-clustered-indexes/.
Notice that the article says a clustered index should be Unique, Static, Ever-Increasing and, first and foremost, Narrow. 14 columns (if I counted correctly) is far too many for a clustered index, and I doubt that those values are Static either.
The first step is reviewing your Clustered Index, and the next will be what other indexes will be beneficial for your table.
Also, the fact that you have columns Attribute1ValueKey to Attribute10ValueKey implies you have de-normalised data. This isn't going to help performance either; I strongly suggest reconsidering your data structure there and using a normalised format.
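To illustrate what a narrower design might look like (a sketch only, not a tested recommendation; the index name and INCLUDE list are assumptions): cluster the table on the narrow identity column and give the common filter pattern its own nonclustered index, something like:

-- Sketch only: if the table were clustered on the narrow identity column
-- (OutputDatasetId), the common access path (header key, then a DateKey range)
-- could be served by a nonclustered index, made covering via INCLUDE.
CREATE NONCLUSTERED INDEX IX_OutputDataset_Header_Date
    ON rstore.OutputDataset (OutputDatasetHeaderKey, DateKey, SafeRegionKey, Attribute3ValueKey)
    INCLUDE (SalesUnits, SalesValue, PriorSalesUnits, PriorSalesValue);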
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
May 17, 2018 at 6:13 am
Why not combine those two lookups into one? Other than that, what Thom said. The clustered index is far too wide. Don't forget that any ordinary (nonclustered) index on the same table will contain the keys of the clustered index.
SELECT
DateKey,
SafeRegionKey,
SafeFormFactorKey,
GrossupUnit = MAX(CASE WHEN T.MetadataTypeName = 'Grossup Unit' THEN MetadataValue ELSE NULL END),
GrossupValue = MAX(CASE WHEN T.MetadataTypeName = 'Grossup Value' THEN MetadataValue ELSE NULL END)
INTO #GrossupUnitValue
FROM rstore.DPAPMetadata M
INNER JOIN rstore.MetadataType T
ON T.MetadataTypeId = M.MetadataTypeKey
AND T.MetadataTypeName IN ('Grossup Unit','Grossup Value')
WHERE RReleaseKey = @RReleaseKey
GROUP BY DateKey, SafeRegionKey, SafeFormFactorKey
CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #GrossupUnitValue (DateKey, SafeRegionKey, SafeFormFactorKey)
SELECT DS.DateKey
,SalesUnits = SUM(SalesUnits * ISNULL(GUV.GrossupUnit, 1.0))
,SalesValue = SUM(SalesValue * ISNULL(GUV.GrossupValue, 1.0))
,PriorSalesUnits = SUM(PriorSalesUnits * ISNULL(GUV.GrossupUnit, 1.0))
,PriorSalesValue = SUM(PriorSalesValue * ISNULL(GUV.GrossupValue, 1.0))
FROM rstore.OutputDataset DS
LEFT JOIN #GrossupUnitValue GUV
ON GUV.DateKey = DS.DateKey
AND GUV.SafeRegionKey = DS.SafeRegionKey
AND GUV.SafeFormFactorKey = DS.Attribute3ValueKey
WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
AND DS.DateKey >= @DateIdFrom
AND DS.DateKey <= @DateIdTo
GROUP BY DS.DateKey
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
May 17, 2018 at 6:22 am
The idea to join these two lookups is interesting indeed.
I don't think it will help in this immediate problem but it's certainly something I should do.
May 17, 2018 at 6:23 am
Another way of writing the query:

;WITH DS AS
(
    SELECT DISTINCT
           DS.DateKey,
           DS.SafeRegionKey,
           DS.Attribute3ValueKey
    FROM rstore.OutputDataset DS
    WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
      AND DS.DateKey >= @DateIdFrom
      AND DS.DateKey <= @DateIdTo
),
GU AS
(
    SELECT GU.DateKey,
           SUM(SalesUnits * ISNULL(GU.GrossupUnit, 1.0)) SalesUnits,
           SUM(PriorSalesUnits * ISNULL(GU.GrossupUnit, 1.0)) PriorSalesUnits
    FROM DS
    INNER JOIN #GrossupUnit GU
            ON GU.DateKey = DS.DateKey
           AND GU.SafeRegionKey = DS.SafeRegionKey
           AND GU.Attribute3ValueKey = DS.Attribute3ValueKey
    GROUP BY GU.DateKey
),
GV AS
(
    SELECT GV.DateKey,
           SUM(SalesValue * ISNULL(GV.GrossupValue, 1.0)) SalesValue,
           SUM(PriorSalesValue * ISNULL(GV.GrossupValue, 1.0)) PriorSalesValue
    FROM DS
    INNER JOIN #GrossupValue GV
            ON GV.DateKey = DS.DateKey
           AND GV.SafeRegionKey = DS.SafeRegionKey
           AND GV.Attribute3ValueKey = DS.Attribute3ValueKey
    GROUP BY GV.DateKey
)
SELECT DISTINCT
       DS.DateKey,
       ISNULL(GU.SalesUnits, 0) SalesUnits,
       ISNULL(GV.SalesValue, 0) SalesValue,
       ISNULL(GU.PriorSalesUnits, 0) PriorSalesUnits,
       ISNULL(GV.PriorSalesValue, 0) PriorSalesValue
FROM DS
LEFT JOIN GU
       ON GU.DateKey = DS.DateKey
LEFT JOIN GV
       ON GV.DateKey = DS.DateKey
You might get even better performance if you create a new temporary table from the query:

SELECT DISTINCT
DS.DateKey,
DS.SafeRegionKey,
DS.Attribute3ValueKey
INTO #DistinctOutputDataset
FROM rstore.OutputDataset DS
WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
AND DS.DateKey >= @DateIdFrom
AND DS.DateKey <= @DateIdTo
(with appropriate indexes)
and use it to replace the "DS" CTE
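For example (a sketch only; the index choice is an assumption based on the join columns used above), the temp table could be indexed to match the joins and then substituted wherever the "DS" CTE is referenced:

-- Hypothetical index; assumes (DateKey, SafeRegionKey, Attribute3ValueKey)
-- is unique in #DistinctOutputDataset after the DISTINCT above.
CREATE UNIQUE CLUSTERED INDEX ucx_DistinctOutputDataset
    ON #DistinctOutputDataset (DateKey, SafeRegionKey, Attribute3ValueKey);

-- The CTE query above would then start at the GU/GV CTEs and read
-- FROM #DistinctOutputDataset DS instead of FROM DS.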
May 17, 2018 at 6:37 am
Eric Mamet - Thursday, May 17, 2018 6:22 AM: The idea to join these two lookups is interesting indeed.
I don't think it will help in this immediate problem but it's certainly something I should do.
The individual queries both use DISTINCT. Aggregating the extra column in my version won't add much to the total cost of one of the original queries. I'd expect the overall cost to be around half that of the original batch, and the third query is reduced in complexity. Generally this helps to explore tuning possibilities.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
May 17, 2018 at 6:43 am
Thom A - Thursday, May 17, 2018 5:45 AM: I very much doubt you need all of those columns in your Clustered Index ...
Hi Thom,
Thanks for your answer and the interesting link.
I think I am quite familiar with all this but, as often, there are exceptions to general rules.
In fact, my clustered index is not far from the recommendations in that article.
It all works pretty well because the first column in the clustered index is very selective (I don't have more than 600,000 rows for any value).
The query I refer to is the unfortunate exception...
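For what it's worth, that selectivity claim is easy to verify with a quick aggregate (a sketch only; best run against a test copy given the table size):

-- Rows per OutputDatasetHeaderKey: the claim is that no single header key
-- accounts for more than ~600,000 of the 300 million rows.
SELECT OutputDatasetHeaderKey, COUNT_BIG(*) AS RowsPerHeader
FROM rstore.OutputDataset
GROUP BY OutputDatasetHeaderKey
ORDER BY RowsPerHeader DESC;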
May 17, 2018 at 7:31 am
Jonathan AC Roberts - Thursday, May 17, 2018 6:23 AM: Another way of writing the query ...
Very interesting idea!
I'll have to give that a shot as well
Many Thanks
May 17, 2018 at 7:33 am
ChrisM@Work - Thursday, May 17, 2018 6:37 AM: The individual queries both use DISTINCT. Aggregating the extra column in my version won't add much to the total cost of one of the original queries. I'd expect the overall cost to be around half that of the original batch ...
I get your point.
According to the query plan, the cost sits in the first merge between one of the temp tables and the big table, so I would not expect this change to split the cost in half.
However, it's more elegant and simpler so definitely worth a shot.
Thank you very much folks, I did not expect such quick and good answers.
VERY appreciated
Eric
May 17, 2018 at 7:49 am
One thing I noticed in the execution plan is the residual IO. Actual rows is 321,374 but the number of rows read is 15,747,409. Seeing this with four of the operators in the execution plan. This could be slowing it down. But then again I could be wrong.
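The same effect can be seen without the graphical plan (a generic sketch, not specific to this query): when a residual predicate discards most of the rows an operator reads, the I/O reported by STATISTICS IO is far higher than the returned row count would suggest.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the slow aggregation here, then compare the logical reads reported for
-- rstore.OutputDataset with the ~300,000 rows the seek is expected to return;
-- a large gap is the residual-predicate cost showing up as extra I/O.

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;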
May 17, 2018 at 7:53 am
Just curious, but looking at the execution plans for the queries populating the temp tables, is the DISTINCT actually needed? The number of rows entering and leaving the sort operator are the same.
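One quick way to answer that (a sketch; the grouping columns are assumptions based on the source table used in Chris's rewrite above) is to count duplicates in the source before the DISTINCT is applied:

-- Hypothetical check: any row returned means the populating query
-- really does need DISTINCT; no rows means it can be dropped.
SELECT DateKey, SafeRegionKey, SafeFormFactorKey, MetadataTypeKey, COUNT(*) AS Cnt
FROM rstore.DPAPMetadata
WHERE RReleaseKey = @RReleaseKey
GROUP BY DateKey, SafeRegionKey, SafeFormFactorKey, MetadataTypeKey
HAVING COUNT(*) > 1;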
May 17, 2018 at 8:20 am
Lynn Pettis - Thursday, May 17, 2018 7:49 AM: One thing I noticed in the execution plan is the residual IO. Actual rows is 321,374 but the number of rows read is 15,747,409. Seeing this with four of the operators in the execution plan. This could be slowing it down. But then again I could be wrong.
I think you are absolutely right and it is precisely what I am after.
I suspect the various ideas I was given above might help.
Unfortunately, I vastly simplified the initial query context because some of that code is generated on the fly (hence the use of sp_executesql) so I'll have to be a bit careful how I change the thing.
May 17, 2018 at 8:24 am
Your clustered index is fine.
You really don't need the temp tables for this specific query, but if building them is not taking much time, then that's OK too.
But in the final query, since you're only grouping by DateKey, you can pre-summarize the temp tables and avoid a lot of JOIN overhead. Since this may be a generic process, I'm not sure how well you'll be able to adjust the code that generates the query to do this.
SELECT DS.DateKey
,SalesUnits = SUM(SalesUnits * ISNULL(GU.GrossupUnit, 1.0))
,SalesValue = SUM(SalesValue * ISNULL(GV.GrossupValue, 1.0))
,PriorSalesUnits = SUM(PriorSalesUnits * ISNULL(GU.GrossupUnit, 1.0))
,PriorSalesValue = SUM(PriorSalesValue * ISNULL(GV.GrossupValue, 1.0))
FROM rstore.OutputDataset DS
LEFT JOIN (
SELECT DateKey, SUM(GrossupUnit) AS GrossupUnit
FROM #GrossupUnit
GROUP BY DateKey
) AS GU ON GU.DateKey = DS.DateKey
LEFT JOIN (
SELECT DateKey, SUM(GrossupValue) AS GrossupValue
FROM #GrossupValue
GROUP BY DateKey
) AS GV ON GV.DateKey = DS.DateKey
WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
AND DS.DateKey >= @DateIdFrom
AND DS.DateKey <= @DateIdTo
GROUP BY DS.DateKey
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
May 17, 2018 at 8:44 am
ScottPletcher - Thursday, May 17, 2018 8:24 AM: Your clustered index is fine. You really don't need the temp tables for this specific query ... since you're only grouping by DateKey, you can pre-summarize the temp tables and avoid a lot of JOIN overhead ...
I think you are right but the "real" code is actually generic so I don't know in advance what filters I'll have AND what aggregation will be required.
However, I only have a performance problem in this particular scenario (as far as I can tell) so I might try to create a specific sub procedure to implement just this particular case.
This would allow me to make some assumptions and take into consideration the various advice I just got.
I also have occasional problems with re-compilations so having specialised sub-procedure(s) is probably useful in that sense too.
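As a sketch of that idea (the procedure name is invented for illustration, and the temp-table column names are assumed from the queries earlier in the thread), the generic code could branch to a dedicated procedure whenever it detects this exact "group by DateKey only" shape, giving the hot case its own stable plan:

-- Hypothetical specialised sub-procedure for the "group by DateKey only" case.
-- Assumes the caller has already built #GrossupUnit and #GrossupValue
-- (temp tables created in the calling scope are visible here).
CREATE PROCEDURE rstore.OutputDataset_SumByDateKey
    @OutputDatasetHeaderKey INT,
    @DateIdFrom             INT,
    @DateIdTo               INT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT DS.DateKey
          ,SalesUnits      = SUM(DS.SalesUnits      * ISNULL(GU.GrossupUnit, 1.0))
          ,SalesValue      = SUM(DS.SalesValue      * ISNULL(GV.GrossupValue, 1.0))
          ,PriorSalesUnits = SUM(DS.PriorSalesUnits * ISNULL(GU.GrossupUnit, 1.0))
          ,PriorSalesValue = SUM(DS.PriorSalesValue * ISNULL(GV.GrossupValue, 1.0))
    FROM rstore.OutputDataset DS
    LEFT JOIN #GrossupUnit GU
           ON GU.DateKey = DS.DateKey
          AND GU.SafeRegionKey = DS.SafeRegionKey
          AND GU.Attribute3ValueKey = DS.Attribute3ValueKey
    LEFT JOIN #GrossupValue GV
           ON GV.DateKey = DS.DateKey
          AND GV.SafeRegionKey = DS.SafeRegionKey
          AND GV.Attribute3ValueKey = DS.Attribute3ValueKey
    WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
      AND DS.DateKey >= @DateIdFrom
      AND DS.DateKey <= @DateIdTo
    GROUP BY DS.DateKey;
END;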
May 17, 2018 at 9:00 am
Lynn Pettis - Thursday, May 17, 2018 7:53 AM: Just curious, but looking at the execution plans for the queries populating the temp tables, is the DISTINCT actually needed? The number of rows entering and leaving the sort operator are the same.
Well spotted!
The DISTINCTs are necessary because whoever designed these tables made a mistake and we store duplicates!
Not much I can do about this one... I'd like to redesign these tables but not sure whether this will ever happen... :Whistling:
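If that redesign ever does happen, one incremental option (a sketch only; the key columns are assumptions and whether "duplicates" are truly full-row duplicates would need verifying against the real data) is to remove the existing duplicates once and then let a unique index stop new ones, so the populating queries no longer need DISTINCT:

-- Hypothetical cleanup, assuming (RReleaseKey, DateKey, SafeRegionKey,
-- SafeFormFactorKey, MetadataTypeKey) should identify a row in rstore.DPAPMetadata.
WITH Dupes AS
(
    SELECT ROW_NUMBER() OVER (
               PARTITION BY RReleaseKey, DateKey, SafeRegionKey,
                            SafeFormFactorKey, MetadataTypeKey
               ORDER BY (SELECT NULL)) AS rn
    FROM rstore.DPAPMetadata
)
DELETE FROM Dupes
WHERE rn > 1;

-- ...then enforce uniqueness going forward.
CREATE UNIQUE INDEX UX_DPAPMetadata_Key
    ON rstore.DPAPMetadata (RReleaseKey, DateKey, SafeRegionKey,
                            SafeFormFactorKey, MetadataTypeKey);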