May 17, 2018 at 5:23 am
I have one "big" table with 300 million rows... CREATE TABLE rstore.OutputDataset(OutputDatasetId INT NOT NULL IDENTITY(1,1)
,OutputDatasetHeaderKey INT NOT NULL
,DateKey INT NOT NULL
,SafeRegionKey SMALLINT NOT NULL
,SafeCountryKey SMALLINT NOT NULL
,Attribute1ValueKey INT NULL
,Attribute2ValueKey INT NULL
,Attribute3ValueKey INT NULL
,Attribute4ValueKey INT NULL
,Attribute5ValueKey INT NULL
,Attribute6ValueKey INT NULL
,Attribute7ValueKey INT NULL
,Attribute8ValueKey INT NULL
,Attribute9ValueKey INT NULL
,Attribute10ValueKey INT NULL
,SalesUnits DECIMAL(19, 4) NOT NULL
,SalesValue DECIMAL(19, 4) NOT NULL
,PriorSalesUnits DECIMAL(19, 4) NOT NULL DEFAULT 0
,PriorSalesValue DECIMAL(19, 4) NOT NULL DEFAULT 0
,DataStatusTypeKey TINYINT NOT NULL
,CONSTRAINT PK_OutputDataset PRIMARY KEY NONCLUSTERED(OutputDatasetId)
,CONSTRAINT FK_OutputDataset_... -- quite a few foreign keys
);
BUT a beautiful clustered index that reduces the number of rows fetched to ~300,000 (there is no nonclustered index... since my clustered index is so big anyway):

CREATE UNIQUE CLUSTERED INDEX UCIX_OutputDataSet ON rstore.OutputDataset(
OutputDatasetHeaderKey
,DateKey, SafeRegionKey, SafeCountryKey
,Attribute1ValueKey ,Attribute2ValueKey ,Attribute3ValueKey ,Attribute4ValueKey ,Attribute5ValueKey ,Attribute6ValueKey ,Attribute7ValueKey ,Attribute8ValueKey ,Attribute9ValueKey ,Attribute10ValueKey
);
ALL our queries specify the column OutputDatasetHeaderKey (which shrinks the number of rows by a lot), usually followed by a DateKey range, and most of the time by the Region Key, etc
The attached query is disappointingly slow (~11 seconds on my test server)
The two temp tables get loaded with about 10,000 rows each, half of them useful.
The main table retrieves about 300,000 rows through the clustered index.
In my particular case, the columns Attribute1ValueKey to Attribute5ValueKey have values, so the first 9 columns of the clustered index are not null.
What I don't understand is that the optimiser does not seem to be able to use the fact that all three clustered indexes have similar columns in the same order.
On the temp tables, the clustered PK is DateKey, SafeRegionKey and Attribute3ValueKey, all of which appear in my main table's clustered index.
Although I don't filter my main table on SafeRegionKey, this column is still NOT NULL and part of the clustered index...
Not quite sure how to improve this query...
As a side note, if I specify a SafeRegionKey to filter on, then everything seems fine... (the last statement is actually dynamic SQL that I generate and execute through sp_executesql)
I can see that after the merge between OutputDataset and #GrossupUnit, ~16 million rows have been read... but I have no idea how to stop it :crazy:
Any suggestions?
Thanks
Eric
May 17, 2018 at 5:45 am
I very much doubt you need all of those columns in your Clustered Index. Have a look at this article from Redgate: https://www.red-gate.com/simple-talk/sql/learn-sql-server/effective-clustered-indexes/.
Notice that the article says a clustered index should be Unique, Static, Ever-Increasing and, first and foremost, Narrow. 14 columns (if I counted correctly) is far too many for a clustered index, and I doubt that those values are Static either.
The first step is reviewing your Clustered Index, and the next will be what other indexes will be beneficial for your table.
Also, the fact that you have columns Attribute1ValueKey to Attribute10ValueKey implies you have de-normalised data. This isn't going to help performance either; I strongly suggest reconsidering your data structure there and using a normalised format.
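To illustrate what a narrower design might look like (a sketch only, not a tested recommendation; the index name and INCLUDE list are assumptions): cluster the table on the narrow identity column and give the common filter pattern its own nonclustered index, something like:

-- Sketch only: if the table were clustered on the narrow identity column
-- (OutputDatasetId), the common access path (header key, then a DateKey range)
-- could be served by a nonclustered index, made covering via INCLUDE.
CREATE NONCLUSTERED INDEX IX_OutputDataset_Header_Date
    ON rstore.OutputDataset (OutputDatasetHeaderKey, DateKey, SafeRegionKey, Attribute3ValueKey)
    INCLUDE (SalesUnits, SalesValue, PriorSalesUnits, PriorSalesValue);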
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
May 17, 2018 at 6:13 am
Why not combine those two lookups into one? Other than that, what Thom said. The clustered index is far too wide. Don't forget that any ordinary (nonclustered) index on the same table will contain the keys of the clustered index.
SELECT
DateKey,
SafeRegionKey,
SafeFormFactorKey,
GrossupUnit = MAX(CASE WHEN T.MetadataTypeName = 'Grossup Unit' THEN MetadataValue ELSE NULL END),
GrossupValue = MAX(CASE WHEN T.MetadataTypeName = 'Grossup Value' THEN MetadataValue ELSE NULL END)
INTO #GrossupUnitValue
FROM rstore.DPAPMetadata M
INNER JOIN rstore.MetadataType T
ON T.MetadataTypeId = M.MetadataTypeKey
AND T.MetadataTypeName IN ('Grossup Unit','Grossup Value')
WHERE RReleaseKey = @RReleaseKey
GROUP BY DateKey, SafeRegionKey, SafeFormFactorKey
CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #GrossupUnitValue (DateKey, SafeRegionKey, SafeFormFactorKey)
SELECT DS.DateKey
,SalesUnits = SUM(SalesUnits * ISNULL(GUV.GrossupUnit, 1.0))
,SalesValue = SUM(SalesValue * ISNULL(GUV.GrossupValue, 1.0))
,PriorSalesUnits = SUM(PriorSalesUnits * ISNULL(GUV.GrossupUnit, 1.0))
,PriorSalesValue = SUM(PriorSalesValue * ISNULL(GUV.GrossupValue, 1.0))
FROM rstore.OutputDataset DS
LEFT JOIN #GrossupUnitValue GUV
ON GUV.DateKey = DS.DateKey
AND GUV.SafeRegionKey = DS.SafeRegionKey
AND GUV.SafeFormFactorKey = DS.Attribute3ValueKey
WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
AND DS.DateKey >= @DateIdFrom
AND DS.DateKey <= @DateIdTo
GROUP BY DS.DateKey
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
May 17, 2018 at 6:22 am
The idea to join these two lookups is interesting indeed.
I don't think it will help in this immediate problem but it's certainly something I should do.
May 17, 2018 at 6:23 am
Another way of writing the query:

;WITH DS AS
(
    SELECT DISTINCT
           DS.DateKey,
           DS.SafeRegionKey,
           DS.Attribute3ValueKey
    FROM rstore.OutputDataset DS
    WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
      AND DS.DateKey >= @DateIdFrom
      AND DS.DateKey <= @DateIdTo
),
GU AS
(
    SELECT GU.DateKey,
           SUM(SalesUnits * ISNULL(GU.GrossupUnit, 1.0)) SalesUnits,
           SUM(PriorSalesUnits * ISNULL(GU.GrossupUnit, 1.0)) PriorSalesUnits
    FROM DS
    INNER JOIN #GrossupUnit GU
            ON GU.DateKey = DS.DateKey
           AND GU.SafeRegionKey = DS.SafeRegionKey
           AND GU.Attribute3ValueKey = DS.Attribute3ValueKey
    GROUP BY GU.DateKey
),
GV AS
(
    SELECT GV.DateKey,
           SUM(SalesValue * ISNULL(GV.GrossupValue, 1.0)) SalesValue,
           SUM(PriorSalesValue * ISNULL(GV.GrossupValue, 1.0)) PriorSalesValue
    FROM DS
    INNER JOIN #GrossupValue GV
            ON GV.DateKey = DS.DateKey
           AND GV.SafeRegionKey = DS.SafeRegionKey
           AND GV.Attribute3ValueKey = DS.Attribute3ValueKey
    GROUP BY GV.DateKey
)
SELECT DISTINCT
       DS.DateKey,
       ISNULL(GU.SalesUnits, 0) SalesUnits,
       ISNULL(GV.SalesValue, 0) SalesValue,
       ISNULL(GU.PriorSalesUnits, 0) PriorSalesUnits,
       ISNULL(GV.PriorSalesValue, 0) PriorSalesValue
FROM DS
LEFT JOIN GU
       ON GU.DateKey = DS.DateKey
LEFT JOIN GV
       ON GV.DateKey = DS.DateKey
You might get even better performance if you create a new temporary table from the query:

SELECT DISTINCT
DS.DateKey,
DS.SafeRegionKey,
DS.Attribute3ValueKey
INTO #DistinctOutputDataset
FROM rstore.OutputDataset DS
WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
AND DS.DateKey >= @DateIdFrom
AND DS.DateKey <= @DateIdTo
(with appropriate indexes)
and use it to replace the "DS" CTE
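For example (a sketch only; the index choice is an assumption based on the join columns used above), the temp table could be indexed to match the joins and then substituted wherever the "DS" CTE is referenced:

-- Hypothetical index; assumes (DateKey, SafeRegionKey, Attribute3ValueKey)
-- is unique in #DistinctOutputDataset after the DISTINCT above.
CREATE UNIQUE CLUSTERED INDEX ucx_DistinctOutputDataset
    ON #DistinctOutputDataset (DateKey, SafeRegionKey, Attribute3ValueKey);

-- The CTE query above would then start at the GU/GV CTEs and read
-- FROM #DistinctOutputDataset DS instead of FROM DS.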
May 17, 2018 at 6:37 am
Eric Mamet - Thursday, May 17, 2018 6:22 AM: The idea to join these two lookups is interesting indeed.
I don't think it will help in this immediate problem but it's certainly something I should do.
The individual queries both use DISTINCT. Aggregating the extra column in my version won't add much to the total cost of one of the original queries. I'd expect the overall cost to be around half that of the original batch, and the third query is reduced in complexity. Generally this helps to explore tuning possibilities.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
May 17, 2018 at 6:43 am
Thom A - Thursday, May 17, 2018 5:45 AM: I very much doubt you need all of those columns in your Clustered Index ...
Hi Thom,
Thanks for your answer and the interesting link.
I think I am quite familiar with all this but, as often, there are exceptions to general rules.
In fact, my clustered index is not far from the recommendations in that article.
It all works pretty well because the first column in the clustered index is very selective (I don't have more than 600,000 rows for any value).
The query I refer to is the unfortunate exception...
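For what it's worth, that selectivity claim is easy to verify with a quick aggregate (a sketch only; best run against a test copy given the table size):

-- Rows per OutputDatasetHeaderKey: the claim is that no single header key
-- accounts for more than ~600,000 of the 300 million rows.
SELECT OutputDatasetHeaderKey, COUNT_BIG(*) AS RowsPerHeader
FROM rstore.OutputDataset
GROUP BY OutputDatasetHeaderKey
ORDER BY RowsPerHeader DESC;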
May 17, 2018 at 7:31 am
Jonathan AC Roberts - Thursday, May 17, 2018 6:23 AM: Another way of writing the query ...
Very interesting idea!
I'll have to give that a shot as well
Many Thanks
May 17, 2018 at 7:33 am
ChrisM@Work - Thursday, May 17, 2018 6:37 AM: The individual queries both use DISTINCT. Aggregating the extra column in my version won't add much to the total cost of one of the original queries. I'd expect the overall cost to be around half that of the original batch ...
I get your point.
According to the query plan, the cost sits in the first merge between one of the temp tables and the big table, so I would not expect this change to split the cost in half.
However, it's more elegant and simpler so definitely worth a shot.
Thank you very much folks, I did not expect such quick and good answers.
VERY appreciated
Eric
May 17, 2018 at 7:49 am
One thing I noticed in the execution plan is the residual IO. Actual rows is 321,374 but the number of rows read is 15,747,409. Seeing this with four of the operators in the execution plan. This could be slowing it down. But then again I could be wrong.
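The same effect can be seen without the graphical plan (a generic sketch, not specific to this query): when a residual predicate discards most of the rows an operator reads, the I/O reported by STATISTICS IO is far higher than the returned row count would suggest.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the slow aggregation here, then compare the logical reads reported for
-- rstore.OutputDataset with the ~300,000 rows the seek is expected to return;
-- a large gap is the residual-predicate cost showing up as extra I/O.

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;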
May 17, 2018 at 7:53 am
Just curious, but looking at the execution plans for the queries populating the temp tables, is the DISTINCT actually needed? The number of rows entering and leaving the sort operator are the same.
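One quick way to answer that (a sketch; the grouping columns are assumptions based on the source table used in Chris's rewrite above) is to count duplicates in the source before the DISTINCT is applied:

-- Hypothetical check: any row returned means the populating query
-- really does need DISTINCT; no rows means it can be dropped.
SELECT DateKey, SafeRegionKey, SafeFormFactorKey, MetadataTypeKey, COUNT(*) AS Cnt
FROM rstore.DPAPMetadata
WHERE RReleaseKey = @RReleaseKey
GROUP BY DateKey, SafeRegionKey, SafeFormFactorKey, MetadataTypeKey
HAVING COUNT(*) > 1;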
May 17, 2018 at 8:20 am
Lynn Pettis - Thursday, May 17, 2018 7:49 AM: One thing I noticed in the execution plan is the residual IO. Actual rows is 321,374 but the number of rows read is 15,747,409. Seeing this with four of the operators in the execution plan. This could be slowing it down. But then again I could be wrong.
I think you are absolutely right and it is precisely what I am after.
I suspect the various ideas I was given above might help.
Unfortunately, I vastly simplified the initial query context because some of that code is generated on the fly (hence the use of sp_executesql) so I'll have to be a bit careful how I change the thing.
May 17, 2018 at 8:24 am
Your clustered index is fine.
You really don't need the temp tables for this specific query, but if building them is not taking much time, then that's OK too.
But in the final query, since you're only grouping by DateKey, you can pre-summarize the temp tables and avoid a lot of JOIN overhead. Since this may be a generic process, I'm not sure how well you'll be able to adjust the code that generates the query to do this.
SELECT DS.DateKey
,SalesUnits = SUM(SalesUnits * ISNULL(GU.GrossupUnit, 1.0))
,SalesValue = SUM(SalesValue * ISNULL(GV.GrossupValue, 1.0))
,PriorSalesUnits = SUM(PriorSalesUnits * ISNULL(GU.GrossupUnit, 1.0))
,PriorSalesValue = SUM(PriorSalesValue * ISNULL(GV.GrossupValue, 1.0))
FROM rstore.OutputDataset DS
LEFT JOIN (
SELECT DateKey, SUM(GrossupUnit) AS GrossupUnit
FROM #GrossupUnit
GROUP BY DateKey
) AS GU ON GU.DateKey = DS.DateKey
LEFT JOIN (
SELECT DateKey, SUM(GrossupValue) AS GrossupValue
FROM #GrossupValue
GROUP BY DateKey
) AS GV ON GV.DateKey = DS.DateKey
WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
AND DS.DateKey >= @DateIdFrom
AND DS.DateKey <= @DateIdTo
GROUP BY DS.DateKey
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
May 17, 2018 at 8:44 am
ScottPletcher - Thursday, May 17, 2018 8:24 AM: Your clustered index is fine. You really don't need the temp tables for this specific query ... since you're only grouping by DateKey, you can pre-summarize the temp tables and avoid a lot of JOIN overhead ...
I think you are right but the "real" code is actually generic so I don't know in advance what filters I'll have AND what aggregation will be required.
However, I only have a performance problem in this particular scenario (as far as I can tell) so I might try to create a specific sub procedure to implement just this particular case.
This would allow me to make some assumptions and take into consideration the various advice I just got.
I also have occasional problems with re-compilations so having specialised sub-procedure(s) is probably useful in that sense too.
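As a sketch of that idea (the procedure name is invented for illustration, and the temp-table column names are assumed from the queries earlier in the thread), the generic code could branch to a dedicated procedure whenever it detects this exact "group by DateKey only" shape, giving the hot case its own stable plan:

-- Hypothetical specialised sub-procedure for the "group by DateKey only" case.
-- Assumes the caller has already built #GrossupUnit and #GrossupValue
-- (temp tables created in the calling scope are visible here).
CREATE PROCEDURE rstore.OutputDataset_SumByDateKey
    @OutputDatasetHeaderKey INT,
    @DateIdFrom             INT,
    @DateIdTo               INT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT DS.DateKey
          ,SalesUnits      = SUM(DS.SalesUnits      * ISNULL(GU.GrossupUnit, 1.0))
          ,SalesValue      = SUM(DS.SalesValue      * ISNULL(GV.GrossupValue, 1.0))
          ,PriorSalesUnits = SUM(DS.PriorSalesUnits * ISNULL(GU.GrossupUnit, 1.0))
          ,PriorSalesValue = SUM(DS.PriorSalesValue * ISNULL(GV.GrossupValue, 1.0))
    FROM rstore.OutputDataset DS
    LEFT JOIN #GrossupUnit GU
           ON GU.DateKey = DS.DateKey
          AND GU.SafeRegionKey = DS.SafeRegionKey
          AND GU.Attribute3ValueKey = DS.Attribute3ValueKey
    LEFT JOIN #GrossupValue GV
           ON GV.DateKey = DS.DateKey
          AND GV.SafeRegionKey = DS.SafeRegionKey
          AND GV.Attribute3ValueKey = DS.Attribute3ValueKey
    WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
      AND DS.DateKey >= @DateIdFrom
      AND DS.DateKey <= @DateIdTo
    GROUP BY DS.DateKey;
END;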
May 17, 2018 at 9:00 am
Lynn Pettis - Thursday, May 17, 2018 7:53 AM: Just curious, but looking at the execution plans for the queries populating the temp tables, is the DISTINCT actually needed? The number of rows entering and leaving the sort operator are the same.
Well spotted!
The DISTINCTs are necessary because whoever designed these tables made a mistake and we store duplicates!
Not much I can do about this one... I'd like to redesign these tables but not sure whether this will ever happen... :Whistling:
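If that redesign ever does happen, one incremental option (a sketch only; the key columns are assumptions and whether "duplicates" are truly full-row duplicates would need verifying against the real data) is to remove the existing duplicates once and then let a unique index stop new ones, so the populating queries no longer need DISTINCT:

-- Hypothetical cleanup, assuming (RReleaseKey, DateKey, SafeRegionKey,
-- SafeFormFactorKey, MetadataTypeKey) should identify a row in rstore.DPAPMetadata.
WITH Dupes AS
(
    SELECT ROW_NUMBER() OVER (
               PARTITION BY RReleaseKey, DateKey, SafeRegionKey,
                            SafeFormFactorKey, MetadataTypeKey
               ORDER BY (SELECT NULL)) AS rn
    FROM rstore.DPAPMetadata
)
DELETE FROM Dupes
WHERE rn > 1;

-- ...then enforce uniqueness going forward.
CREATE UNIQUE INDEX UX_DPAPMetadata_Key
    ON rstore.DPAPMetadata (RReleaseKey, DateKey, SafeRegionKey,
                            SafeFormFactorKey, MetadataTypeKey);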