different estimated rows on same index operation?

Question

Post reply

different estimated rows on same index operation?

WolfgangE

SSCertifiable

Points: 6328
More actions
March 16, 2015 at 9:49 am

#305767

hello sql-friends,
the question is about index statistics, this is why I chose the performance forum, not the SQL-forum.
introduction and background
I had to optimize a simple query (example below). After rewriting it several times I recognized that the estimated row count on the one and same index operation differs depending on the way the query is written.
Originally the query did a clustered index scan, as the table in production contains a binary column the table is quite large (about 100 GB) and the full table scan takes too much time to execute.
the question
Why is the estimated row count different on the same index operation (example will show)? What is the optimizer doing here?
the example database
I tried to create a very simplyfied version of my production tables (thanks to my colleague) that show the behaviour.
-- CREATE THE SAMPLE TABLES
----------------------------
CREATE TABLE dbo.MasterTable(
MasterId smallint NOT NULL,
Name varchar(5) NOT NULL,
CONSTRAINT PK_MasterTable PRIMARY KEY CLUSTERED (MasterId ASC)
) ON [PRIMARY]
GO
CREATE TABLE dbo.DetailTable(
DetailId bigint IDENTITY(1,1) NOT NULL,
MasterId smallint NOT NULL,
Name nvarchar(50) NOT NULL,
CreateDate datetime NOT NULL,
CONSTRAINT PK_DetailTable PRIMARY KEY CLUSTERED (DetailId ASC)
) ON [PRIMARY]
GO
ALTER TABLE dbo.DetailTable
ADD CONSTRAINT FK1
FOREIGN KEY(MasterId) REFERENCES dbo.MasterTable (MasterId)
GO
CREATE NONCLUSTERED INDEX IX_DetailTable
ON dbo.DetailTable( MasterId ASC, Name ASC )
GO
-- INSERT SOME SAMPLE DATA
----------------------------
SET NOCOUNT ON
go
-- These are some Codes. In our system we always use these codes tosearch for "types" of data.
INSERT INTO dbo.MasterTable (MasterId, Name)
VALUES (1, 'N1'), (2, 'N2'), (3, 'N3'), (4, 'N4'), (5, 'N5'), (6, 'N6'), (7, 'N7'), (8, 'N8')
go
-- ADD ROWS TO THE DETAIL TABLE
-- Takes about 1 minute to run
-- Don't care about the logic, it's just to get a distribution similar to production system
----------------------------
declare @x int = 1
DECLARE @MasterID INT
while (@x <= 400000)
begin
SET @MasterID = ABS(CHECKSUM(NEWID())) % 8 + 1
INSERT INTO dbo.DetailTable
(
MasterId,
Name,
CreateDate)
VALUES
(
CASE WHEN @MasterID IN (1, 3, 4) AND @x % 20 != 0 THEN 2
WHEN @MasterID IN (5, 6) AND @x % 20 != 0 THEN 7
WHEN @MasterID = 8 AND @x % 100 != 0 THEN 7
ELSE @MasterID
END,
NEWID(),
DATEADD(DAY, - ABS(CHECKSUM(NEWID())) % 1000, GETDATE())
)
SET @x = @x + 1
end
go
-- DO THE INDEX AND STATISTIC MAINTENANCE
----------------------------
alter index all on dbo.DetailTable reorganize
alter index all on dbo.MasterTable reorganize
update statistics dbo.DetailTable WITH FULLSCAN
update statistics dbo.MasterTable WITH FULLSCAN
go
Preparation is done, let's start with the query
Let's have a look at the statistics first, look at RANGE_HI_KEY=8, there are 489 EQ_ROWS
-- CHECK THE STATISTICS
----------------------------
dbcc show_statistics ('dbo.DetailTable', IX_DetailTable)
go
No we do the query. The first one is the original query I had to optimize.
Please activate the current execution plan when executing.
Have a look at the operation "index seek (nonclustered) [DetailTable].[IX_DetailTable]"
-- ORIGINAL QUERY
----------------------------
SELECT
d.DetailId
FROM dbo.DetailTable d
INNER JOIN dbo.MasterTable m ON d.MasterId = m.MasterId
WHERE m.Name = 'N8'
AND d.CreateDate > '20150312 11:00:00'
go
-- FORCESEEK
----------------------------
SELECT
d.DetailId
FROM dbo.DetailTable d WITH (FORCESEEK)
INNER JOIN dbo.MasterTable m ON d.MasterId = m.MasterId
WHERE m.Name = 'N8'
AND d.CreateDate > '20150312 11:00:00'
go
-- Actual: 495, Estimated 50.000
-- TABLE VARIABLE
----------------------------
DECLARE @MasterId AS TABLE(MasterId SMALLINT )
INSERT INTO @MasterId (MasterId)
SELECT MasterID FROM dbo.MasterTable WHERE Name = 'N8'
SELECT
d.DetailId
FROM dbo.DetailTable d WITH (FORCESEEK)
INNER JOIN @MasterId m ON d.MasterId = m.MasterId
WHERE d.CreateDate > '20150312 11:00:00'
go
-- Actual: 495, Estimated 40.000
-- TEMP TABLE
----------------------------
CREATE TABLE #MasterId( MasterId SMALLINT )
INSERT INTO #MasterId (MasterId) SELECT MasterID FROM dbo.MasterTable WHERE Name = 'N8'
SELECT
d.DetailId
FROM dbo.DetailTable d --WITH (FORCESEEK)
INNER JOIN #MasterId m ON d.MasterId = m.MasterId
WHERE d.CreateDate > '20150312 11:00:00'
-- Actual 489, Estimated 489
DROP TABLE #MasterId
go
And of corse the cleanup 🙂
-- CLEANUP
----------------------------
drop table dbo.DetailTable;
drop table dbo.MasterTable;
go
Analyse and final question(s)
Please have a look at the operation "index seek (nonclustered) [DetailTable].[IX_DetailTable]"
The comments in the skript above show you the values I got for estimated and actual row count.
In our production environment this table has 33 million rows, the estimated rows in the queries above differ from 3 million to 16 million.
The question is why do these values differ? The statistics are up to date and making an estimation should really be easy, no?
What is optimizer doing here and why?
I just would like to understand this.
And finally: what would your suggestion be to optimize this query if you must avoid the clustered index scan?
Thank you for reading this 🙂 i'm looking forward for some interesting replies 🙂
Have a nice week, Wolfgang

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

Try these indexes:

CREATE INDEX ix_MasterId_Etc ON DetailTable (MasterId, CreateDate) INCLUDE (DetailId)

CREATE INDEX ix_MasterId_Etc ON MasterTable (Name, MasterId)

These are a first guess based on the join columns, filters and output columns in your sample query.

Edit: DetailId is in the clustered index.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

WolfgangE SSCertifiable Points: 6328 More actions · Answer 2

hi,

thank you for the answer.

I know I could optimize the index, but that's not the point. I'd like to understand why optimizer is estimating the way it does 🙂

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

WolfgangE (3/16/2015)
hi,
thank you for the answer.
I know I could optimize the index, but that's not the point. I'd like to understand why optimizer is estimating the way it does 🙂

This is how I understand it - I'm sure there will be better explanations to follow.

-- Drop the CreateDate column for simplicity, the estimates are higher

-- than the actuals by the same factor

-- 512 is the exact number of rows in DetailTable where MasterId = 8

SELECT

d.DetailId

FROM dbo.DetailTable d -- Est: 512 Act: 512

INNER JOIN dbo.MasterTable m

ON d.MasterId = m.MasterId

WHERE m.MasterId = 8

-- 50,000 is the average number of rows for each MasterId in DetailTable.

-- The optimiser can't know what MasterId it will be at run time.

SELECT

d.DetailId

FROM dbo.DetailTable d -- Est: 50,000 Act: 512

INNER JOIN dbo.MasterTable m

ON d.MasterId = m.MasterId

WHERE m.Name = 'N8'

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

MadAdmin SSChampion Points: 11261 More actions · Answer 4

Take into account that table variable stats is hard coded to be 1 row where temp table has actual row count.

This would be something to consider as Table variable might make nested loop joins occur and/or carry forward an underestimated row count..

Catch-all queries done right [/url]
Gail Shaw's Performance Blog[/url]