Help me find the best index for a specific query

Question

Help me find the best index for a specific query

_simon_

SSCrazy

Points: 2329
More actions
December 28, 2014 at 11:18 am

#299801

This is my table:
use tempdb
go
if object_id('Data', 'u') is not null drop table Data
go
with temp as (
select top 10000 row_number() over (order by c1.object_id) Id
from sys.columns c1 cross join sys.columns c2
)
select
Id,
((Id - 1) / 5000) + 1 as StateId,
((Id - 1) / 1000) + 1 as PalletId,
((Id - 1) / 100) + 1 as BoxId
into Data
from temp
order by Id
In reality the number of rows is a bit more than 100M. Those are the only queries that will be used on the table:
select StateId, min(Id), max(Id)
from Data
group by StateId
order by StateId, min(Id)
select StateId, PalletId, min(Id), max(Id)
from Data
group by StateId, PalletId
order by StateId, PalletId, min(Id)
select StateId, PalletId, BoxId, min(Id), max(Id)
from Data
group by StateId, PalletId, BoxId
order by StateId, PalletId, BoxId, min(Id)
What index would be best for these three queries? With best I mean the execution time, I don't care about additional space.
This is the index I currently use:
create nonclustered index Ix_Data on Data (StateId, PalletId, BoxId, Id)
The execution plan is SELECT (0%) - Stream Aggregate (10%) - Index Scan (90%). Can this be optimized (maybe to use Index Seek method)?

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

That is about the best index. I'd put ID as an include, as it's a bit wasted in the key.

An index seek operation requires a filter predicate. None of your queries have one, so it is not possible to search. In all cases, you're asking for all the rows in the table, so an operation which returns all the rows in the table (scan) is indicated

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

_simon_ SSCrazy Points: 2329 More actions · Answer 2

_simon_

SSCrazy

Points: 2329

December 28, 2014 at 12:02 pm

#1766619

Thanks!

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 3

To seek, you need a predicate, something to seek for. A seek is defined as a navigation down the index tree to find a value. You do not have anything that can be used as a predicate. Your query requires every row of the table, therefore a scan (in this case of the index) is required.

Yes, it requires just the index, but that's exactly what you're getting, an index scan.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

ScottPletcher SSC Guru Points: 101133 More actions · Answer 4

You should make it a clustered index. You should also explicitly specify the appropriate FILLFACTOR and filegroup (even if it's PRIMARY, as that may not be the current default fg in the db).

I can't tell you what the best fillfactor would be without more details about the table data. If you're unsure, start with 95 (=95%) and work your way up or down from there as best suits that particular table.

CREATE CLUSTERED INDEX CL_Data ON Data ( StateId, PalletId, BoxId )

WITH ( FILLFACTOR = 95 /*assign better value if you can determine it!*/ )

ON [PRIMARY]

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

_simon_ SSCrazy Points: 2329 More actions · Answer 5

@scottpletcher - I tried it now and it seems that both clustered and nonclustered indexes are running for about the same time.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 6

_simon_ (12/30/2014)
@ScottPletcher - I tried it now and it seems that both clustered and nonclustered indexes are running for about the same time.

I found the same, with the test data scaled up to match your real table:

IF object_id('Data', 'u') IS NOT NULL DROP TABLE Data;

CREATE TABLE Data (Id INT NOT NULL, StateId INT NOT NULL, PalletId INT NOT NULL, BoxId INT NOT NULL);

GO

WITH temp AS (

SELECT TOP 100000000

Id = CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT)

FROM sys.columns c1 CROSS JOIN sys.columns c2 CROSS JOIN sys.columns c3

)

INSERT INTO Data (Id, StateId, PalletId, BoxId)

SELECT

Id,

[StateId]= ((Id - 1) / 5000) + 1,

[PalletId]= ((Id - 1) / 1000) + 1,

[BoxId]= ((Id - 1) / 100) + 1

FROM temp

ORDER BY Id;

The simplest query returns 20,000 rows in about 5s:

SELECT StateId, MIN_Id = MIN(Id), MAX_Id = MAX(Id) FROM Data GROUP BY StateId ORDER BY StateId; -- (20000 row(s) affected) / 00:00:05

That's a relatively small number of discrete values for StateId and reminded me of this little gem, so I did a little testing. If you cluster the table on just Id

CREATE UNIQUE CLUSTERED INDEX ucx_Data ON Data (Id)

then you have some flexibility on the non-clustered indexes. I created a simple one, expecting it to match the query I'll show in a minute:

CREATE UNIQUE INDEX ix_StateId ON Data (StateId, Id)

Note that the query runs only marginally faster if Id is a key in this index, compared to INCLUDEing it.

Here's the query:

WITH RecursiveCTE

AS (

SELECT TOP 1 StateId, Id

FROM dbo.Data

ORDER BY StateId, Id

UNION ALL

SELECT r.StateId, r.Id

FROM (

SELECT t.StateId, t.Id,

rn = ROW_NUMBER() OVER (ORDER BY t.StateId, t.Id)

FROM dbo.Data t

JOIN RecursiveCTE ri

ON ri.StateId < t.StateId

) r

WHERE r.rn = 1

)

SELECT r.StateId, r.Id, x.Id

FROM RecursiveCTE r

CROSS APPLY (

SELECT TOP 1 Id

FROM dbo.Data t

WHERE t.StateId = r.StateId

ORDER BY Id DESC

) x

OPTION (MAXRECURSION 0);

Note that it uses a scan to determine the anchor row of the recursive CTE, the remaining rows are returned using seeks. On our little play server it runs in about 450ms, ten times faster than the original query.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

ScottPletcher SSC Guru Points: 101133 More actions · Answer 7

_simon_ (12/30/2014)
@ScottPletcher - I tried it now and it seems that both clustered and nonclustered indexes are running for about the same time.

Keep in mind that using the clustered index also means you don't have to store all the data twice, as you do with a nonclustered index over a heap or other cluster.

I should have included ID in the index, because of the queries being run. That's the best times you can reasonably expect without creating customized, covering non-clustered index(es), and I'm not sure those will really help much either.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".