Help me find the best index for a specific query

  • This is my table:

    use tempdb

    go

    if object_id('Data', 'u') is not null drop table Data

    go

    with temp as (

    select top 10000 row_number() over (order by c1.object_id) Id

    from sys.columns c1 cross join sys.columns c2

    )

    select

    Id,

    ((Id - 1) / 5000) + 1 as StateId,

    ((Id - 1) / 1000) + 1 as PalletId,

    ((Id - 1) / 100) + 1 as BoxId

    into Data

    from temp

    order by Id

    In reality the number of rows is a bit more than 100M. Those are the only queries that will be used on the table:

    select StateId, min(Id), max(Id)

    from Data

    group by StateId

    order by StateId, min(Id)

    select StateId, PalletId, min(Id), max(Id)

    from Data

    group by StateId, PalletId

    order by StateId, PalletId, min(Id)

    select StateId, PalletId, BoxId, min(Id), max(Id)

    from Data

    group by StateId, PalletId, BoxId

    order by StateId, PalletId, BoxId, min(Id)

    What index would be best for these three queries? With best I mean the execution time, I don't care about additional space.

    This is the index I currently use:

    create nonclustered index Ix_Data on Data (StateId, PalletId, BoxId, Id)

    The execution plan is SELECT (0%) - Stream Aggregate (10%) - Index Scan (90%). Can this be optimized (maybe to use Index Seek method)?

  • That is about the best index. I'd put ID as an include, as it's a bit wasted in the key.

    An index seek operation requires a filter predicate. None of your queries have one, so it is not possible to search. In all cases, you're asking for all the rows in the table, so an operation which returns all the rows in the table (scan) is indicated

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks!

  • To seek, you need a predicate, something to seek for. A seek is defined as a navigation down the index tree to find a value. You do not have anything that can be used as a predicate. Your query requires every row of the table, therefore a scan (in this case of the index) is required.

    Yes, it requires just the index, but that's exactly what you're getting, an index scan.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • You should make it a clustered index. You should also explicitly specify the appropriate FILLFACTOR and filegroup (even if it's PRIMARY, as that may not be the current default fg in the db).

    I can't tell you what the best fillfactor would be without more details about the table data. If you're unsure, start with 95 (=95%) and work your way up or down from there as best suits that particular table.

    CREATE CLUSTERED INDEX CL_Data ON Data ( StateId, PalletId, BoxId )

    WITH ( FILLFACTOR = 95 /*assign better value if you can determine it!*/ )

    ON [PRIMARY]

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • @scottpletcher - I tried it now and it seems that both clustered and nonclustered indexes are running for about the same time.

  • _simon_ (12/30/2014)


    @ScottPletcher - I tried it now and it seems that both clustered and nonclustered indexes are running for about the same time.

    I found the same, with the test data scaled up to match your real table:

    IF object_id('Data', 'u') IS NOT NULL DROP TABLE Data;

    CREATE TABLE Data (Id INT NOT NULL, StateId INT NOT NULL, PalletId INT NOT NULL, BoxId INT NOT NULL);

    GO

    WITH temp AS (

    SELECT TOP 100000000

    Id = CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT)

    FROM sys.columns c1 CROSS JOIN sys.columns c2 CROSS JOIN sys.columns c3

    )

    INSERT INTO Data (Id, StateId, PalletId, BoxId)

    SELECT

    Id,

    [StateId]= ((Id - 1) / 5000) + 1,

    [PalletId]= ((Id - 1) / 1000) + 1,

    [BoxId]= ((Id - 1) / 100) + 1

    FROM temp

    ORDER BY Id;

    The simplest query returns 20,000 rows in about 5s:

    SELECT StateId, MIN_Id = MIN(Id), MAX_Id = MAX(Id) FROM Data GROUP BY StateId ORDER BY StateId; -- (20000 row(s) affected) / 00:00:05

    That's a relatively small number of discrete values for StateId and reminded me of this little gem, so I did a little testing. If you cluster the table on just Id

    CREATE UNIQUE CLUSTERED INDEX ucx_Data ON Data (Id)

    then you have some flexibility on the non-clustered indexes. I created a simple one, expecting it to match the query I'll show in a minute:

    CREATE UNIQUE INDEX ix_StateId ON Data (StateId, Id)

    Note that the query runs only marginally faster if Id is a key in this index, compared to INCLUDEing it.

    Here's the query:

    WITH RecursiveCTE

    AS (

    SELECT TOP 1 StateId, Id

    FROM dbo.Data

    ORDER BY StateId, Id

    UNION ALL

    SELECT r.StateId, r.Id

    FROM (

    SELECT t.StateId, t.Id,

    rn = ROW_NUMBER() OVER (ORDER BY t.StateId, t.Id)

    FROM dbo.Data t

    JOIN RecursiveCTE ri

    ON ri.StateId < t.StateId

    ) r

    WHERE r.rn = 1

    )

    SELECT r.StateId, r.Id, x.Id

    FROM RecursiveCTE r

    CROSS APPLY (

    SELECT TOP 1 Id

    FROM dbo.Data t

    WHERE t.StateId = r.StateId

    ORDER BY Id DESC

    ) x

    OPTION (MAXRECURSION 0);

    Note that it uses a scan to determine the anchor row of the recursive CTE, the remaining rows are returned using seeks. On our little play server it runs in about 450ms, ten times faster than the original query.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • _simon_ (12/30/2014)


    @ScottPletcher - I tried it now and it seems that both clustered and nonclustered indexes are running for about the same time.

    Keep in mind that using the clustered index also means you don't have to store all the data twice, as you do with a nonclustered index over a heap or other cluster.

    I should have included ID in the index, because of the queries being run. That's the best times you can reasonably expect without creating customized, covering non-clustered index(es), and I'm not sure those will really help much either.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply