Indexing tips for MIN() and MAX() in queries

  • Hi,

    Are there any best practices for indexing to support queries with MIN() and MAX() in them?

    Bonus question: what if MIN() and MAX() are partitioned?

    Super bonus question: what if MIN() and MAX() are not only partitioned, but are called on a field in a derived table, and one of the partitioning elements comes from a table that's being joined in the derived table?
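
    Roughly, the shape looks like this (every name here is made up, just to illustrate):

    -- Windowed MIN() on a column from a derived table, where one of the
    -- partitioning columns comes from a table joined inside that derived table.
    SELECT d.SomeID,
           MIN(d.SomeID) OVER (PARTITION BY d.GroupID, d.RegionID) AS MinID
    FROM (
            SELECT t.SomeID, t.GroupID, j.RegionID   -- RegionID comes from the joined table
            FROM dbo.BigTable AS t
            JOIN dbo.OtherTable AS j
                ON j.JoinKey = t.JoinKey
         ) AS d;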

    Kind of makes the usual POC index impossible.

    I experimented with inserting the derived table into a temp table, putting a POC index on that, and querying out, but that actually took longer.

    An anonymized plan is attached; that's the best I can do in public. If anyone wants the actual plan, I will provide it privately.

    Thanks

    It is my understanding that an aggregate function has to check all values in the column, so an index on that attribute may not help too much.

    Now, I'm not sure how the optimizer will behave if the index is a covering one. I guess it depends on how many rows you have and their cardinality.

    For instance...

    SELECT CustID, SUM([Order]) AS Orders
    FROM Sales
    GROUP BY CustID

    A covering index on CustID and [Order] may improve performance, depending on how many rows we have, whether there is a PK or not, and on what column(s).

    You should put an index on the filter, though, which is the WHERE clause, if there is one.

    I guess your best bet is testing and checking actual execution plans.
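
    Something along these lines, just as a sketch (the table and column names are only for illustration):

    -- Hypothetical covering index for the example above: keyed on the GROUP BY
    -- column, with the aggregated column included so the query is covered.
    CREATE NONCLUSTERED INDEX IX_Sales_CustID
    ON Sales (CustID)
    INCLUDE ([Order]);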

  • sql-lover (6/6/2014)


    You should put an index on the filter, though, which is the WHERE clause, if there is one.

    It's awesome, there is no WHERE clause! It's just tossing millions of rows around all willy-nilly!

    I do index the columns you mentioned; it's just still resulting in expensive-ish sorts that spill into tempdb. You can see them in the last section of the execution plan I posted.

  • Actually MIN() and MAX() could use an index, even a seek, if the index was keyed on all the GROUP BY columns in the proper order.
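
    For example, a minimal sketch against the hypothetical Sales table used above:

    -- Hypothetical index keyed on the GROUP BY column first, then the aggregated
    -- column, so the MIN()/MAX() per group can come straight off the index order.
    CREATE NONCLUSTERED INDEX IX_Sales_CustID_Order
    ON Sales (CustID, [Order]);

    -- With that index in place, a query like this can often be answered with an
    -- ordered scan (or a seek when filtering to one CustID) and a stream
    -- aggregate, with no sort operator:
    SELECT CustID, MIN([Order]) AS MinOrder, MAX([Order]) AS MaxOrder
    FROM Sales
    GROUP BY CustID;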


  • sqldriver (6/6/2014)


    It's awesome, there is no WHERE clause! It's just tossing millions of rows around all willy-nilly!

    Well, that was just an example, my friend, lol ...

    Like I said, depending on how many rows there are, the optimizer will do a scan anyway in order to perform the aggregate.

    One option, if you have Enterprise Edition, is partitioning, but I think you need to pick the aggregate column as the partitioning column. Testing with your actual data and schema may confirm this one way or the other.

    Another workaround, which may improve performance in some cases, is using window functions, but the end result will depend on your data. In some cases they perform better than normal aggregates, and their implementation is better in SQL Server 2012.
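
    Roughly the difference, using the earlier hypothetical Sales table:

    -- GROUP BY form: one row per customer.
    SELECT CustID, MIN([Order]) AS MinOrder
    FROM Sales
    GROUP BY CustID;

    -- Windowed form (available since SQL Server 2005; 2012 added better framing
    -- options): the per-customer minimum is returned on every detail row, which
    -- avoids a separate join back to the detail when those rows are needed anyway.
    SELECT CustID, [Order],
           MIN([Order]) OVER (PARTITION BY CustID) AS MinOrder
    FROM Sales;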

  • sqldriver (6/6/2014)



    In 643 lines of T-SQL, there are two GROUP BYs, and neither is where MIN() is called. I can go to the devs and ask whether one can be used, though.

    Thanks.

  • sql-lover (6/6/2014)


    Another workaround, which may improve performance in some cases, is using window functions, but the end result will depend on your data. In some cases they perform better than normal aggregates, and their implementation is better in SQL Server 2012.

    MIN() is windowed in this instance. I got things a little mixed up because this production server is running 2012 - it's the only one so far. It's not the scan I'm worried about so much as the spills into tempdb. MIN() gets called six times, and each time it spills into tempdb. I was hoping there was some indexing I'm missing that could help with that.
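
    For the record, this is the general shape of index I've been trying to get to work for the windowed MIN() (all names made up):

    -- POC-style index sketch for a windowed aggregate:
    --   key columns = the PARTITION BY columns of the OVER() clause
    --                 (plus any ORDER BY columns, if the window had them)
    --   INCLUDE     = the column MIN() is applied to and anything else referenced,
    --                 so the rows can arrive pre-sorted and the sort (and its
    --                 spill) may disappear.
    CREATE NONCLUSTERED INDEX IX_POC_Sketch
    ON dbo.SomeTable (PartCol1, PartCol2)
    INCLUDE (SomeID, OtherCol);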

    Thanks

  • sqldriver (6/6/2014)


    It's awesome, there is no WHERE clause! It's just tossing millions of rows around all willy-nilly!

    Looks like a valid query to me: for all customers, show me the sum of all their orders. No WHERE clause is even implied in the requirement.

  • Lynn Pettis (6/6/2014)


    Looks like a valid query to me: for all customers, show me the sum of all their orders. No WHERE clause is even implied in the requirement.

    Typically you'd specify a time frame, though. Otherwise it's very hard to get any real meaning from the results.


  • ScottPletcher (6/6/2014)


    Typically you'd specify a time frame, though. Otherwise it's very hard to get any real meaning from the results.

    You might have a time requirement on it, but then again you might not. It depends on the question being asked. I don't see anything wrong with the question I asked above. Simple question: show me all my customers and the sum of all their orders.

    This query isn't actually returning any data to a user; it's part of an archival process, and a @BatchID gets specified rather than a range to query. Batches can range from 2 to 11 million rows (from what I've seen), and MIN() is getting called on an INT ID column (not a PK or FK), partitioned by two other ID columns and two hash columns, depending on some CASE logic.

    I tried to put all the CASE logic in a CROSS APPLY, thinking the encapsulation might help, but it took on average 10x as long to run. Oops. :ermm:
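
    The CROSS APPLY experiment looked roughly like this (all names invented, just to show the shape):

    -- The CASE logic that picks the partitioning values is moved into an APPLY,
    -- so the window function can reference simple column names.
    SELECT s.BatchID,
           MIN(s.SomeID) OVER (PARTITION BY x.PartID1, x.PartID2, x.Hash1, x.Hash2) AS MinID
    FROM dbo.ArchiveSource AS s
    CROSS APPLY (
        SELECT PartID1 = CASE WHEN s.SourceType = 1 THEN s.IdA ELSE s.IdB END,
               PartID2 = s.OtherID,
               Hash1   = s.HashCol1,
               Hash2   = s.HashCol2
    ) AS x
    WHERE s.BatchID = @BatchID;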

    Sometimes a little test helps to clear things up. The indexing does seem to help, so the trade-off is whether the improvement is enough to offset the cost of creating and maintaining the index. A lot of the help here is due to using MIN() or MAX(), as opposed to the other aggregate functions.

    Test setup:

    if object_id('T1') is not null drop table T1;
    create table T1 (id int identity(1,1), gval int, ival int);

    -- 10 million rows: gval cycles through 25 grouping values, ival is a random integer
    insert into T1 (gval, ival)
    select top 10000000 ROW_NUMBER() over (order by (select null)) % 25, CHECKSUM(newid())
    from sys.columns sc1 cross join sys.columns sc2 cross join sys.columns sc3;

    if object_id('T2') is not null drop table T2;
    create table T2 (id int identity(1,1), gval int, ival int);

    if object_id('T3') is not null drop table T3;
    create table T3 (id int identity(1,1), gval int, ival int);

    insert into T2 (gval, ival) select gval, ival from T1;
    create index groupingshort on T2 (gval);            -- just the grouping level

    insert into T3 (gval, ival) select gval, ival from T1;
    create index groupinglonger on T3 (gval, ival);     -- grouping level + aggregated column

    T1 is a heap with no helper indexes at all.

    T2 is a heap with an index on just the grouping column.

    T3 is a heap with an index on the grouping column plus the aggregated column.

    Scenario test:

    set statistics time ON

    set statistics IO ON

    select gval, MIN(ival)--, MAX(ival)

    from t1 group by gval

    select gval, MIN(ival)--, MAX(ival)

    from t2 group by gval

    select gval, MIN(ival)--, MAX(ival)

    from t3 group by gval

    set statistics IO OFF

    set statistics time OFF

    Results

    (25 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'T1'. Scan count 9, logical reads 25975, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 4726 ms, elapsed time = 727 ms.

    (25 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'T2'. Scan count 9, logical reads 25975, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 5320 ms, elapsed time = 744 ms.

    (25 row(s) affected)

    Table 'T3'. Scan count 9, logical reads 28184, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 3042 ms, elapsed time = 459 ms.

    Note: the query against T3 uses an index scan; the other two use table scans.


  • Matt Miller (#4) (6/6/2014)



    After reading your post, and stumbling on this one, I gave the query a go without any indexes and replaced as many of the INSERT INTO ... SELECT statements as I could with SELECT ... INTO. It did improve the query time by about 20 seconds on average, bringing it down to just about 4 minutes.
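
    For reference, the swap was along these lines (names made up):

    -- Before: insert into a pre-created work table.
    CREATE TABLE #Work (SomeID int, GroupID int);
    INSERT INTO #Work (SomeID, GroupID)
    SELECT SomeID, GroupID
    FROM dbo.BigTable
    WHERE BatchID = @BatchID;

    -- After: let SELECT ... INTO create the table on the fly; it can be
    -- minimally logged, which seems to be where the time savings come from.
    SELECT SomeID, GroupID
    INTO #Work2
    FROM dbo.BigTable
    WHERE BatchID = @BatchID;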

    Thanks to everyone for your time
