Indexing tips for MIN() and MAX() in queries

  • Hi,

    Are there any best practices for indexing to support queries with MIN() and MAX() in them?

    Bonus question: what if MIN() and MAX() are partitioned?

    Super bonus question: what if MIN() and MAX() are not only partitioned, but are called on a field in a derived table, and one of the partitioning elements comes from a table that's being joined in the derived table?
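
    Roughly, the shape looks like this (every name here is made up, just to illustrate):

    -- Windowed MIN() on a column from a derived table, where one of the
    -- partitioning columns comes from a table joined inside that derived table.
    SELECT d.SomeID,
           MIN(d.SomeID) OVER (PARTITION BY d.GroupID, d.RegionID) AS MinID
    FROM (
            SELECT t.SomeID, t.GroupID, j.RegionID   -- RegionID comes from the joined table
            FROM dbo.BigTable AS t
            JOIN dbo.OtherTable AS j
                ON j.JoinKey = t.JoinKey
         ) AS d;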

    Kind of makes the usual POC index impossible.

    I experimented with inserting the derived table into a temp table, putting a POC index on that, and querying out, but that actually took longer.

    An anonymized plan is attached; that's the best I can do in public. If anyone wants the actual plan, I will provide it privately.

    Thanks

    It is my understanding that an aggregate function has to check all values in the column, so an index on that attribute may not help too much.

    Now, I'm not sure how the optimizer will behave if the index is a covering one. I guess it depends on how many rows you have and their cardinality.

    For instance...

    SELECT CustID, SUM([Order]) AS Orders
    FROM Sales
    GROUP BY CustID

    A covering index on CustID and [Order] may improve performance, depending on how many rows we have, whether there is a PK or not, and on what column(s).

    You should put an index on the filter, though, which is the WHERE clause, if there is one.

    I guess your best bet is testing and checking actual execution plans.
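
    Something along these lines, just as a sketch (the table and column names are only for illustration):

    -- Hypothetical covering index for the example above: keyed on the GROUP BY
    -- column, with the aggregated column included so the query is covered.
    CREATE NONCLUSTERED INDEX IX_Sales_CustID
    ON Sales (CustID)
    INCLUDE ([Order]);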

  • sql-lover (6/6/2014)


    You should put an index on the filter, though, which is the WHERE clause, if there is one.

    It's awesome, there is no WHERE clause! It's just tossing millions of rows around all willy-nilly!

    I do index the columns you mentioned; it's just still resulting in expensive-ish sorts that spill into tempdb. You can see them in the last section of the execution plan I posted.

  • Actually MIN() and MAX() could use an index, even a seek, if the index was keyed on all the GROUP BY columns in the proper order.
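
    For example, a minimal sketch against the hypothetical Sales table used above:

    -- Hypothetical index keyed on the GROUP BY column first, then the aggregated
    -- column, so the MIN()/MAX() per group can come straight off the index order.
    CREATE NONCLUSTERED INDEX IX_Sales_CustID_Order
    ON Sales (CustID, [Order]);

    -- With that index in place, a query like this can often be answered with an
    -- ordered scan (or a seek when filtering to one CustID) and a stream
    -- aggregate, with no sort operator:
    SELECT CustID, MIN([Order]) AS MinOrder, MAX([Order]) AS MaxOrder
    FROM Sales
    GROUP BY CustID;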


  • sqldriver (6/6/2014)


    It's awesome, there is no WHERE clause! It's just tossing millions of rows around all willy-nilly!

    Well, that was just an example, my friend, lol ...

    Like I said, depending on how many rows there are, the optimizer will do a scan anyway in order to perform the aggregate.

    One option, if you have Enterprise Edition, is partitioning, but I think you need to pick the aggregate column as the partitioning column. Testing with your actual data and schema may confirm this one way or the other.

    Another workaround, which may improve performance in some cases, is using window functions, but the end result will depend on your data. In some cases they perform better than normal aggregates, and their implementation is better in SQL Server 2012.
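
    Roughly the difference, using the earlier hypothetical Sales table:

    -- GROUP BY form: one row per customer.
    SELECT CustID, MIN([Order]) AS MinOrder
    FROM Sales
    GROUP BY CustID;

    -- Windowed form (available since SQL Server 2005; 2012 added better framing
    -- options): the per-customer minimum is returned on every detail row, which
    -- avoids a separate join back to the detail when those rows are needed anyway.
    SELECT CustID, [Order],
           MIN([Order]) OVER (PARTITION BY CustID) AS MinOrder
    FROM Sales;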

  • sqldriver (6/6/2014)



    In 643 lines of T-SQL, there are two GROUP BYs, and neither is where MIN() is called. I can go to the devs and ask whether one can be used, though.

    Thanks.

  • sql-lover (6/6/2014)


    Another workaround, which may improve performance in some cases, is using window functions, but the end result will depend on your data. In some cases they perform better than normal aggregates, and their implementation is better in SQL Server 2012.

    MIN() is windowed in this instance. I got things a little mixed up because this production server is running 2012 - it's the only one so far. It's not the scan I'm worried about so much as the spills into tempdb. MIN() gets called six times, and each time it spills into tempdb. I was hoping there was some indexing I'm missing that could help with that.
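
    For the record, this is the general shape of index I've been trying to get to work for the windowed MIN() (all names made up):

    -- POC-style index sketch for a windowed aggregate:
    --   key columns = the PARTITION BY columns of the OVER() clause
    --                 (plus any ORDER BY columns, if the window had them)
    --   INCLUDE     = the column MIN() is applied to and anything else referenced,
    --                 so the rows can arrive pre-sorted and the sort (and its
    --                 spill) may disappear.
    CREATE NONCLUSTERED INDEX IX_POC_Sketch
    ON dbo.SomeTable (PartCol1, PartCol2)
    INCLUDE (SomeID, OtherCol);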

    Thanks

  • sqldriver (6/6/2014)


    It's awesome, there is no WHERE clause! It's just tossing millions of rows around all willy-nilly!

    Looks like a valid query to me: for all customers, show me the sum of all their orders. No WHERE clause is even implied in the requirement.

  • Lynn Pettis (6/6/2014)


    Looks like a valid query to me: for all customers, show me the sum of all their orders. No WHERE clause is even implied in the requirement.

    Typically you'd specify a time frame, though. Otherwise it's very hard to get any real meaning from the results.


  • ScottPletcher (6/6/2014)


    Typically you'd specify a time frame, though. Otherwise it's very hard to get any real meaning from the results.

    You might have a time requirement on it, but then again you might not. It depends on the question being asked. I don't see anything wrong with the question I asked above. Simple question: show me all my customers and the sum of all their orders.

    This query isn't actually returning any data to a user; it's part of an archival process, and a @BatchID gets specified rather than a range to query. Batches can range from 2 to 11 million rows (from what I've seen), and MIN() is getting called on an INT ID column (not a PK or FK), partitioned by two other ID columns and two hash columns, depending on some CASE logic.

    I tried to put all the CASE logic in a CROSS APPLY, thinking the encapsulation might help, but it took on average 10x as long to run. Oops. :ermm:
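
    The CROSS APPLY experiment looked roughly like this (all names invented, just to show the shape):

    -- The CASE logic that picks the partitioning values is moved into an APPLY,
    -- so the window function can reference simple column names.
    SELECT s.BatchID,
           MIN(s.SomeID) OVER (PARTITION BY x.PartID1, x.PartID2, x.Hash1, x.Hash2) AS MinID
    FROM dbo.ArchiveSource AS s
    CROSS APPLY (
        SELECT PartID1 = CASE WHEN s.SourceType = 1 THEN s.IdA ELSE s.IdB END,
               PartID2 = s.OtherID,
               Hash1   = s.HashCol1,
               Hash2   = s.HashCol2
    ) AS x
    WHERE s.BatchID = @BatchID;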

    Sometimes a little test helps to clear things up. The indexing does seem to help, so the trade-off is whether the improvement is enough to offset the cost of creating and maintaining the index. A lot of the help here is due to using MIN() or MAX(), as opposed to the other aggregate functions.

    Test setup:

    if object_id('T1') is not null drop table T1;
    create table T1 (id int identity(1,1), gval int, ival int);

    -- 10 million rows: gval cycles through 25 grouping values, ival is a random integer
    insert into T1 (gval, ival)
    select top 10000000 ROW_NUMBER() over (order by (select null)) % 25, CHECKSUM(newid())
    from sys.columns sc1 cross join sys.columns sc2 cross join sys.columns sc3;

    if object_id('T2') is not null drop table T2;
    create table T2 (id int identity(1,1), gval int, ival int);

    if object_id('T3') is not null drop table T3;
    create table T3 (id int identity(1,1), gval int, ival int);

    insert into T2 (gval, ival) select gval, ival from T1;
    create index groupingshort on T2 (gval);            -- just the grouping level

    insert into T3 (gval, ival) select gval, ival from T1;
    create index groupinglonger on T3 (gval, ival);     -- grouping level + aggregated column

    T1 is a heap with no helper indexes at all.

    T2 is a heap with an index on just the grouping column.

    T3 is a heap with an index on the grouping column plus the aggregated column.

    Scenario test:

    set statistics time ON

    set statistics IO ON

    select gval, MIN(ival)--, MAX(ival)

    from t1 group by gval

    select gval, MIN(ival)--, MAX(ival)

    from t2 group by gval

    select gval, MIN(ival)--, MAX(ival)

    from t3 group by gval

    set statistics IO OFF

    set statistics time OFF

    Results

    (25 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'T1'. Scan count 9, logical reads 25975, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 4726 ms, elapsed time = 727 ms.

    (25 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'T2'. Scan count 9, logical reads 25975, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 5320 ms, elapsed time = 744 ms.

    (25 row(s) affected)

    Table 'T3'. Scan count 9, logical reads 28184, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 3042 ms, elapsed time = 459 ms.

    Note: the query against T3 uses an index scan; the other two use table scans.


  • Matt Miller (#4) (6/6/2014)



    After reading your post, and stumbling on this one, I gave the query a go without any indexes and replaced as many of the INSERT INTO ... SELECT statements as I could with SELECT ... INTO. It did improve the query time by about 20 seconds on average, bringing it down to just about 4 minutes.
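
    For reference, the swap was along these lines (names made up):

    -- Before: insert into a pre-created work table.
    CREATE TABLE #Work (SomeID int, GroupID int);
    INSERT INTO #Work (SomeID, GroupID)
    SELECT SomeID, GroupID
    FROM dbo.BigTable
    WHERE BatchID = @BatchID;

    -- After: let SELECT ... INTO create the table on the fly; it can be
    -- minimally logged, which seems to be where the time savings come from.
    SELECT SomeID, GroupID
    INTO #Work2
    FROM dbo.BigTable
    WHERE BatchID = @BatchID;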

    Thanks to everyone for your time
