SQLServerCentral Article

Returning the Top X row for each group (SQL Spackle)

,

In our table we have a list of runners, the time the course was completed in and the runner's age. Our task is to retrieve the fastest X runners in each age range.

CREATE TABLE #Runners
(
Runner integer NOT NULL,
Time integer NOT NULL,
Age integer NOT NULL
)
INSERT INTO #Runners
SELECT 1 , 10 , 20 UNION ALL
SELECT 2 , 15 , 20 UNION ALL
SELECT 3 , 11 , 20 UNION ALL
SELECT 4 , 12 , 30 UNION ALL
SELECT 5 , 18 , 30 UNION ALL
SELECT 6 , 9 , 40 UNION ALL
SELECT 7 , 16 , 40 UNION ALL
SELECT 8 , 13 , 30

By far the simplest way is to use a combination of the ranking functions ( Specifically row_number () ) and CTE 's (Common Table Expressions). This may not always be the most efficient, but I'll come back to that point later.

Our first task is to ascertain the order that the runners completed the course in. If we were interested in only the overall, age range aside, order we could simply

Select * 
from #Runners 
order by Time 

and then by adding then TOP clause we can filter the Top X (assuming 2 here)

Select top(2) * 
from #Runners 
order by Time 

However, we do need the TOP X for each age category. As mentioned above we will be using the row_number() ranking function to help us do this. If we were to execute

select * ,row_number() over (order by Time ) as RowN 
from #Runners 
order by Rown 
Runner      Time        Age         RowN
----------- ----------- ----------- --------------------
6           9           40          1
1           10          20          2
3           11          20          3
4           12          30          4
8           13          30          5
2           15          20          6
7           16          40          7
5           18          30          8

We get effectively the same output as before but now with an added column, which is an incrementing integer in the order of time. If there were ties and we wanted each of them treated the same then we would use the DENSE_RANK function rather than ROW_NUMBER. In some systems if there is a tie, for example two runners finished in exactly the same time for first place, then there is no following position. That means no second place runner and we skip to third. We would then use the RANK function. In our example, however, we are going to ignore the possibility of ties.

An extension of the Over clause is "PARTITION BY", which will re-initialize the counter for each group specified with it. This is usable for all the Ranking functions and pre-existing aggregate functions. PARTITION BY is very similar to using a GROUP BY clause except now we are able to have a different clause for each ranking or aggregate function used. To see this is action

select * ,row_number() over (partition by Age order by Time ) as RowN 
from #Runners 
order by Age,Rown 

So now RowN is an incrementing counter ordered by time for each distinct value within the age column. This is the column that we will now need to filter upon to find our TOP X runners. In an ideal world we would be able to execute:

select * ,row_number() over (partition by Age order by Time ) as RowN 
from #Runners 
where row_number() over (partition by Age order by Time ) <=2 
order by Age,Rown 

However we cannot, so the currently suggested method is to solve this by using a CTE. This will enable us to now filter on the RowN Column.

with cteRunners 
as 
( 
select * ,row_number() over (partition by Age order by Time ) as RowN 
from #Runners 
) 
Select * from cteRunners 
where RowN <=2 
order by Age,Rown 

Itzik Ben-Gan has proposed using a QUALIFY clause as detailed here to resolve this issue.

The caveat

Although this will work perfectly fine, performance over a large number of rows can be relatively poor as the engine will have to read every row in the table and for those rows, evaluate the row number function so that it can be filtered upon. This can be quite a waste of resources if we only require a small amount of the total number of rows. To prove the point, first we need to generate some test data,

IF OBJECT_ID('tempdb..#RunnersBig') IS NOT NULL drop table #RunnersBig 
go 
Create Table #RunnersBig 
( 
RunnerId integer identity , 
Time integer not null, 
Age integer not null 
) 
go 
insert into #runnersbig ( Time , Age ) 
select top 1000000 ABS ( checksum ( newid ()))% 1000 , 
ABS ( checksum ( newid ()))% 99 
from sys . columns a cross join sys . columns b cross join sys . columns c 
go 
create index idxrunnersbig on #runnersbig ( age , time ) include ( runnerid ) 

This will generate a table with one million rows of randomized values and then create an index upon it. Our alternative approach will be to use the CROSS APPLY Operator and a tally table is used as our 'LEFT' table.

with cteN 
as 
( 
select number from master .. spt_values 
where type = 'p' and number between 0 and 100 
) 
Select * 
from cteN cross apply ( Select top ( 2 ) * from #RunnersBig where #RunnersBig . Age = cteN . number order by Time ) as runners 
order by cteN . number , runners . Time 

When compared to this filtered row_number query in Profiler:

with cteRunners 
as 
( 
select * , row_number () over ( partition by Age order by Time ) as RowN 
from #RunnersBig 
) 
Select * from cteRunners 
where RowN <= 2 
order by Age , Rown 
go 

We will find that the filtered version executes in ~463ms whereas the CROSS APPLY a mere 1 ms YMMV.

Rate

4.63 (72)

You rated this post out of 5. Change rating

Share

Share

Rate

4.63 (72)

You rated this post out of 5. Change rating