In our table we have a list of runners, the time the course was completed in and the runner's age. Our task is to retrieve the fastest X runners in each age range.
CREATE TABLE #Runners ( Runner integer NOT NULL, Time integer NOT NULL, Age integer NOT NULL ) INSERT INTO #Runners SELECT 1 , 10 , 20 UNION ALL SELECT 2 , 15 , 20 UNION ALL SELECT 3 , 11 , 20 UNION ALL SELECT 4 , 12 , 30 UNION ALL SELECT 5 , 18 , 30 UNION ALL SELECT 6 , 9 , 40 UNION ALL SELECT 7 , 16 , 40 UNION ALL SELECT 8 , 13 , 30
By far the simplest way is to use a combination of the ranking functions ( Specifically row_number () ) and CTE 's (Common Table Expressions). This may not always be the most efficient, but I'll come back to that point later.
Our first task is to ascertain the order that the runners completed the course in. If we were interested in only the overall, age range aside, order we could simply
Select * from #Runners order by Time
and then by adding then TOP clause we can filter the Top X (assuming 2 here)
Select top(2) * from #Runners order by Time
However, we do need the TOP X for each age category. As mentioned above we will be using the row_number() ranking function to help us do this. If we were to execute
select * ,row_number() over (order by Time ) as RowN from #Runners order by Rown Runner Time Age RowN ----------- ----------- ----------- -------------------- 6 9 40 1 1 10 20 2 3 11 20 3 4 12 30 4 8 13 30 5 2 15 20 6 7 16 40 7 5 18 30 8
We get effectively the same output as before but now with an added column, which is an incrementing integer in the order of time. If there were ties and we wanted each of them treated the same then we would use the DENSE_RANK function rather than ROW_NUMBER. In some systems if there is a tie, for example two runners finished in exactly the same time for first place, then there is no following position. That means no second place runner and we skip to third. We would then use the RANK function. In our example, however, we are going to ignore the possibility of ties.
An extension of the Over clause is "PARTITION BY", which will re-initialize the counter for each group specified with it. This is usable for all the Ranking functions and pre-existing aggregate functions. PARTITION BY is very similar to using a GROUP BY clause except now we are able to have a different clause for each ranking or aggregate function used. To see this is action
select * ,row_number() over (partition by Age order by Time ) as RowN from #Runners order by Age,Rown
So now RowN is an incrementing counter ordered by time for each distinct value within the age column. This is the column that we will now need to filter upon to find our TOP X runners. In an ideal world we would be able to execute:
select * ,row_number() over (partition by Age order by Time ) as RowN from #Runners where row_number() over (partition by Age order by Time ) <=2 order by Age,Rown
However we cannot, so the currently suggested method is to solve this by using a CTE. This will enable us to now filter on the RowN Column.
with cteRunners as ( select * ,row_number() over (partition by Age order by Time ) as RowN from #Runners ) Select * from cteRunners where RowN <=2 order by Age,Rown
Itzik Ben-Gan has proposed using a QUALIFY clause as detailed here to resolve this issue.
The caveat
Although this will work perfectly fine, performance over a large number of rows can be relatively poor as the engine will have to read every row in the table and for those rows, evaluate the row number function so that it can be filtered upon. This can be quite a waste of resources if we only require a small amount of the total number of rows. To prove the point, first we need to generate some test data,
IF OBJECT_ID('tempdb..#RunnersBig') IS NOT NULL drop table #RunnersBig go Create Table #RunnersBig ( RunnerId integer identity , Time integer not null, Age integer not null ) go insert into #runnersbig ( Time , Age ) select top 1000000 ABS ( checksum ( newid ()))% 1000 , ABS ( checksum ( newid ()))% 99 from sys . columns a cross join sys . columns b cross join sys . columns c go create index idxrunnersbig on #runnersbig ( age , time ) include ( runnerid )
This will generate a table with one million rows of randomized values and then create an index upon it. Our alternative approach will be to use the CROSS APPLY Operator and a tally table is used as our 'LEFT' table.
with cteN as ( select number from master .. spt_values where type = 'p' and number between 0 and 100 ) Select * from cteN cross apply ( Select top ( 2 ) * from #RunnersBig where #RunnersBig . Age = cteN . number order by Time ) as runners order by cteN . number , runners . Time
When compared to this filtered row_number query in Profiler:
with cteRunners as ( select * , row_number () over ( partition by Age order by Time ) as RowN from #RunnersBig ) Select * from cteRunners where RowN <= 2 order by Age , Rown go
We will find that the filtered version executes in ~463ms whereas the CROSS APPLY a mere 1 ms YMMV.