Top 1000 of each group in a table

Question

Top 1000 of each group in a table

Gabriel P

SSCarpal Tunnel

Points: 4289
More actions
March 27, 2009 at 11:53 am

#207910

I have a situation where I need to find the top 10 entries per month for the last 5 years. One way to do it is to create 60 different select top 10...order by statements all combined by union all, but I'm guessing there's a more elegant way to do the same thing.
Here is some code to give an idea of what data I'm working with:
drop table #temp;
drop table #nums;
with t1 as (select 1 n union all select 1 n),
t2 as (select 1 n from t1 a, t1 b),
t3 as (select 1 n from t2 a, t2 b),
t4 as (select 1 n from t3 a, t3 b),
t5 as (select 1 n from t4 a, t4 b),
t6 as (select 1 n from t5 a, t5 b),
t7 as (select 1 n from t6 a, t6 b),
t8 as (select top 1000000 ROW_NUMBER() over (order by (select 1)) n from t7)
select CAST(n as int) n
into #nums
from t8
select CAST(n as int) record_id,
Cast(n % 2048 as int) sales_rep_id,
Cast(
Cast(Left(NEWID(),4) as varbinary(8)) as int
) % 384 sales_quantity,
DATEADD(n,0-Cast(Cast(Left(NewID(),4) as varbinary(4)) as int)%2628000,GetDate()) sale_datetime
into #temp
from #nums
--
create clustered index ix_ix on #temp(sale_datetime)
create nonclustered index ix_ix2 on #temp (sales_quantity) include (sales_rep_id)
I want to get the top 10 records (order by sale_quantity desc) by month & year (3/2009, 2/2009, 1/2009, 12/2008, 11/2008, etc).

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

If there are ties, do you want all the people, or just the top 10 rows.

For example, if in December 2008, the top 9 had a sales volume of 500, and then there was a six-place tie for 10th, with 499 each, would you want 10 rows, and too-bad-so-sad for the random 5 who are left off, or do you want 15 rows? What if there's a 5-way tie for 1st place in a month? Do you want 10 rows, or 15?

Also, if a sales_rep_id has more than one entry in a month, should they be summed up, or just take the highest number?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Maxim Picard SSCrazy Points: 2052 More actions · Answer 2

-- EDIT: I just saw Gsquared's post, did not mean to double answer. My solution does not take into consideration ties or if a rep is there more than once. I can rework the query but it should be a good starting point.

Top 10 sales per month

;WITH SalesGroupNumbered

(SalesYear,SalesMonth,SalesRankInMonth,SalesId,SalesRepId,SalesQuantity,SalesDateTime)

AS

(SELECT

YEAR(sale_datetime)

,MONTH(sale_datetime)

,ROW_NUMBER() over (PARTITION BY YEAR(sale_datetime),MONTH(sale_datetime)

ORDER BY sale_datetime ASC, sales_quantity DESC) GroupRowId

,*

FROM #temp)

SELECT *

FROM SalesGroupNumbered

WHERE SalesRankInMonth < 11

ORDER BY SalesYear,SalesMonth,SalesRankInMonth

GSquared SSC Guru Points: 260824 More actions · Answer 3

Yeah, I'm looking at the same type of solution, but wanted to determine if it should be Row_Number, Rank, or Dense_Rank.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Gabriel P SSCarpal Tunnel Points: 4289 More actions · Answer 4

GSquared, I want to return 10 regardless if their are ties. The data set I gave you isn't realistic and has a lot more ties, but for the actual data, there very rarely will be any ties.

We already are using Rank but it is a big table, and proves to be really slow. Selecting by each month (see query below) is actually faster than using Rank(). But you can see from the code why I don't want to use this approach. Compare the IO stats of each of the queries and you'll see why:

Rank():

SET STATISTICS IO ON

SET STATISTICS TIME ON

select * from

(select Row_Number() OVER (partition by DateAdd(m,DateDiff(m,0,sale_datetime),0) ORDER BY sales_quantity desc) [rank],

t1.*

from #temp t1) subtable

where [rank] <= 10

Versus:

select * from

(select top 10 *

from #temp

where sale_datetime >= '3/1/2009' and sale_datetime < '4/1/2009'

order by sales_quantity desc) t1

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '2/1/2009' and sale_datetime < '3/1/2009'

order by sales_quantity desc) t2

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '1/1/2009' and sale_datetime < '2/1/2009'

order by sales_quantity desc) t3

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '12/1/2008' and sale_datetime < '1/1/2009'

order by sales_quantity desc) t4

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '11/1/2008' and sale_datetime < '12/1/2008'

order by sales_quantity desc) t5

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '10/1/2008' and sale_datetime < '11/1/2008'

order by sales_quantity desc) t6

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '9/1/2008' and sale_datetime < '10/1/2008'

order by sales_quantity desc) t7

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '8/1/2008' and sale_datetime < '9/1/2008'

order by sales_quantity desc) t8

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '7/1/2008' and sale_datetime < '8/1/2008'

order by sales_quantity desc) t9

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '6/1/2008' and sale_datetime < '7/1/2008'

order by sales_quantity desc) t10

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '5/1/2008' and sale_datetime < '6/1/2008'

order by sales_quantity desc) t11

UNION ALL

select * from

(select top 10 *

from #temp

where sale_datetime >= '4/1/2008' and sale_datetime < '5/1/2008'

order by sales_quantity desc) t12

(I'm not going to union all 60 months, but I think that's enough to show you which one performs better)

Gabriel P SSCarpal Tunnel Points: 4289 More actions · Answer 5

FYI I just realized I put top 1000 in the title. Please ignore that, it is misleading... I really meant top 10

Florian Reischl SSC-Dedicated Points: 37301 More actions · Answer 6

Hi Gabriel

So if the performance of your UNION is well, I would just avoid the copy-pasting:

DECLARE @month DATETIME

SET @month = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) + 1, 0)

DECLARE @sql VARCHAR(MAX)

SELECT @sql = (SELECT CASE WHEN N = 1 THEN '' ELSE 'UNION ' END +

' SELECT TOP 10 * ' +

' FROM #temp t' + CONVERT(VARCHAR(2), N) +

' WHERE sale_date >= ''' + CONVERT(VARCHAR(40), DATEADD(MONTH, (N - 1) * -1, @month), 127) + '''' +

' AND sale_date < ''' + CONVERT(VARCHAR(40), DATEADD(MONTH, N * -1, @month), 127) + '''' +

CHAR(10)

FROM Tally

WHERE N <= 60 FOR XML PATH(''))

SELECT @sql = REPLACE(REPLACE(@sql, '<', ' ')

PRINT @sql

Gabriel P SSCarpal Tunnel Points: 4289 More actions · Answer 7

I forgot to mention another thing, dynamic sql is strong opposed in this environment :angry:

GSquared SSC Guru Points: 260824 More actions · Answer 8

If hard-coding the date ranges in there works, then yes, that definitely outperforms the Rank/Row_Number version.

An alternative to hard-coding it would be something like this:

declare @sql varchar(max);

select @sql = coalesce(@SQL + 'select * from (select top 10 *

from #temp

where sale_datetime >= ''' + cast(dateadd(month, number, '3/1/2004') as varchar(100)) + ''' and sale_datetime < ''' + cast(dateadd(month, number+1, '3/1/2004') as varchar(100)) + '''

order by sales_quantity desc) X union all

',

'select * from (select top 10 *

from #temp

where sale_datetime >= ''' + cast(dateadd(month, number, '3/1/2004') as varchar(100)) + ''' and sale_datetime < ''' + cast(dateadd(month, number+1, '3/1/2004') as varchar(100)) + '''

order by sales_quantity desc) X union all

', @sql)

from dbo.Numbers

where number between 0 and 60;

select @sql = reverse(stuff(reverse(@SQL), 1, 12, ';')); -- gets rid of final Union All

print @sql; -- For debugging

exec (@SQL);

With that, it'll build all the Union All queries for you. You can also set the date range by using an input parameter instead of the date I hard-coded into it.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

GSquared SSC Guru Points: 260824 More actions · Answer 9

Ah. Didn't see the prohibition against dynamic SQL. In that case, even though it performs slightly worse, I'd go with the Row_Number version, because otherwise you're going to have to do something really screwy to build your query each time it needs to be used.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Gabriel P SSCarpal Tunnel Points: 4289 More actions · Answer 10

I figured it out in case anyone is interested.

declare @first_month datetime, @last_month datetime

select @first_month = DATEADD(m,DateDiff(m,0,Min(sale_datetime)),0),

@last_month = DATEADD(m,DateDiff(m,0,Max(sale_datetime)),0)

from #temp

select n.n, t1.*

from #nums n

cross apply (select top 10 *

from #temp

where sale_datetime >= DATEADD(m,n.n-1,@first_month)

and sale_datetime < DATEADD(m,n.n,@first_month)

order by sales_quantity desc) t1

where n.n <= DATEDIFF(m,@first_month,@last_month)

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 11

Nice solution, Gabriel. Thanks for posting it up.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 12

Gabriel P (3/27/2009)
I figured it out in case anyone is interested.
declare @first_month datetime, @last_month datetime
select @first_month = DATEADD(m,DateDiff(m,0,Min(sale_datetime)),0),
@last_month = DATEADD(m,DateDiff(m,0,Max(sale_datetime)),0)
from #temp
select n.n, t1.*
from #nums n
cross apply (select top 10 *
from #temp
where sale_datetime >= DATEADD(m,n.n-1,@first_month)
and sale_datetime < DATEADD(m,n.n,@first_month)
order by sales_quantity desc) t1
where n.n <= DATEDIFF(m,@first_month,@last_month)

With your test data and the code above, that returns 590 rows... if you want 5 years worth, that's 60 months and multiply that by 10 entries, you should be getting 600 rows returned. I'm thinking something isn't quite right... and, yes, I'm taking a look to see what.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 13

Just add +1 to the following and you'll get your 60 months...

where n.n <= DATEDIFF(m,@first_month,@last_month)[font="Arial Black"]+1[/font]

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

tfifield SSCrazy Eights Points: 9655 More actions · Answer 14

I wouldn't use a numbers table on this type of query. I'd rather use a properly indexed calendar type table. I've found that they are very useful for many applications and you don't have to use any sort of calculated dates in a WHERE clause, thus avoiding not being able to use any indexes on date columns.

Here's an example of a simple month type calendar table going from 1/1/2000 to 300 months later:

CREATE TABLE CalMonth

( MonthID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED

, MonthStart DATETIME

, NextMonth DATETIME

, MonthDescr CHAR(6)

)

CREATE INDEX CalMonth_Dates ON CalMonth (MonthStart, NextMonth)

-- Populate the calendar table.

;WITH TallyCTE AS (

SELECT 0 AS MonthAdd

UNION ALL

SELECT TOP 300 ROW_NUMBER() OVER (ORDER BY name)

FROM sys.syscolumns

)

INSERT INTO CalMonth

( MonthStart, NextMonth )

SELECT

DATEADD(month, T.MonthAdd, '1/1/2000')

, DATEADD(Month, T.MonthAdd, '2/1/2000')

FROM TallyCTE T

-- Fill in the month descriptions

UPDATE CalMonth

SET MonthDescr = LEFT(DATENAME(month, MonthStart), 3)

+ '-'

+ RIGHT(CONVERT(VARCHAR, DATEPART(year, MonthStart)), 2)

Here's your query using the current month and 59 months prior to that. There's no date calculations at all:

DECLARE

@FirstMonthINT

, @LastMonthINT

SELECT @LastMonth = MonthID FROM CalMonth

WHERE GETDATE() >= MonthStart

AND GETDATE() < NextMonth

SET @FirstMonth = @LastMonth - 59

SELECT CM.MonthID, CM.MonthDescr AS SalesMonth

, T1.*

FROM CalMonth CM

CROSS APPLY

(SELECT TOP 10 * FROM #temp

WHERE sale_datetime >= CM.MonthStart

AND sale_datetime < CM.NextMonth

ORDER BY #temp.sales_quantity DESC

) AS T1

WHERE

CM.MonthID BETWEEN @FirstMonth And @LastMonth

Put your own ORDER BY at the bottom.

Todd Fifield