Top 1000 of each group in a table

  • I have a situation where I need to find the top 10 entries per month for the last 5 years. One way to do it is to create 60 different select top 10...order by statements all combined by union all, but I'm guessing there's a more elegant way to do the same thing.

    Here is some code to give an idea of what data I'm working with:

    drop table #temp;

    drop table #nums;

    with t1 as (select 1 n union all select 1 n),

    t2 as (select 1 n from t1 a, t1 b),

    t3 as (select 1 n from t2 a, t2 b),

    t4 as (select 1 n from t3 a, t3 b),

    t5 as (select 1 n from t4 a, t4 b),

    t6 as (select 1 n from t5 a, t5 b),

    t7 as (select 1 n from t6 a, t6 b),

    t8 as (select top 1000000 ROW_NUMBER() over (order by (select 1)) n from t7)

    select CAST(n as int) n

    into #nums

    from t8

    select CAST(n as int) record_id,

    Cast(n % 2048 as int) sales_rep_id,

    Cast(

    Cast(Left(NEWID(),4) as varbinary(8)) as int

    ) % 384 sales_quantity,

    DATEADD(n,0-Cast(Cast(Left(NewID(),4) as varbinary(4)) as int)%2628000,GetDate()) sale_datetime

    into #temp

    from #nums

    --

    create clustered index ix_ix on #temp(sale_datetime)

    create nonclustered index ix_ix2 on #temp (sales_quantity) include (sales_rep_id)

    I want to get the top 10 records (order by sale_quantity desc) by month & year (3/2009, 2/2009, 1/2009, 12/2008, 11/2008, etc).

  • If there are ties, do you want all the people, or just the top 10 rows.

    For example, if in December 2008, the top 9 had a sales volume of 500, and then there was a six-place tie for 10th, with 499 each, would you want 10 rows, and too-bad-so-sad for the random 5 who are left off, or do you want 15 rows? What if there's a 5-way tie for 1st place in a month? Do you want 10 rows, or 15?

    Also, if a sales_rep_id has more than one entry in a month, should they be summed up, or just take the highest number?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • -- EDIT: I just saw Gsquared's post, did not mean to double answer. My solution does not take into consideration ties or if a rep is there more than once. I can rework the query but it should be a good starting point.

    Top 10 sales per month

    ;WITH SalesGroupNumbered

    (SalesYear,SalesMonth,SalesRankInMonth,SalesId,SalesRepId,SalesQuantity,SalesDateTime)

    AS

    (SELECT

    YEAR(sale_datetime)

    ,MONTH(sale_datetime)

    ,ROW_NUMBER() over (PARTITION BY YEAR(sale_datetime),MONTH(sale_datetime)

    ORDER BY sale_datetime ASC, sales_quantity DESC) GroupRowId

    ,*

    FROM #temp)

    SELECT *

    FROM SalesGroupNumbered

    WHERE SalesRankInMonth < 11

    ORDER BY SalesYear,SalesMonth,SalesRankInMonth

  • Yeah, I'm looking at the same type of solution, but wanted to determine if it should be Row_Number, Rank, or Dense_Rank.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared, I want to return 10 regardless if their are ties. The data set I gave you isn't realistic and has a lot more ties, but for the actual data, there very rarely will be any ties.

    We already are using Rank but it is a big table, and proves to be really slow. Selecting by each month (see query below) is actually faster than using Rank(). But you can see from the code why I don't want to use this approach. Compare the IO stats of each of the queries and you'll see why:

    Rank():

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    select * from

    (select Row_Number() OVER (partition by DateAdd(m,DateDiff(m,0,sale_datetime),0) ORDER BY sales_quantity desc) [rank],

    t1.*

    from #temp t1) subtable

    where [rank] <= 10

    Versus:

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '3/1/2009' and sale_datetime < '4/1/2009'

    order by sales_quantity desc) t1

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '2/1/2009' and sale_datetime < '3/1/2009'

    order by sales_quantity desc) t2

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '1/1/2009' and sale_datetime < '2/1/2009'

    order by sales_quantity desc) t3

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '12/1/2008' and sale_datetime < '1/1/2009'

    order by sales_quantity desc) t4

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '11/1/2008' and sale_datetime < '12/1/2008'

    order by sales_quantity desc) t5

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '10/1/2008' and sale_datetime < '11/1/2008'

    order by sales_quantity desc) t6

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '9/1/2008' and sale_datetime < '10/1/2008'

    order by sales_quantity desc) t7

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '8/1/2008' and sale_datetime < '9/1/2008'

    order by sales_quantity desc) t8

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '7/1/2008' and sale_datetime < '8/1/2008'

    order by sales_quantity desc) t9

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '6/1/2008' and sale_datetime < '7/1/2008'

    order by sales_quantity desc) t10

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '5/1/2008' and sale_datetime < '6/1/2008'

    order by sales_quantity desc) t11

    UNION ALL

    select * from

    (select top 10 *

    from #temp

    where sale_datetime >= '4/1/2008' and sale_datetime < '5/1/2008'

    order by sales_quantity desc) t12

    (I'm not going to union all 60 months, but I think that's enough to show you which one performs better)

  • FYI I just realized I put top 1000 in the title. Please ignore that, it is misleading... I really meant top 10

  • Hi Gabriel

    So if the performance of your UNION is well, I would just avoid the copy-pasting:

    DECLARE @month DATETIME

    SET @month = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) + 1, 0)

    DECLARE @sql VARCHAR(MAX)

    SELECT @sql = (SELECT CASE WHEN N = 1 THEN '' ELSE 'UNION ' END +

    ' SELECT TOP 10 * ' +

    ' FROM #temp t' + CONVERT(VARCHAR(2), N) +

    ' WHERE sale_date >= ''' + CONVERT(VARCHAR(40), DATEADD(MONTH, (N - 1) * -1, @month), 127) + '''' +

    ' AND sale_date < ''' + CONVERT(VARCHAR(40), DATEADD(MONTH, N * -1, @month), 127) + '''' +

    CHAR(10)

    FROM Tally

    WHERE N <= 60 FOR XML PATH(''))

    SELECT @sql = REPLACE(REPLACE(@sql, '<', ' ')

    PRINT @sql

  • I forgot to mention another thing, dynamic sql is strong opposed in this environment :angry:

  • If hard-coding the date ranges in there works, then yes, that definitely outperforms the Rank/Row_Number version.

    An alternative to hard-coding it would be something like this:

    declare @SQL varchar(max);

    select @SQL = coalesce(@SQL + 'select * from (select top 10 *

    from #temp

    where sale_datetime >= ''' + cast(dateadd(month, number, '3/1/2004') as varchar(100)) + ''' and sale_datetime < ''' + cast(dateadd(month, number+1, '3/1/2004') as varchar(100)) + '''

    order by sales_quantity desc) X union all

    ',

    'select * from (select top 10 *

    from #temp

    where sale_datetime >= ''' + cast(dateadd(month, number, '3/1/2004') as varchar(100)) + ''' and sale_datetime < ''' + cast(dateadd(month, number+1, '3/1/2004') as varchar(100)) + '''

    order by sales_quantity desc) X union all

    ', @SQL)

    from dbo.Numbers

    where number between 0 and 60;

    select @SQL = reverse(stuff(reverse(@SQL), 1, 12, ';')); -- gets rid of final Union All

    print @SQL; -- For debugging

    exec (@SQL);

    With that, it'll build all the Union All queries for you. You can also set the date range by using an input parameter instead of the date I hard-coded into it.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Ah. Didn't see the prohibition against dynamic SQL. In that case, even though it performs slightly worse, I'd go with the Row_Number version, because otherwise you're going to have to do something really screwy to build your query each time it needs to be used.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I figured it out in case anyone is interested.

    declare @first_month datetime, @last_month datetime

    select @first_month = DATEADD(m,DateDiff(m,0,Min(sale_datetime)),0),

    @last_month = DATEADD(m,DateDiff(m,0,Max(sale_datetime)),0)

    from #temp

    select n.n, t1.*

    from #nums n

    cross apply (select top 10 *

    from #temp

    where sale_datetime >= DATEADD(m,n.n-1,@first_month)

    and sale_datetime < DATEADD(m,n.n,@first_month)

    order by sales_quantity desc) t1

    where n.n <= DATEDIFF(m,@first_month,@last_month)

  • Nice solution, Gabriel. Thanks for posting it up.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Gabriel P (3/27/2009)


    I figured it out in case anyone is interested.

    declare @first_month datetime, @last_month datetime

    select @first_month = DATEADD(m,DateDiff(m,0,Min(sale_datetime)),0),

    @last_month = DATEADD(m,DateDiff(m,0,Max(sale_datetime)),0)

    from #temp

    select n.n, t1.*

    from #nums n

    cross apply (select top 10 *

    from #temp

    where sale_datetime >= DATEADD(m,n.n-1,@first_month)

    and sale_datetime < DATEADD(m,n.n,@first_month)

    order by sales_quantity desc) t1

    where n.n <= DATEDIFF(m,@first_month,@last_month)

    With your test data and the code above, that returns 590 rows... if you want 5 years worth, that's 60 months and multiply that by 10 entries, you should be getting 600 rows returned. I'm thinking something isn't quite right... and, yes, I'm taking a look to see what.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just add +1 to the following and you'll get your 60 months...

    where n.n <= DATEDIFF(m,@first_month,@last_month)[font="Arial Black"]+1[/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I wouldn't use a numbers table on this type of query. I'd rather use a properly indexed calendar type table. I've found that they are very useful for many applications and you don't have to use any sort of calculated dates in a WHERE clause, thus avoiding not being able to use any indexes on date columns.

    Here's an example of a simple month type calendar table going from 1/1/2000 to 300 months later:

    CREATE TABLE CalMonth

    ( MonthID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED

    , MonthStart DATETIME

    , NextMonth DATETIME

    , MonthDescr CHAR(6)

    )

    CREATE INDEX CalMonth_Dates ON CalMonth (MonthStart, NextMonth)

    -- Populate the calendar table.

    ;WITH TallyCTE AS (

    SELECT 0 AS MonthAdd

    UNION ALL

    SELECT TOP 300 ROW_NUMBER() OVER (ORDER BY name)

    FROM sys.syscolumns

    )

    INSERT INTO CalMonth

    ( MonthStart, NextMonth )

    SELECT

    DATEADD(month, T.MonthAdd, '1/1/2000')

    , DATEADD(Month, T.MonthAdd, '2/1/2000')

    FROM TallyCTE T

    -- Fill in the month descriptions

    UPDATE CalMonth

    SET MonthDescr = LEFT(DATENAME(month, MonthStart), 3)

    + '-'

    + RIGHT(CONVERT(VARCHAR, DATEPART(year, MonthStart)), 2)

    Here's your query using the current month and 59 months prior to that. There's no date calculations at all:

    DECLARE

    @FirstMonthINT

    , @LastMonthINT

    SELECT @LastMonth = MonthID FROM CalMonth

    WHERE GETDATE() >= MonthStart

    AND GETDATE() < NextMonth

    SET @FirstMonth = @LastMonth - 59

    SELECT CM.MonthID, CM.MonthDescr AS SalesMonth

    , T1.*

    FROM CalMonth CM

    CROSS APPLY

    (SELECT TOP 10 * FROM #temp

    WHERE sale_datetime >= CM.MonthStart

    AND sale_datetime < CM.NextMonth

    ORDER BY #temp.sales_quantity DESC

    ) AS T1

    WHERE

    CM.MonthID BETWEEN @FirstMonth And @LastMonth

    Put your own ORDER BY at the bottom.

    Todd Fifield

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply