Top n of each value ?

  • Sample code is below.

    What I want to do is list the top 2 records that have the same name for a given batch, so if there are say five records marked "RED" I want to see two of them, if there is only one "PURPLE" I want to see that as well.

    I've got the query working using row_number(), but I'm not sure this is the most efficient way ? I need to return the whole record (including unique columns), hence not using group.

    This will eventually be used on fairly large sets of data - maybe a couple of million rows, although its more likely to be used on far smaller amounts of data on a regular basis.

    Any suggestions ?

    thanks

    -- Create table

    if exists ( select 1 from SYSOBJECTS where NAME = 'TOPTEN' and TYPE = 'U')

    drop table TOPTEN

    go

    create table TOPTEN

    (

    ID int not null primary key clustered,

    NAME varchar(10),

    BATCH int

    )

    -- create data

    insert into TOPTEN (ID, NAME, BATCH)

    select 3456, 'BLUE', 10 union all

    select 91, 'YELLOW', 10 union all

    select 3, 'RED', 10 union all

    select 278, 'BLUE', 10 union all

    select 69, 'YELLOW', 10 union all

    select 10, 'RED', 10 union all

    select 560, 'BLUE', 10 union all

    select 567, 'YELLOW', 10 union all

    select 12, 'RED', 10 union all

    select 66, 'BLACK', 10 union all

    select 1235, 'BLACK', 10 union all

    select 9, 'PURPLE', 10

    -- Return all colours in a batch, but do not return more than two of the same

    select * from

    (

    select *, row_number() over (partition by NAME order by NAME desc) as counter from TOPTEN where BATCH = 10

    ) pis

    where pis.Counter < 3

    order by pis.NAME

  • Eight seconds for around 12 million (simple) rows looks good to me:

    [font="Courier New"]-- Return all colours in a batch, but do not return more than two of the same

    SELECT * FROM

    (

    SELECT *, row_number() OVER (partition BY NAME ORDER BY NAME DESC) AS counter

       FROM TOPTEN a

       CROSS JOIN Numbers b WHERE BATCH = 10 AND number < 1000000

    ) pis

    WHERE pis.Counter < 3

    ORDER BY pis.NAME

    -- 8 seconds

    ;WITH pis AS (SELECT *, row_number() OVER (partition BY NAME ORDER BY NAME DESC) AS counter

       FROM TOPTEN a

       CROSS JOIN Numbers b

       WHERE BATCH = 10 AND number < 1000000)

    SELECT * FROM pis WHERE Counter < 3

    -- 8 seconds

    [/font]

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • It must be faster. If not compare the execution paln by applying these

    http://sqlblogcasts.com/blogs/madhivanan/archive/2008/09/12/return-top-n-rows.aspx


    Madhivanan

    Failing to plan is Planning to fail

  • Madhivanan (12/1/2008)


    It must be faster. If not compare the execution paln by applying these

    http://sqlblogcasts.com/blogs/madhivanan/archive/2008/09/12/return-top-n-rows.aspx

    Only the third method would work - and it's exactly the same as the OP's suggestion.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I'd suggest using the TOP and an ORDER BY within the sub-query instead. Assuming there is an index for the ORDER BY clause to use, I think you'll see better performance for this. If you can get ahold of the latest copy of the SQL Standard (discontinued as of this month), I wrote an article comparing TOP, MAX & ROW_NUMBER for exactly this type of query. TOP works a bit better, especially with large data sets.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (12/1/2008)


    I'd suggest using the TOP and an ORDER BY within the sub-query instead. Assuming there is an index for the ORDER BY clause to use, I think you'll see better performance for this. If you can get ahold of the latest copy of the SQL Standard (discontinued as of this month), I wrote an article comparing TOP, MAX & ROW_NUMBER for exactly this type of query. TOP works a bit better, especially with large data sets.

    Thanks for the tip, Grant, Ill try and get hold of that article. Do you mean this...

    SELECT o.*

    FROM #TOPTEN AS o

    WHERE BATCH = 10

    AND [ID] IN

    (SELECT TOP 2 [ID]

    FROM #TOPTEN

    WHERE [NAME]=o.[NAME] AND BATCH = 10

    ORDER BY [ID] DESC)

    ORDER BY o.[NAME]

    ...which is the same as Madhivanan's Option 1?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks for all the replies,

    I will give the sub-query method a go.

    Grant I will try and hunt down a copy of your article, thanks for the tip.

  • Chris Morris (12/1/2008)


    Thanks for the tip, Grant, Ill try and get hold of that article. Do you mean this...

    SELECT o.*

    FROM #TOPTEN AS o

    WHERE BATCH = 10

    AND [ID] IN

    (SELECT TOP 2 [ID]

    FROM #TOPTEN

    WHERE [NAME]=o.[NAME] AND BATCH = 10

    ORDER BY [ID] DESC)

    ORDER BY o.[NAME]

    ...which is the same as Madhivanan's Option 1?

    Yeah, that's basically it. In most of our instances, we're looking for the latest version of a record, so we're going for one value in each set, not a list. The IN clause may change this negatively.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (12/1/2008)


    Yeah, that's basically it. In most of our instances, we're looking for the latest version of a record, so we're going for one value in each set, not a list. The IN clause may change this negatively.

    Well well. You'd never guess that this construct would be performant. Shows the importance of testing. Thanks!

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Well now, our methods worked, but they were predicated off of joins and a single row out of each set, not multiple rows out of each set. I still think the IN construct might be a pain. However, my testing showed that with smaller data sets the ROW_NUMBER method worked fine, but it started breaking down at larger data sets. MAX worked OK, but frequently not as well as TOP.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Chris Morris (12/1/2008)


    Madhivanan (12/1/2008)


    It must be faster. If not compare the execution paln by applying these

    http://sqlblogcasts.com/blogs/madhivanan/archive/2008/09/12/return-top-n-rows.aspx

    Only the third method would work - and it's exactly the same as the OP's suggestion.

    Apologies, Madhivanan; not only would your third method work but your first method would also be appropriate for the task. My mistake.:blush:

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply