Returning the Top X row for each group

  • ganotedp (11/23/2012)


    Why not DENSE_RANK instead of ROW_NUMBER?

    Hi ganotedp,

    The article does state

    If there were ties and we wanted each of them treated the same then we would use the DENSE_RANK function rather than ROW_NUMBER. In some systems if there is a tie, for example two runners finished in exactly the same time for first place, then there is no following position. That means no second place runner and we skip to third. We would then use the RANK function. In our example, however, we are going to ignore the possibility of ties.



    Clear Sky SQL
    My Blog[/url]

  • Great article - especially the CROSS APPLY option. I've come across the need for this several times (although sadly have been constrained by SQL 2000, which doesn't work with either option).

  • Thank you.

  • Nice article.

  • It's a real shame that websites don't last forever.  I'll bet that was interesting especially based on the old "beyondrelational" site, which was an awesome site in its heyday.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • More people need to know about the method that Dave showed in this article.

    On modern machines, we still get 3 orders of magnitude difference in CPU and Duration (shown as micros seconds below) and a healthy order of magnitude difference in the number of reads.

    Thanks again for a great article, Dave!

    @steve-2 Jones - Perhaps changing the name of this article to what it really is would help people find it... For example...

    "A MUCH FASTER Method for Returning the Top X Rows for Each Group"

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • 14 YEARS ago,  my god!  Always makes me feel warm inside when then notification of such a blast from the past lands in my inbox.

    Maybe its time not just to change the Article title but a whole curated collection of "Techniques that have stood the test of time".

     



    Clear Sky SQL
    My Blog[/url]

  • Dave Ballantyne wrote:

    Maybe its time not just to change the Article title but a whole curated collection of "Techniques that have stood the test of time".

    It would be interesting to see something like this.  In another topic Jeff Moden mentioned he has a "package" which compares SQL across different year releases.  Maybe using sqlcmd and Powershell or bash

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • ratbak wrote:

    From the wayback machine: https://web.archive.org/web/20111212031052/http://beyondrelational.com/blogs/madhivanan/archive/2008/09/12/return-top-n-rows.aspx%5B/quote%5D

    Holy shades of Herman and Mr. Peabody!  I always forget about the "Wayback" machine. Thank you for posting that link.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Steve Collins wrote:

    In another topic Jeff Moden mentioned he has a "package" which compares SQL across different year releases. 

    I did?  I don't remember... I've gotten old enough where I think my harddrive is full and, every time I learn something new, either something else in my mind has to be super-compressed (making it non-searchable and I don't remember where I put the "inventory" 😀 ) or sometime just falls out of an ear. 😀  Sometimes a good burp will cause such a loss of memory. 😀

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Dave Ballantyne wrote:

    14 YEARS ago,  my god!  Always makes me feel warm inside when then notification of such a blast from the past lands in my inbox.

    Maybe its time not just to change the Article title but a whole curated collection of "Techniques that have stood the test of time".

    There are a bunch of things and thinks (aka, "Black Arts of SQL") that should not be forgotten ever.  A recent great example of "why" is because of the new GENERATE_SERIES() function that came out with 2022.  Supposedly, you don't need something like fnTally or Itzik Ben-Gan's "GetNum" function any more.  That's a big WRONG, especially if you're using it as a row source to generate a shedload of test data (52 GB test table, for example) and you don't want a 52.3 GB or larger transaction log on your laptap (even in the Simple Recovery Model).  You need to use "Minimal logging" but the new GENERATE_SERIES() function breaks that and creates an 87.95 GB transaction log file just to see if your test box has the "right stuff" not to mention taking almost 14 minutes.

    Using the old ways, it only takes 2 minutes and 41 seconds and only 600 MB of log file.

    I'm all in favor of improving the SEO/Findability of these old articles and, maybe, adding a little rework to them without losing the original read stats on them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    I did?  I don't remember... I've gotten old enough where I think my harddrive is full and, every time I learn something new, either something else in my mind has to be super-compressed (making it non-searchable and I don't remember where I put the "inventory" 😀 ) or sometime just falls out of an ear. 😀  Sometimes a good burp will cause such a loss of memory. 😀

    Well maybe not.  This was referencing the "A Different View of Technical Debt" topic.  You mentioned potentially putting together a package to test "what has happened with SQL Server 2022".  Maybe I read into it a bit.  Unit testing t-SQL has been on my mind lately and I'm trying to come up with situations which could be realistic use cases.  The list of issues in your post are important and I'm wondering if/how there're good ways to formalize it more.   It would seem in this case the unit tests could be useful both as learning and also for tracking changes

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Jeff Moden wrote:

    More people need to know about the method that Dave showed in this article.

    On modern machines, we still get 3 orders of magnitude difference in CPU and Duration (shown as micros seconds below) and a healthy order of magnitude difference in the number of reads.

    Thanks again for a great article, Dave!

    @steve-2 Jones - Perhaps changing the name of this article to what it really is would help people find it... For example...

    "A MUCH FASTER Method for Returning the Top X Rows for Each Group"

    It is a really nice approach.  Nice job by the author.  Is it a general method or only for cases where the lefthand side table can be replaced with a numbers/tally table/function?  Imo the lefthand range of values could often be unknown.  I ran the query from the article 3 ways: 1) with a window function, 2) with a tally function, and 3) with CTE and SELECT DISTINCT to get the lefthand side ages.  The tally is like a 100x faster.  But even with SELECT DISTINCT it's still about 2x faster

    drop Table if exists #RunnersBig;
    go
    Create Table #RunnersBig(
    RunnerId integer identity ,
    Time integer not null,
    Age integer not null);

    insert into #runnersbig ( Time , Age )
    select top 100000 ABS ( checksum ( newid ()))% 1000 ,
    ABS ( checksum ( newid ()))% 99
    from sys.columns a
    cross join sys.columns b
    cross join sys.columns c;
    create index idxrunnersbig on #runnersbig ( age , time ) include ( runnerid );
    --select top 100 * from #RunnersBig;

    set statistics io on;
    set statistics time on;

    /* window function */
    with cteRunners as (
    select * ,row_number() over (partition by Age order by Time ) as RowN
    from #RunnersBig)
    Select * from cteRunners
    where RowN <=2
    order by Age,Rown;

    /* tally */
    Select *
    from dbo.fnTally(1, 100) cteN cross apply ( Select top ( 2 ) *
    from #RunnersBig
    where #RunnersBig.Age = cteN.N
    order by Time ) as runners
    order by cteN.N, runners.Time ;

    /* distinct left side of join */
    with cteN as (
    select distinct Age
    from #RunnersBig)
    Select *
    from cteN cross apply ( Select top ( 2 ) *
    from #RunnersBig
    where #RunnersBig.Age = cteN.Age
    order by Time ) as runners
    order by cteN.Age, runners.Time;

    set statistics io off;
    set statistics time off;

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Steve Collins wrote:

    Jeff Moden wrote:

    I did?  I don't remember... I've gotten old enough where I think my harddrive is full and, every time I learn something new, either something else in my mind has to be super-compressed (making it non-searchable and I don't remember where I put the "inventory" 😀 ) or sometime just falls out of an ear. 😀  Sometimes a good burp will cause such a loss of memory. 😀

    Well maybe not.  This was referencing the "A Different View of Technical Debt" topic.  You mentioned potentially putting together a package to test "what has happened with SQL Server 2022".  Maybe I read into it a bit.  Unit testing t-SQL has been on my mind lately and I'm trying to come up with situations which could be realistic use cases.  The list of issues in your post are important and I'm wondering if/how there're good ways to formalize it more.   It would seem in this case the unit tests could be useful both as learning and also for tracking changes

    Ah.. got it.  I spent a large amount of time last night on the code related to that and discovered there's another problem... somewhere along the line, either a Windows update slowed 2017 down on my box or some company provided updates did.  I have documented run times in presentations with the code that was executed to get the run times.  Some things that use to take ~1500ms to execute and they now take ~2000ms.  That doesn't sound devastating on 20 million rows but consider this... it's common code and it now takes ~33% longer to run.  Do that with about 80% of the code you have running in production and wonder why I'm so very ticked off at what has been going on with Windows Updates, CUs, and other stuff.

    It's difficult to identify a problem when they've changed things that affect the source of the truth, namely the old revision, SQL Serve 2017.

    The "good" part is, hopefully that change will affect 2022 on the same box in the same way.  I'm just totaly losing it that some very well running code is suddenly running 33% slower without any help from me at all.

     

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 31 through 44 (of 44 total)

You must be logged in to reply to this topic. Login to reply