Ranking records in query results

  • I need to Rank the results of a query the same way I can Rank records in an Excel spreadsheet.

    In Excel, using the Ranking Function, you can show the rank of each record based on how that record falls in a selected group of records.  If the current record has the same value as other records, it receives the same rank (value) as the others with that value.   Example:

    Record     value   rank

    1             100      1

    2             125      2

    3             125      2

    4             150      3

    5             175      4

    Is there a way to do this with a SQL function, or udf? (easily)

  • I believe SQL Server 2005 has a RANK function, but for now, you're stuck with using T-SQL.  Try this, I just threw it together based on assumptions on data types, etc. from your post. 

    DECLARE @table TABLE (Record int IDENTITY(1,1), value int)

    INSERT INTO @table (value)

    SELECT 100 UNION ALL

    SELECT 125 UNION ALL

    SELECT 125 UNION ALL

    SELECT 150 UNION ALL

    SELECT 175

    DECLARE @rank TABLE (Rank int IDENTITY(1,1), value int)

    INSERT INTO @rank (value)

    SELECT DISTINCT value

    FROM @table

    ORDER BY value

    SELECT t.*, r.Rank

    FROM @table t

        LEFT JOIN @rank r

        ON t.value = r.value

     

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Thank you for the reply!

    I am not sure if this is going to work for me.  I dont know the rank ahead of time of each record, and it appears from your code that I would need to know that.

    The rank field needs to be populated based on the records.  I can get a number applied to the order the records fall in the result set from low to high, but that doesnt give me a true ranking (where identical values receive the same rank).

    Does that make sense?  Here is an example of the type of results I have, and need to rank:

    Record     Value

    1             100

    2             125

    3             100

    4             137

    5             250

    6             200

    7             137

    This is the result set prior to ordering from low to high in the value field.  Also, prior to assigning a rank.  With this data, I need to rank each record.  The rank for record number 1 would be 1.  The rank for record number 3 would also be 1.  The rank for record number 2 would be 2, and so on.  Here is what it should look like....

    Record     Value     Rank

    1             100        1

    2             125        2

    3             100        1

    4             137        3

    5             250        5

    6             200        4

    7             137        3

    I need to figure out the code it would take to get from example 1 to example 2.  I hope that makes more sense. 

    Again, thank you for your reply!

  • Mark, John's code should do that for you, and you don't need to know rank ahead of time. He's creating a table to work with, which takes all of your distinct values and assigns them a ranking. Then he joins that back to the original table to get the rank (the identity of that table he creates on the fly).

    What might be confusing you is his @table table variable. That's just a sample version of your data for testing purposes, so you can replace it with your table name.

  • Did you try it.  Plug your new values into my example and it still works. 

     

    DECLARE @table TABLE (Record int IDENTITY(1,1), value int)

    INSERT INTO @table (value)

    SELECT 100 UNION ALL

    SELECT 125 UNION ALL

    SELECT 100 UNION ALL

    SELECT 137 UNION ALL

    SELECT 250 UNION ALL

    SELECT 200 UNION ALL

    SELECT 137

    DECLARE @rank TABLE (Rank int IDENTITY(1,1), value int)

    INSERT INTO @rank (value)

    SELECT DISTINCT value

    FROM @table

    ORDER BY value

    SELECT t.*, r.Rank

    FROM @table t

        LEFT JOIN @rank r

        ON t.value = r.value

     

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • PS - Thanks for the detailed clarification David.  That's exactly what is happening....

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Try this using the same logic as above:

    SELECT

    *,(select count(value) as rank from @table where value < t.value) + 1 as rank

    FROM

    @table t

  • This solution does not account for ties in the same way that the poster requested.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Hi ,

    Try This Solution , iTS WORKING fINE :

    create table Load_Data

    (

    rowid int identity(1,1),

    data int

    )

    insert into Load_Data

    select 100

    union all

    select 125

    union all

    select  125

    union all

    select 150

    union all

    select 175

    select a.rowid,a.data,count(*) from load_data a , (select distinct data from load_data) b

    where a.data >= b.data

    group by a.rowid,a.data

     

    Thanks & Regards

    Amit Gupta.

     

  • Now I'm curious what the performance difference would be between Amit's and John's solutions on a large table. It appears at first glance that Amit's needs two indexes on one table to perform optimally and John's would need one index on two tables to perform optimally.

    My guess is that John's would outperform Amit's on a large table, but I could be wrong.

  • No, Sir, David... you're not wrong... but not for the reasons you state...

    First, let's build a small table of only 10,000 rows with only the data with which we are concerned... to give everyone the best shot at winning the proverbial foot race which is about to ensue, I've added a Clustered Primary Key to both columns...

    --===== Use a database that is present on all installations where no harm
         -- will come if someone happens to have a test table named "JBMTest".
        USE TEMPDB
    --===== If the test table exists, drop it
         IF OBJECT_ID('dbo.JBMTest') IS NOT NULL
            DROP TABLE dbo.JBMTest
    --===== Create and populate the test table on the fly
     SELECT TOP 10000
            RowNum    = IDENTITY(INT,1,1),
            SomeValue = CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT)
       INTO dbo.JBMTest
       FROM Master.dbo.SysColumns sc1,
            Master.dbo.SysColumns sc2
      ALTER TABLE dbo.JBMTest
      ALTER COLUMN SomeValue INT NOT NULL
    GO
    --===== A table is not properly formed unless a Primary Key has been assigned
      ALTER TABLE dbo.JBMTest
             ADD PRIMARY KEY CLUSTERED (RowNum,SomeValue)
    GO

    Ok, first, John Rowan's script (column names changed to match the test table) with some indication of performance (simple duration)... I did add an ORDER BY so we can see that his code works as expected...

    --========================================================================================================
    --      Test John's solution
    --========================================================================================================
    --===== Declare a duration timer variable and set nocount on to cleanup the display
    DECLARE @StartTime DATETIME
        SET NOCOUNT ON
    --===== Clear and drop cache buffers so everyone starts out the same
       DBCC FREEPROCCACHE
       DBCC FREEPROCCACHE
    --===== Start the timer and run Amit's solution
        SET @StartTime = GETDATE()
    DECLARE @Rank TABLE (Rank INT IDENTITY(1,1), SomeValue INT)
     INSERT INTO @Rank (SomeValue)
     SELECT DISTINCT SomeValue
       FROM JBMTest
      ORDER BY SomeValue 
     SELECT t.*, r.Rank
       FROM JBMTest t
       LEFT JOIN @Rank r
         ON t.SomeValue = r.SomeValue
      ORDER BY t.SomeValue
    --===== Display John's run time
      PRINT 'John''s script: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' milliseconds.'
      PRINT REPLICATE('=',92)
    GO

    John's "divide and conquer" method runs inbetween 420 and 536 milliseconds on my box... it produces a nearly 100% cpu spike for about the same amount of time as it runs.  Pretty good.

    Now, let's try Amit's... as with John's code, I did add an ORDER BY so we can see that his code works as expected...

    --========================================================================================================
    --     Test Amit's solution
    --========================================================================================================
    --===== Declare a duration timer variable and set nocount on to cleanup the display
    DECLARE @StartTime DATETIME
        SET NOCOUNT ON
    --===== Clear and drop cache buffers so everyone starts out the same
       DBCC FREEPROCCACHE
       DBCC FREEPROCCACHE
    --===== Start the timer and run Amit's solution
        SET @StartTime = GETDATE()
     SELECT a.RowNum,a.SomeValue,COUNT(*) AS Rank FROM JBMTest a , (SELECT DISTINCT SomeValue FROM JBMTest) b
      WHERE a.SomeValue >= b.SomeValue
      GROUP BY a.RowNum,a.SomeValue
      ORDER BY a.SomeValue
    --===== Display Amit's run time
      PRINT 'Amit''s script: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' milliseconds.'
      PRINT REPLICATE('=',92)

    Yeaowch!  About 58 seconds (58,796 milliseconds on my box) and it virtually locked up my single processor box at 100% during the whole run!  Amit, please don't take what I'm getting ready to say personally... I've got to talk a bit about this very common performance mistake that good folks, like yourself, sometimes make.

    Amit's code is actually worse than a cursor... it's hidden RBAR (pronounced "ree-bar" and is a "Modenism" for "Row by Agonizing Row) on steriods!  Just because it doesn't have a loop, doesn't mean that it's trully set based!  The key is this line in the WHERE clause...

    WHERE a.SomeValue >= b.SomeValue

    THAT, Ladies and Gentlemen, is known as a "triangular join".  A full Cartesian join is sometimes called a "square join" because it's product forms a square (like a times table would).  A "triangular join" is either a little bit more than half a Cartesian join for ">=" or just a little less than half a Cartesian join for ">".  If table "a" and "b" had the same number of rows, the formula for how many internal rows that are spawned would be ((A*B)+A)/2.  Let's plug in the number "10000" for "A" and "B"... you can do the math... it comes out to 50,005,000 rows!!!!  Because table "B" is actually a bit smaller (DISTINCT), the number will be a bit less, but if you look at the estimated execution plan (on the arrow just to the left of the Loop Join) for Amit's code, you'll still find a HUGE number something like 45,299,832!!!  That's the number of rows that SQL Server had to touch to come up with the answer even if it had to touch many of the rows more than once (kinda obvious here).

    Not all triangular joins are as devastating as this one, BUT, if you see one in code, you really need to check it out because the number of internal rows created increases by two orders of magnitude for every order of increase.  For example, if "a" and "b" have the same number of rows at 100,000, the number of internal rows created is a whopping 5,000,050,000!!!!  How long do you thing THAT will take???

    By the way... I did run John's code on a million rows and, including the time it took to display in the grid in QA, it only took 20 seconds... don't try that on a triangular join of a million rows because the number of internal rows generated will approach 500,000,500,000 rows... that's 500 BILLION rows... last time I tried something like that, it took 3 days before I finally killed it.

    Camilo's code has a nearly identical problem... it's a triangular join.  But, because it occurs in the SELECT list as a correlated sub-query, it's faster than Amit's at about 11 seconds.  If you MUST do it all in a single query (not sure why people think they need to do that), then restrict the triangular join to the SELECT list like Camilo did.  Just keep in mind that the divide and conquer method that John used runs about 22 times faster...

    Like I said, Amit and, now, Camilo... please don't take this bit of diatribe on my part personally... but I had to let folks know that triangular joins to solve things like Ranking are some of the worst possible performance hits you could build into code. 

    Finally, I'm not going to contaminate my mind by writing a cursor demo , but previous experience with evaluating other's code indicates that a well written cursor to solve this problem would operate at about 6,000 rows per second (on my box)... that would put the run time at about 1.6 seconds... that's still 3 times slower than the divide and conquer method John used... and it uses a whole lot more resources... and is much more complicated to write... and STILL can't be done in a single query.  John's code could be made to return a table variable function if you really need a single query.

    Again, be very wary of triangular joins no matter where they appear.  Like I said, they're not all bad, but most of them are.

    Oh yeah... for all those that think speed doesn't matter... look at the stats I just posted... would you rather lockup a machine for 500 milliseconds or 58 seconds?  Speed matters... it's an indication of properly written and effective code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Very nice work, Jeff!

    I didn't know that I stated any reasons (I'm guessing you are commenting on my references to indexes, but to me they're a wash, as they are one time things).

    Actually, what led me to believe that John's would be faster was what you found, but in combination with the "SELECT DISTINCT SomeValue FROM JBMTest".

  • Heh... yeah, I was talking about the indexes.  BTW (dunno if you took it this way)  Didn't mean to make it sound like that whole thing was directed at you or anything like that... I've just seen lot's of people get burned by triangular joins because they seem to work great with a small number of rows and die quickly in the face of scalability.  Figured now was as good as any to beat that drum  

    And thanks for the compliment.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Great post Jeff.  I wanted to run through a test after David posted the question of performance, but I didn't have the time to get one put together.  I did not, however, think the performance difference would be that great.  I have yet to run into any code making use of triangular joins.  This is all very useful info.  Thanks again - it's folks like you that make SSC an awesome resource.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Great post Jeff...do you have a reference on Trinagular joins? I've not heard this term before but have run into the problem found in the code provided...

    Cheers,

    Ben


    Cheers,

    Ben Sullins
    bensullins.com
    Beer is my primary key...

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply