May 10, 2007 at 2:10 pm
I need to rank the results of a query the same way I can rank records in an Excel spreadsheet.
In Excel, using the RANK function, you can show the rank of each record based on where that record falls within a selected group of records. If the current record has the same value as other records, it receives the same rank as the others with that value. Example:
Record  Value  Rank
1       100    1
2       125    2
3       125    2
4       150    3
5       175    4
Is there a way to do this (easily) with a SQL function or UDF?
May 10, 2007 at 3:07 pm
I believe SQL Server 2005 has a RANK function, but for now you're stuck with plain T-SQL. Try this; I just threw it together based on assumptions about data types, etc. from your post.
-- Sample version of your data
DECLARE @table TABLE (Record int IDENTITY(1,1), value int)

INSERT INTO @table (value)
SELECT 100 UNION ALL
SELECT 125 UNION ALL
SELECT 125 UNION ALL
SELECT 150 UNION ALL
SELECT 175

-- The IDENTITY column numbers the distinct values 1, 2, 3... in sorted order
DECLARE @rank TABLE (Rank int IDENTITY(1,1), value int)

INSERT INTO @rank (value)
SELECT DISTINCT value
FROM @table
ORDER BY value

-- Join the ranks back to the original rows; duplicate values pick up the same rank
SELECT t.*, r.Rank
FROM @table t
LEFT JOIN @rank r
ON t.value = r.value
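For what it's worth, if you do move to SQL Server 2005, the tie handling you describe actually matches DENSE_RANK() rather than RANK(). A one-line sketch against the same sample table:

-- SQL Server 2005 and later: ties share a rank and no rank numbers are skipped
SELECT Record, value, DENSE_RANK() OVER (ORDER BY value) AS Rank
FROM @table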
May 11, 2007 at 10:56 am
Thank you for the reply!
I am not sure this is going to work for me. I don't know the rank of each record ahead of time, and it appears from your code that I would need to.
The rank field needs to be populated based on the records. I can get a number applied in the order the records fall in the result set from low to high, but that doesn't give me a true ranking (where identical values receive the same rank).
Does that make sense? Here is an example of the type of results I have, and need to rank:
Record  Value
1       100
2       125
3       100
4       137
5       250
6       200
7       137
This is the result set prior to ordering from low to high on the value field, and prior to assigning a rank. With this data, I need to rank each record: the rank for record 1 would be 1, the rank for record 3 would also be 1, the rank for record 2 would be 2, and so on. Here is what it should look like:
Record  Value  Rank
1       100    1
2       125    2
3       100    1
4       137    3
5       250    5
6       200    4
7       137    3
I need to figure out the code it would take to get from example 1 to example 2. I hope that makes more sense.
Again, thank you for your reply!
May 11, 2007 at 11:02 am
Mark, John's code should do that for you, and you don't need to know the rank ahead of time. He creates a work table that takes all of your distinct values and assigns each one a ranking (the IDENTITY of the table he builds on the fly), then joins that back to the original table to pick up the rank.
What might be confusing you is his @table table variable. That's just a sample version of your data for testing purposes; you can replace it with your table name.
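To make it concrete, with your seven rows the @rank table would contain:

Rank  value
1     100
2     125
3     137
4     200
5     250

Joining that back on value gives both 100s rank 1, both 137s rank 3, and so on.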
May 11, 2007 at 11:05 am
Did you try it? Plug your new values into my example and it still works.
DECLARE @table TABLE (Record int IDENTITY(1,1), value int)
INSERT INTO @table (value)
SELECT 100 UNION ALL
SELECT 125 UNION ALL
SELECT 100 UNION ALL
SELECT 137 UNION ALL
SELECT 250 UNION ALL
SELECT 200 UNION ALL
SELECT 137
DECLARE @rank TABLE (Rank int IDENTITY(1,1), value int)
INSERT INTO @rank (value)
SELECT DISTINCT value
FROM @table
ORDER BY value
SELECT t.*, r.Rank
FROM @table t
LEFT JOIN @rank r
ON t.value = r.value
May 11, 2007 at 11:51 am
Try this using the same logic as above:
SELECT
    t.*,
    -- COUNT(DISTINCT ...) so duplicate smaller values don't inflate the rank:
    -- without it, the two 100s would push 125 to rank 3 instead of 2
    (SELECT COUNT(DISTINCT value) FROM @table WHERE value < t.value) + 1 AS rank
FROM
    @table t
May 14, 2007 at 5:26 am
Hi,
Try this solution, it's working fine:
create table Load_Data
(
rowid int identity(1,1),
data int
)
insert into Load_Data
select 100
union all
select 125
union all
select 125
union all
select 150
union all
select 175
select a.rowid, a.data, count(*) as rank
from Load_Data a,
     (select distinct data from Load_Data) b
where a.data >= b.data
group by a.rowid, a.data
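Run against those five rows, it returns the same ranking as the Excel example at the top of the thread:

rowid  data  rank
1      100   1
2      125   2
3      125   2
4      150   3
5      175   4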
Thanks & Regards
Amit Gupta.
May 14, 2007 at 9:15 am
Now I'm curious what the performance difference would be between Amit's and John's solutions on a large table. At first glance, it appears that Amit's needs two indexes on one table to perform optimally, while John's would need one index on each of two tables.
My guess is that John's would outperform Amit's on a large table, but I could be wrong.
May 14, 2007 at 8:43 pm
No, Sir, David... you're not wrong... but not for the reasons you state...
First, let's build a small table of only 10,000 rows with only the data with which we are concerned... to give everyone the best shot at winning the proverbial foot race which is about to ensue, I've added a Clustered Primary Key to both columns...
--===== Use a database that is present on all installations where no harm
     -- will come if someone happens to have a test table named "JBMTest".
    USE TEMPDB

--===== If the test table exists, drop it
     IF OBJECT_ID('dbo.JBMTest') IS NOT NULL
        DROP TABLE dbo.JBMTest

--===== Create and populate the test table on the fly
 SELECT TOP 10000
        RowNum    = IDENTITY(INT,1,1),
        SomeValue = CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT)
   INTO dbo.JBMTest
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2

  ALTER TABLE dbo.JBMTest
        ALTER COLUMN SomeValue INT NOT NULL
GO
--===== A table is not properly formed unless a Primary Key has been assigned
  ALTER TABLE dbo.JBMTest
        ADD PRIMARY KEY CLUSTERED (RowNum,SomeValue)
GO
Ok, first, John Rowan's script (column names changed to match the test table) with some indication of performance (simple duration)... I did add an ORDER BY so we can see that his code works as expected...
--========================================================================================================
--      Test John's solution
--========================================================================================================
--===== Declare a duration timer variable and set nocount on to cleanup the display
DECLARE @StartTime DATETIME
    SET NOCOUNT ON

--===== Clear and drop cache buffers so everyone starts out the same
   DBCC DROPCLEANBUFFERS
   DBCC FREEPROCCACHE

--===== Start the timer and run John's solution
    SET @StartTime = GETDATE()
DECLARE @Rank TABLE (Rank INT IDENTITY(1,1), SomeValue INT)

 INSERT INTO @Rank (SomeValue)
 SELECT DISTINCT SomeValue
   FROM JBMTest
  ORDER BY SomeValue

 SELECT t.*, r.Rank
   FROM JBMTest t
   LEFT JOIN @Rank r
     ON t.SomeValue = r.SomeValue
  ORDER BY t.SomeValue

--===== Display John's run time
  PRINT 'John''s script: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' milliseconds.'
  PRINT REPLICATE('=',92)
GO
John's "divide and conquer" method runs inbetween 420 and 536 milliseconds on my box... it produces a nearly 100% cpu spike for about the same amount of time as it runs. Pretty good.
Now, let's try Amit's... as with John's code, I did add an ORDER BY so we can see that his code works as expected...
--========================================================================================================
--      Test Amit's solution
--========================================================================================================
--===== Declare a duration timer variable and set nocount on to cleanup the display
DECLARE @StartTime DATETIME
    SET NOCOUNT ON

--===== Clear and drop cache buffers so everyone starts out the same
   DBCC DROPCLEANBUFFERS
   DBCC FREEPROCCACHE

--===== Start the timer and run Amit's solution
    SET @StartTime = GETDATE()
 SELECT a.RowNum, a.SomeValue, COUNT(*) AS Rank
   FROM JBMTest a,
        (SELECT DISTINCT SomeValue FROM JBMTest) b
  WHERE a.SomeValue >= b.SomeValue
  GROUP BY a.RowNum, a.SomeValue
  ORDER BY a.SomeValue

--===== Display Amit's run time
  PRINT 'Amit''s script: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' milliseconds.'
  PRINT REPLICATE('=',92)
Yeaowch! About 58 seconds (58,796 milliseconds on my box) and it virtually locked up my single processor box at 100% during the whole run! Amit, please don't take what I'm getting ready to say personally... I've got to talk a bit about this very common performance mistake that good folks, like yourself, sometimes make.
Amit's code is actually worse than a cursor... it's hidden RBAR (pronounced "ree-bar", a "Modenism" for "Row By Agonizing Row") on steroids! Just because it doesn't have a loop doesn't mean it's truly set based! The key is this line in the WHERE clause...
WHERE a.SomeValue >= b.SomeValue
THAT, Ladies and Gentlemen, is known as a "triangular join". A full Cartesian join is sometimes called a "square join" because its product forms a square (like a times table would). A "triangular join" is a little more than half a Cartesian join for ">=" or a little less than half for ">". If tables "a" and "b" had the same number of rows, the formula for how many internal rows are spawned would be ((A*B)+A)/2. Plug in 10,000 for both "A" and "B"... you can do the math... it comes out to 50,005,000 rows!!!! Because table "b" is actually a bit smaller (DISTINCT), the number will be a bit less, but if you look at the estimated execution plan (on the arrow just to the left of the Loop Join) for Amit's code, you'll still find a HUGE number, something like 45,299,832!!! That's the number of rows SQL Server had to touch to come up with the answer, touching many of them more than once (kinda obvious here).
Not all triangular joins are as devastating as this one, BUT, if you see one in code, you really need to check it out, because the number of internal rows created increases by two orders of magnitude for every order of magnitude increase in row count. For example, if "a" and "b" each have 100,000 rows, the number of internal rows created is a whopping 5,000,050,000!!!! How long do you think THAT will take???
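If you want to watch the explosion for yourself, here's a quick sanity check you can run against the JBMTest table built above (just a sketch; keep the TOP small or it will crawl):

--===== Count the internal rows the triangular join spawns... for N mostly
     -- distinct rows this comes out to roughly ((N*N)+N)/2
 SELECT COUNT_BIG(*) AS InternalRows
   FROM (SELECT TOP 1000 SomeValue FROM dbo.JBMTest) a,
        (SELECT TOP 1000 SomeValue FROM dbo.JBMTest) b
  WHERE a.SomeValue >= b.SomeValue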
By the way... I did run John's code on a million rows and, including the time it took to display in the grid in QA, it only took 20 seconds... don't try that on a triangular join of a million rows because the number of internal rows generated will approach 500,000,500,000 rows... that's 500 BILLION rows... last time I tried something like that, it took 3 days before I finally killed it.
Camilo's code has a nearly identical problem... it's a triangular join. But, because it occurs in the SELECT list as a correlated sub-query, it's faster than Amit's at about 11 seconds. If you MUST do it all in a single query (not sure why people think they need to do that), then restrict the triangular join to the SELECT list like Camilo did. Just keep in mind that the divide and conquer method that John used runs about 22 times faster...
Like I said, Amit and, now, Camilo... please don't take this bit of diatribe on my part personally... but I had to let folks know that triangular joins to solve things like Ranking are some of the worst possible performance hits you could build into code.
Finally, I'm not going to contaminate my mind by writing a cursor demo, but previous experience with evaluating others' code indicates that a well-written cursor to solve this problem would operate at about 6,000 rows per second (on my box)... that would put the run time at about 1.6 seconds... still 3 times slower than the divide and conquer method John used... and it uses a whole lot more resources... and is much more complicated to write... and STILL can't be done in a single query. John's code could be wrapped in a table-valued function if you really need a single query.
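If you really do need it callable as a single query, a sketch of that wrapper might look like this (the function name is mine, and it hard-codes the JBMTest table; adjust to your own schema):

CREATE FUNCTION dbo.fnDenseRank ()
RETURNS @Result TABLE (RowNum INT, SomeValue INT, Rank INT)
AS
BEGIN
        --===== Build the distinct-value rank table, same as John's code
        DECLARE @Rank TABLE (Rank INT IDENTITY(1,1), SomeValue INT)

        INSERT INTO @Rank (SomeValue)
        SELECT DISTINCT SomeValue FROM dbo.JBMTest ORDER BY SomeValue

        --===== Join the ranks back to the source rows
        INSERT INTO @Result (RowNum, SomeValue, Rank)
        SELECT t.RowNum, t.SomeValue, r.Rank
          FROM dbo.JBMTest t
          LEFT JOIN @Rank r ON t.SomeValue = r.SomeValue

        RETURN
END
GO
--===== Usage: one set-based query
 SELECT * FROM dbo.fnDenseRank() ORDER BY SomeValue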
Again, be very wary of triangular joins no matter where they appear. Like I said, they're not all bad, but most of them are.
Oh yeah... for all those who think speed doesn't matter... look at the stats I just posted... would you rather lock up a machine for 500 milliseconds or for 58 seconds? Speed matters... it's an indication of properly written, effective code.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 15, 2007 at 8:51 pm
Very nice work, Jeff!
I didn't know that I stated any reasons (I'm guessing you are commenting on my references to indexes, but to me they're a wash, as they are one time things).
Actually, what led me to believe that John's would be faster was exactly what you found, combined with the "SELECT DISTINCT SomeValue FROM JBMTest".
May 15, 2007 at 11:57 pm
Heh... yeah, I was talking about the indexes. BTW (dunno if you took it this way), I didn't mean to make it sound like that whole thing was directed at you or anything like that... I've just seen lots of people get burned by triangular joins because they seem to work great with a small number of rows and then die quickly in the face of scalability. Figured now was as good a time as any to beat that drum.
And thanks for the compliment.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 16, 2007 at 3:57 pm
Great post Jeff. I wanted to run through a test after David posted the question of performance, but I didn't have the time to get one put together. I did not, however, think the performance difference would be that great. I have yet to run into any code making use of triangular joins. This is all very useful info. Thanks again - it's folks like you that make SSC an awesome resource.
May 16, 2007 at 4:16 pm
Great post Jeff... do you have a reference on triangular joins? I've not heard the term before, but I have run into the problem found in the code provided...
Cheers,
Ben
Ben Sullins
bensullins.com
Beer is my primary key...