May 10, 2007 at 2:10 pm
I need to rank the results of a query the same way I can rank records in an Excel spreadsheet.
In Excel, using the RANK function, you can show the rank of each record based on where that record falls within a selected group of records. If the current record has the same value as other records, it receives the same rank as the others with that value. Example:
Record  Value  Rank
1       100    1
2       125    2
3       125    2
4       150    3
5       175    4
Is there a way to do this (easily) with a SQL function or UDF?
May 10, 2007 at 3:07 pm
I believe SQL Server 2005 has a RANK function, but for now you're stuck with plain T-SQL. Try this; I just threw it together based on assumptions about data types, etc. from your post.
-- Sample version of your data
DECLARE @table TABLE (Record int IDENTITY(1,1), value int)

INSERT INTO @table (value)
SELECT 100 UNION ALL
SELECT 125 UNION ALL
SELECT 125 UNION ALL
SELECT 150 UNION ALL
SELECT 175

-- The IDENTITY column numbers the distinct values 1, 2, 3... in sorted order
DECLARE @rank TABLE (Rank int IDENTITY(1,1), value int)

INSERT INTO @rank (value)
SELECT DISTINCT value
FROM @table
ORDER BY value

-- Join the ranks back to the original rows; duplicate values pick up the same rank
SELECT t.*, r.Rank
FROM @table t
LEFT JOIN @rank r
ON t.value = r.value
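For what it's worth, if you do move to SQL Server 2005, the tie handling you describe actually matches DENSE_RANK() rather than RANK(). A one-line sketch against the same sample table:

-- SQL Server 2005 and later: ties share a rank and no rank numbers are skipped
SELECT Record, value, DENSE_RANK() OVER (ORDER BY value) AS Rank
FROM @table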
May 11, 2007 at 10:56 am
Thank you for the reply!
I am not sure this is going to work for me. I don't know the rank of each record ahead of time, and it appears from your code that I would need to.
The rank field needs to be populated based on the records. I can get a number applied in the order the records fall in the result set from low to high, but that doesn't give me a true ranking (where identical values receive the same rank).
Does that make sense? Here is an example of the type of results I have, and need to rank:
Record  Value
1       100
2       125
3       100
4       137
5       250
6       200
7       137
This is the result set prior to ordering from low to high on the value field, and prior to assigning a rank. With this data, I need to rank each record: the rank for record 1 would be 1, the rank for record 3 would also be 1, the rank for record 2 would be 2, and so on. Here is what it should look like:
Record  Value  Rank
1       100    1
2       125    2
3       100    1
4       137    3
5       250    5
6       200    4
7       137    3
I need to figure out the code it would take to get from example 1 to example 2. I hope that makes more sense.
Again, thank you for your reply!
May 11, 2007 at 11:02 am
Mark, John's code should do that for you, and you don't need to know the rank ahead of time. He creates a work table that takes all of your distinct values and assigns each one a ranking (the IDENTITY of the table he builds on the fly), then joins that back to the original table to pick up the rank.
What might be confusing you is his @table table variable. That's just a sample version of your data for testing purposes; you can replace it with your table name.
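To make it concrete, with your seven rows the @rank table would contain:

Rank  value
1     100
2     125
3     137
4     200
5     250

Joining that back on value gives both 100s rank 1, both 137s rank 3, and so on.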
May 11, 2007 at 11:05 am
Did you try it? Plug your new values into my example and it still works.
DECLARE @table TABLE (Record int IDENTITY(1,1), value int)
INSERT INTO @table (value)
SELECT 100 UNION ALL
SELECT 125 UNION ALL
SELECT 100 UNION ALL
SELECT 137 UNION ALL
SELECT 250 UNION ALL
SELECT 200 UNION ALL
SELECT 137
DECLARE @rank TABLE (Rank int IDENTITY(1,1), value int)
INSERT INTO @rank (value)
SELECT DISTINCT value
FROM @table
ORDER BY value
SELECT t.*, r.Rank
FROM @table t
LEFT JOIN @rank r
ON t.value = r.value
May 11, 2007 at 11:51 am
Try this using the same logic as above:
SELECT
    t.*,
    -- COUNT(DISTINCT ...) so duplicate smaller values don't inflate the rank:
    -- without it, the two 100s would push 125 to rank 3 instead of 2
    (SELECT COUNT(DISTINCT value) FROM @table WHERE value < t.value) + 1 AS rank
FROM
    @table t
May 14, 2007 at 5:26 am
Hi,
Try this solution, it's working fine:
create table Load_Data
(
rowid int identity(1,1),
data int
)
insert into Load_Data
select 100
union all
select 125
union all
select 125
union all
select 150
union all
select 175
select a.rowid, a.data, count(*) as rank
from Load_Data a,
     (select distinct data from Load_Data) b
where a.data >= b.data
group by a.rowid, a.data
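Run against those five rows, it returns the same ranking as the Excel example at the top of the thread:

rowid  data  rank
1      100   1
2      125   2
3      125   2
4      150   3
5      175   4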
Thanks & Regards
Amit Gupta.
May 14, 2007 at 9:15 am
Now I'm curious what the performance difference would be between Amit's and John's solutions on a large table. At first glance, it appears that Amit's needs two indexes on one table to perform optimally, while John's would need one index on each of two tables.
My guess is that John's would outperform Amit's on a large table, but I could be wrong.
May 14, 2007 at 8:43 pm
No, Sir, David... you're not wrong... but not for the reasons you state...
First, let's build a small table of only 10,000 rows with only the data with which we are concerned... to give everyone the best shot at winning the proverbial foot race which is about to ensue, I've added a Clustered Primary Key to both columns...
--===== Use a database that is present on all installations where no harm
     -- will come if someone happens to have a test table named "JBMTest".
    USE TEMPDB

--===== If the test table exists, drop it
     IF OBJECT_ID('dbo.JBMTest') IS NOT NULL
        DROP TABLE dbo.JBMTest

--===== Create and populate the test table on the fly
 SELECT TOP 10000
        RowNum    = IDENTITY(INT,1,1),
        SomeValue = CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT)
   INTO dbo.JBMTest
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2

  ALTER TABLE dbo.JBMTest
        ALTER COLUMN SomeValue INT NOT NULL
GO
--===== A table is not properly formed unless a Primary Key has been assigned
  ALTER TABLE dbo.JBMTest
        ADD PRIMARY KEY CLUSTERED (RowNum,SomeValue)
GO
Ok, first, John Rowan's script (column names changed to match the test table) with some indication of performance (simple duration)... I did add an ORDER BY so we can see that his code works as expected...
--========================================================================================================
--      Test John's solution
--========================================================================================================
--===== Declare a duration timer variable and set nocount on to cleanup the display
DECLARE @StartTime DATETIME
    SET NOCOUNT ON

--===== Clear and drop cache buffers so everyone starts out the same
   DBCC DROPCLEANBUFFERS
   DBCC FREEPROCCACHE

--===== Start the timer and run John's solution
    SET @StartTime = GETDATE()
DECLARE @Rank TABLE (Rank INT IDENTITY(1,1), SomeValue INT)

 INSERT INTO @Rank (SomeValue)
 SELECT DISTINCT SomeValue
   FROM JBMTest
  ORDER BY SomeValue

 SELECT t.*, r.Rank
   FROM JBMTest t
   LEFT JOIN @Rank r
     ON t.SomeValue = r.SomeValue
  ORDER BY t.SomeValue

--===== Display John's run time
  PRINT 'John''s script: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' milliseconds.'
  PRINT REPLICATE('=',92)
GO
John's "divide and conquer" method runs inbetween 420 and 536 milliseconds on my box... it produces a nearly 100% cpu spike for about the same amount of time as it runs. Pretty good.
Now, let's try Amit's... as with John's code, I did add an ORDER BY so we can see that his code works as expected...
--========================================================================================================
--      Test Amit's solution
--========================================================================================================
--===== Declare a duration timer variable and set nocount on to cleanup the display
DECLARE @StartTime DATETIME
    SET NOCOUNT ON

--===== Clear and drop cache buffers so everyone starts out the same
   DBCC DROPCLEANBUFFERS
   DBCC FREEPROCCACHE

--===== Start the timer and run Amit's solution
    SET @StartTime = GETDATE()
 SELECT a.RowNum, a.SomeValue, COUNT(*) AS Rank
   FROM JBMTest a,
        (SELECT DISTINCT SomeValue FROM JBMTest) b
  WHERE a.SomeValue >= b.SomeValue
  GROUP BY a.RowNum, a.SomeValue
  ORDER BY a.SomeValue

--===== Display Amit's run time
  PRINT 'Amit''s script: ' + CAST(DATEDIFF(ms,@StartTime,GETDATE()) AS VARCHAR(10)) + ' milliseconds.'
  PRINT REPLICATE('=',92)
Yeaowch! About 58 seconds (58,796 milliseconds on my box) and it virtually locked up my single processor box at 100% during the whole run! Amit, please don't take what I'm getting ready to say personally... I've got to talk a bit about this very common performance mistake that good folks, like yourself, sometimes make.
Amit's code is actually worse than a cursor... it's hidden RBAR (pronounced "ree-bar", a "Modenism" for "Row By Agonizing Row") on steroids! Just because it doesn't have a loop doesn't mean it's truly set based! The key is this line in the WHERE clause...
WHERE a.SomeValue >= b.SomeValue
THAT, Ladies and Gentlemen, is known as a "triangular join". A full Cartesian join is sometimes called a "square join" because its product forms a square (like a times table would). A "triangular join" is a little more than half a Cartesian join for ">=" or a little less than half for ">". If tables "a" and "b" had the same number of rows, the formula for how many internal rows are spawned would be ((A*B)+A)/2. Plug in 10,000 for both "A" and "B"... you can do the math... it comes out to 50,005,000 rows!!!! Because table "b" is actually a bit smaller (DISTINCT), the number will be a bit less, but if you look at the estimated execution plan (on the arrow just to the left of the Loop Join) for Amit's code, you'll still find a HUGE number, something like 45,299,832!!! That's the number of rows SQL Server had to touch to come up with the answer, touching many of them more than once (kinda obvious here).
Not all triangular joins are as devastating as this one, BUT, if you see one in code, you really need to check it out, because the number of internal rows created increases by two orders of magnitude for every order of magnitude increase in row count. For example, if "a" and "b" each have 100,000 rows, the number of internal rows created is a whopping 5,000,050,000!!!! How long do you think THAT will take???
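If you want to watch the explosion for yourself, here's a quick sanity check you can run against the JBMTest table built above (just a sketch; keep the TOP small or it will crawl):

--===== Count the internal rows the triangular join spawns... for N mostly
     -- distinct rows this comes out to roughly ((N*N)+N)/2
 SELECT COUNT_BIG(*) AS InternalRows
   FROM (SELECT TOP 1000 SomeValue FROM dbo.JBMTest) a,
        (SELECT TOP 1000 SomeValue FROM dbo.JBMTest) b
  WHERE a.SomeValue >= b.SomeValue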
By the way... I did run John's code on a million rows and, including the time it took to display in the grid in QA, it only took 20 seconds... don't try that on a triangular join of a million rows because the number of internal rows generated will approach 500,000,500,000 rows... that's 500 BILLION rows... last time I tried something like that, it took 3 days before I finally killed it.
Camilo's code has a nearly identical problem... it's a triangular join. But, because it occurs in the SELECT list as a correlated sub-query, it's faster than Amit's at about 11 seconds. If you MUST do it all in a single query (not sure why people think they need to do that), then restrict the triangular join to the SELECT list like Camilo did. Just keep in mind that the divide and conquer method that John used runs about 22 times faster...
Like I said, Amit and, now, Camilo... please don't take this bit of diatribe on my part personally... but I had to let folks know that triangular joins to solve things like Ranking are some of the worst possible performance hits you could build into code.
Finally, I'm not going to contaminate my mind by writing a cursor demo, but previous experience with evaluating others' code indicates that a well-written cursor to solve this problem would operate at about 6,000 rows per second (on my box)... that would put the run time at about 1.6 seconds... still 3 times slower than the divide and conquer method John used... and it uses a whole lot more resources... and is much more complicated to write... and STILL can't be done in a single query. John's code could be wrapped in a table-valued function if you really need a single query.
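If you really do need it callable as a single query, a sketch of that wrapper might look like this (the function name is mine, and it hard-codes the JBMTest table; adjust to your own schema):

CREATE FUNCTION dbo.fnDenseRank ()
RETURNS @Result TABLE (RowNum INT, SomeValue INT, Rank INT)
AS
BEGIN
        --===== Build the distinct-value rank table, same as John's code
        DECLARE @Rank TABLE (Rank INT IDENTITY(1,1), SomeValue INT)

        INSERT INTO @Rank (SomeValue)
        SELECT DISTINCT SomeValue FROM dbo.JBMTest ORDER BY SomeValue

        --===== Join the ranks back to the source rows
        INSERT INTO @Result (RowNum, SomeValue, Rank)
        SELECT t.RowNum, t.SomeValue, r.Rank
          FROM dbo.JBMTest t
          LEFT JOIN @Rank r ON t.SomeValue = r.SomeValue

        RETURN
END
GO
--===== Usage: one set-based query
 SELECT * FROM dbo.fnDenseRank() ORDER BY SomeValue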
Again, be very wary of triangular joins no matter where they appear. Like I said, they're not all bad, but most of them are.
Oh yeah... for all those who think speed doesn't matter... look at the stats I just posted... would you rather lock up a machine for 500 milliseconds or for 58 seconds? Speed matters... it's an indication of properly written, effective code.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 15, 2007 at 8:51 pm
Very nice work, Jeff!
I didn't know that I stated any reasons (I'm guessing you are commenting on my references to indexes, but to me they're a wash, as they are one time things).
Actually, what led me to believe that John's would be faster was exactly what you found, combined with the "SELECT DISTINCT SomeValue FROM JBMTest".
May 15, 2007 at 11:57 pm
Heh... yeah, I was talking about the indexes. BTW (dunno if you took it this way), I didn't mean to make it sound like that whole thing was directed at you or anything like that... I've just seen lots of people get burned by triangular joins because they seem to work great with a small number of rows and then die quickly in the face of scalability. Figured now was as good a time as any to beat that drum.
And thanks for the compliment.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 16, 2007 at 3:57 pm
Great post Jeff. I wanted to run through a test after David posted the question of performance, but I didn't have the time to get one put together. I did not, however, think the performance difference would be that great. I have yet to run into any code making use of triangular joins. This is all very useful info. Thanks again - it's folks like you that make SSC an awesome resource.
May 16, 2007 at 4:16 pm
Great post Jeff... do you have a reference on triangular joins? I've not heard the term before, but I have run into the problem found in the code provided...
Cheers,
Ben
Ben Sullins
bensullins.com
Beer is my primary key...