June 6, 2016 at 9:59 pm
Comments posted to this topic are about the item Nasty Fast PERCENT_RANK
-- Itzik Ben-Gan 2001
June 6, 2016 at 11:56 pm
Excellent article Alan, very nice work indeed!
June 7, 2016 at 4:49 am
Thanks Eirikur!
-- Itzik Ben-Gan 2001
June 7, 2016 at 5:33 am
This is the kind of query that kills some servers: ranking. It's hard to make programmers understand that the performance of their code can have a big impact on server speed; they all seem to think that throwing more hardware at the problem will solve it...
Unfortunately it does not work that way. Thanks for the article!
June 7, 2016 at 7:12 am
Wonderful work and article Alan!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 7, 2016 at 10:19 am
Thank you Kevin and Mauricio.
-- Itzik Ben-Gan 2001
June 7, 2016 at 10:30 am
Thanks. I've been collecting performance counters on several servers. The easy calculations like average, extreme, and deviation tend to wash out performance details. I want to get a "real" picture via a set of several percentiles. Unfortunately, I don't think the %-tiles are easy to aggregate. For example, if I have averages and deviations for data collected every 5 minutes, I can use it to get the hourly values. From what I can tell, I have to recalculate the %-tile from scratch. I noticed that the PERCENT_RANK is costly. A faster method is appreciated.
RandyHelpdesk: Perhaps I'm not the only one that does not know what you are doing.
June 7, 2016 at 9:27 pm
Thanks Alan, it's always fun to learn faster ways to do things, even if it's not something I expect to be using.
While my SQL programming skills are generally sufficient to understand what's going on, there's one thing I'm not seeing the point of. Why are you multiplying the PERCENT_RANK() by 1. ? Or does "@pr = 1.*PERCENT_RANK()..." mean something else I'm not aware of? For that matter, in the line "PercentileRank = 1.0*(rk-1)/(nr-1)", again, multiplying something by 1 seems to me to be adding additional overhead. Does it cause a change in the execution plan or maybe the formatting of the final result?
Thanks,
Ken
June 7, 2016 at 11:42 pm
From the Article:
Bring on the one million row test harness!!!
Heh... I love it! It's like a war cry!
Very cool article, Alan. Well done.
Hats off also for honoring Dwain. He was a great man. His passing left a huge gap in this fine community.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 8, 2016 at 8:03 am
Usually multiplying by 1. or 1.0 is a shortcut to cast from integer to decimal based on datatype precedence. This is what is being done in the latter example: RANK() and COUNT() return integer values so if you divide by integers you will get an integer quotient; multiplying by 1.0 or 1. changes it to decimal. I am not sure why this is being done in the former example though. As stated, the PERCENT_RANK() window function returns a float which is higher in precedence than decimal. A float multiplied by a decimal will still be a float as a result:
Datatype precedence: https://msdn.microsoft.com/en-us/library/ms190309.aspx
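A minimal sketch of both cases, using SQL_VARIANT_PROPERTY to report the type each expression actually resolves to:

```sql
-- Integer division truncates; multiplying by 1. first promotes the
-- expression to numeric via datatype precedence. A float operand, being
-- higher in precedence than numeric, keeps the result a float.
SELECT
    3/4                                                       AS int_quotient,  -- 0 (integer division)
    1.*3/4                                                    AS dec_quotient,  -- 0.75...
    SQL_VARIANT_PROPERTY(1.*3/4, 'BaseType')                  AS promoted_type, -- numeric
    SQL_VARIANT_PROPERTY(1.*CAST(0.75 AS float), 'BaseType')  AS float_type;    -- float
```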
June 8, 2016 at 10:51 am
Ken Hiatt (6/7/2016)
Thanks Alan, it's always fun to learn faster ways to do things, even if it's not something I expect to be using. While my SQL programming skills are generally sufficient to understand what's going on, there's one thing I'm not seeing the point of. Why are you multiplying the PERCENT_RANK() by 1. ? Or does "@pr = 1.*PERCENT_RANK()..." mean something else I'm not aware of? For that matter, in the line "PercentileRank = 1.0*(rk-1)/(nr-1)", again, multiplying something by 1 seems to me to be adding additional overhead. Does it cause a change in the execution plan or maybe the formatting of the final result?
Thanks,
Ken
It affects the formatting of the final result. The "1.*" trick is something seen around SQLServerCentral to return a decimal value instead of a whole number when dividing whole numbers. Note the difference here:
WITH SampleValues(nr,rn,SomeValue) AS
(
SELECT TOP(5) COUNT(*) OVER (), RANK() OVER (ORDER BY SomeValue), SomeValue
FROM (VALUES (1),(2),(3),(4),(5))t(SomeValue)
)
SELECT
SomeValue,
Percent_Rank_int = (rn-1)/(nr-1), -- whole number
Percent_Rank_numeric = 1.*(rn-1)/(nr-1) -- numeric(32,11)
FROM SampleValues;
Results:
SomeValue Percent_Rank_int Percent_Rank_numeric
----------- -------------------- ---------------------------------------
1 0 0.00000000000
2 0 0.25000000000
3 0 0.50000000000
4 0 0.75000000000
5 1 1.00000000000
-- Itzik Ben-Gan 2001
June 8, 2016 at 10:53 am
Jeff Moden (6/7/2016)
From the Article:
Bring on the one million row test harness!!!
Heh... I love it! It's like a war cry!
Very cool article, Alan. Well done.
Hats off also for honoring Dwain. He was a great man. His passing left a huge gap in this fine community.
Thank you very much Jeff!
-- Itzik Ben-Gan 2001
June 8, 2016 at 1:23 pm
Thanks guys, just the info I was missing. Much appreciated.
June 9, 2016 at 6:22 am
Great article.
Can someone explain the following to me please
So now we know the window aggregate function is the culprit. Since we're not partitioning the data yet (but will in a moment), the row count (nr) only needs to be calculated once. Let's re-write the pre-2012 version like so:
SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk - 1) / (nr - 1) calculation:
PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
FROM ##Sales;
I was always under the impression that putting a query in the SELECT clause causes the query to be run for every row. According to what the author wrote, it is quite the opposite?
I would normally store the count in a variable first
SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)
Is my understanding incorrect?
June 9, 2016 at 7:18 pm
Mike Is Here (6/9/2016)
Great article.
Thank you very much.
So now we know the window aggregate function is the culprit. Since we're not partitioning the data yet (but will in a moment), the row count (nr) only needs to be calculated once. Let's re-write the pre-2012 version like so:
SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk - 1) / (nr - 1) calculation:
PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
FROM ##Sales;
I was always under the impression that putting a query in the SELECT clause causes the query to be run for every row. According to what the author wrote, it is quite the opposite?
In some cases that is true, but not here. When you use a correlated subquery, such as the ones demonstrated in this article, then yes. This is why correlated subqueries are often evil. In this case, however, there is no reference in the subquery to the outer query, and the optimizer is smart enough to know that the result only needs to be calculated once, which is what happens.
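To sketch the distinction, here are both flavors against the article's ##Sales table (column names as in the article; PersonTotal and GrandTotal are hypothetical aliases for illustration):

```sql
-- Correlated: the inner query references the outer row (s.SalesPersonID),
-- so it must be evaluated for each row of the outer query.
SELECT s.SalesPersonID, s.SaleAmt,
       PersonTotal = (SELECT SUM(s2.SaleAmt)
                      FROM ##Sales s2
                      WHERE s2.SalesPersonID = s.SalesPersonID)
FROM ##Sales s;

-- Uncorrelated: no reference to the outer query, so the optimizer can
-- compute the scalar once and reuse it for every row.
SELECT s.SalesPersonID, s.SaleAmt,
       GrandTotal = (SELECT SUM(s2.SaleAmt) FROM ##Sales s2)
FROM ##Sales s;
```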
I would normally store the count in a variable first
SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)
Is my understanding incorrect?
This will be equally efficient and is a valid alternative. I did it without a variable for people that would like to use this logic inside a view or inline table valued function (neither of which support variables).
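For example, here is a sketch of the inline table-valued-function version. The object names are hypothetical, and a permanent dbo.Sales table is assumed because functions cannot reference temp tables like ##Sales:

```sql
-- Sketch only: wraps the article's percentile-rank logic in an iTVF,
-- where a variable would not be allowed. dbo.Sales is hypothetical.
CREATE FUNCTION dbo.SalesPercentileRank()
RETURNS TABLE
AS RETURN
SELECT SalesPersonID, SaleYear, SaleAmt,
       PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)
                        /(SELECT COUNT(*)-1 FROM dbo.Sales)
FROM dbo.Sales;
```

It can then be queried like any table: SELECT * FROM dbo.SalesPercentileRank();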
Here are two performance tests against 1,000,000 rows. First, an I/O test.
SET NOCOUNT ON;
SET STATISTICS IO ON;
PRINT CHAR(10)+'Pre-aggregation using a variable:'+CHAR(10)+REPLICATE('-',50);
DECLARE @ROWS int;
SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)
SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk - 1) / (nr - 1) calculation:
PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/@ROWS
FROM ##Sales;
PRINT 'Method from the article:'+CHAR(10)+REPLICATE('-',50);
SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk - 1) / (nr - 1) calculation:
PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
FROM ##Sales;
SET STATISTICS IO OFF;
Results:
Pre-aggregation using a variable:
--------------------------------------------------
Table '##Sales'. Scan count 1, logical reads 2728, physical reads 1, read-ahead reads 48, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '##Sales'. Scan count 1, logical reads 2731, physical reads 1, read-ahead reads 38, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Method from the article:
--------------------------------------------------
Table '##Sales'. Scan count 6, logical reads 5489, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Notice that doing it both ways produces relatively the same number of reads. The technique I wrote about just produces the reads all in one batch.
And a 1M row performance test:
EXEC dbo.GenerateSalesData 1, 1000000, 100, 2000; --1M rows, performance test
PRINT CHAR(10)+'Pre-aggregation using a variable:'+CHAR(10)+REPLICATE('-',50);
GO
DECLARE @st datetime = getdate(), @pr numeric(32,11);
DECLARE @ROWS int;
SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)
SELECT @pr = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/@ROWS
FROM ##Sales;
PRINT DATEDIFF(MS,@st,getdate());
GO 5
PRINT 'Method from the article:'+CHAR(10)+REPLICATE('-',50);
GO
DECLARE @st datetime = getdate(), @pr numeric(32,11);
SELECT @pr = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
FROM ##Sales;
PRINT DATEDIFF(MS,@st,getdate());
GO 5
Results (in MS):
Pre-aggregation using a variable:
--------------------------------------------------
Beginning execution loop
560
566
564
573
550
Batch execution completed 5 times.
Method from the article:
--------------------------------------------------
Beginning execution loop
590
623
587
610
593
Batch execution completed 5 times.
Relatively the same.
-- Itzik Ben-Gan 2001