June 6, 2016 at 9:59 pm
Comments posted to this topic are about the item Nasty Fast PERCENT_RANK
-- Itzik Ben-Gan 2001
June 6, 2016 at 11:56 pm
Excellent article Alan, very nice work indeed!
June 7, 2016 at 4:49 am
Thanks Eirikur!
-- Itzik Ben-Gan 2001
June 7, 2016 at 5:33 am
This is the kind of query that kills some servers: ranking. It's hard to make programmers understand that the performance of their code can have a big impact on server speed; they all seem to think that throwing more hardware at the problem will solve it...
Unfortunately it does not work that way. Thanks for the article!
June 7, 2016 at 7:12 am
Wonderful work and article Alan!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 7, 2016 at 10:19 am
Thank you Kevin and Mauricio.
-- Itzik Ben-Gan 2001
June 7, 2016 at 10:30 am
Thanks. I've been collecting performance counters on several servers. The easy calculations like average, extreme, and deviation tend to wash out performance details. I want to get a "real" picture via a set of several percentiles. Unfortunately, I don't think the %-tiles are easy to aggregate. For example, if I have averages and deviations for data collected every 5 minutes, I can use it to get the hourly values. From what I can tell, I have to recalculate the %-tile from scratch. I noticed that the PERCENT_RANK is costly. A faster method is appreciated.
RandyHelpdesk: Perhaps I'm not the only one that does not know what you are doing.
June 7, 2016 at 9:27 pm
Thanks Alan, it's always fun to learn faster ways to do things, even if it's not something I expect to be using.
While my SQL programming skills are generally sufficient to understand what's going on, there's one thing I'm not seeing the point of. Why are you multiplying the PERCENT_RANK() by 1. ? Or does "@pr = 1.*PERCENT_RANK()..." mean something else I'm not aware of? For that matter, in the line "PercentileRank = 1.0*(rk-1)/(nr-1)", again, multiplying something by 1 seems to me to be adding additional overhead. Does it cause a change in the execution plan or maybe the formatting of the final result?
Thanks,
Ken
June 7, 2016 at 11:42 pm
From the Article:
Bring on the one million row test harness!!!
Heh... I love it! It's like a war cry!
Very cool article, Alan. Well done.
Hats off also for honoring Dwain. He was a great man. His passing left a huge gap in this fine community.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 8, 2016 at 8:03 am
Usually multiplying by 1. or 1.0 is a shortcut to cast from integer to decimal based on datatype precedence. This is what is being done in the latter example: RANK() and COUNT() return integer values so if you divide by integers you will get an integer quotient; multiplying by 1.0 or 1. changes it to decimal. I am not sure why this is being done in the former example though. As stated, the PERCENT_RANK() window function returns a float which is higher in precedence than decimal. A float multiplied by a decimal will still be a float as a result:
Datatype precedence: https://msdn.microsoft.com/en-us/library/ms190309.aspx
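A minimal sketch of both cases, using SQL_VARIANT_PROPERTY to report the type each expression actually resolves to:

```sql
-- Integer division truncates; multiplying by 1. first promotes the
-- expression to numeric via datatype precedence. A float operand, being
-- higher in precedence than numeric, keeps the result a float.
SELECT
    3/4                                                       AS int_quotient,  -- 0 (integer division)
    1.*3/4                                                    AS dec_quotient,  -- 0.75...
    SQL_VARIANT_PROPERTY(1.*3/4, 'BaseType')                  AS promoted_type, -- numeric
    SQL_VARIANT_PROPERTY(1.*CAST(0.75 AS float), 'BaseType')  AS float_type;    -- float
```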
June 8, 2016 at 10:51 am
Ken Hiatt (6/7/2016)
Thanks Alan, it's always fun to learn faster ways to do things, even if it's not something I expect to be using. While my SQL programming skills are generally sufficient to understand what's going on, there's one thing I'm not seeing the point of. Why are you multiplying the PERCENT_RANK() by 1. ? Or does "@pr = 1.*PERCENT_RANK()..." mean something else I'm not aware of? For that matter, in the line "PercentileRank = 1.0*(rk-1)/(nr-1)", again, multiplying something by 1 seems to me to be adding additional overhead. Does it cause a change in the execution plan or maybe the formatting of the final result?
Thanks,
Ken
It affects the formatting of the final result. The "1.*" trick is something seen around SQLServerCentral to return a decimal value instead of a whole number when dividing whole numbers. Note the difference here:
WITH SampleValues(nr,rn,SomeValue) AS
(
SELECT TOP(5) COUNT(*) OVER (), RANK() OVER (ORDER BY SomeValue), SomeValue
FROM (VALUES (1),(2),(3),(4),(5))t(SomeValue)
)
SELECT
SomeValue,
Percent_Rank_int = (rn-1)/(nr-1), -- whole number
Percent_Rank_numeric = 1.*(rn-1)/(nr-1) -- numeric(32,11)
FROM SampleValues;
Results:
SomeValue Percent_Rank_int Percent_Rank_numeric
----------- -------------------- ---------------------------------------
1 0 0.00000000000
2 0 0.25000000000
3 0 0.50000000000
4 0 0.75000000000
5 1 1.00000000000
-- Itzik Ben-Gan 2001
June 8, 2016 at 10:53 am
Jeff Moden (6/7/2016)
From the Article:
Bring on the one million row test harness!!!
Heh... I love it! It's like a war cry!
Very cool article, Alan. Well done.
Hats off also for honoring Dwain. He was a great man. His passing left a huge gap in this fine community.
Thank you very much Jeff!
-- Itzik Ben-Gan 2001
June 8, 2016 at 1:23 pm
Thanks guys, just the info I was missing. Much appreciated.
June 9, 2016 at 6:22 am
Great article.
Can someone explain the following to me please
So now we know the window aggregate function is the culprit. Since we're not partitioning the data yet (but will in a moment), the row count (nr) only needs to be calculated once. Let's re-write the pre-2012 version like so:
SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk - 1) / (nr - 1) calculation:
PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
FROM ##Sales;
I was always under the impression that putting a query in the SELECT clause causes the query to be run for every row. According to what the author wrote, it is quite the opposite?
I would normally store the count in a variable first
SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)
Is my understanding incorrect?
June 9, 2016 at 7:18 pm
Mike Is Here (6/9/2016)
Great article.
Thank you very much.
So now we know the window aggregate function is the culprit. Since we're not partitioning the data yet (but will in a moment), the row count (nr) only needs to be calculated once. Let's re-write the pre-2012 version like so:
SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk - 1) / (nr - 1) calculation:
PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
FROM ##Sales;
I was always under the impression that putting a query in the SELECT clause causes the query to be run for every row. According to what the author wrote, it is quite the opposite?
In some cases that is true, but not here. When you use a correlated subquery, such as the ones demonstrated in this article, then yes. This is why correlated subqueries are often evil. In this case, however, there is no reference in the subquery to the outer query, and the optimizer is smart enough to know that the result only needs to be calculated once, which is what happens.
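To sketch the distinction, here are both flavors against the article's ##Sales table (column names as in the article; PersonTotal and GrandTotal are hypothetical aliases for illustration):

```sql
-- Correlated: the inner query references the outer row (s.SalesPersonID),
-- so it must be evaluated for each row of the outer query.
SELECT s.SalesPersonID, s.SaleAmt,
       PersonTotal = (SELECT SUM(s2.SaleAmt)
                      FROM ##Sales s2
                      WHERE s2.SalesPersonID = s.SalesPersonID)
FROM ##Sales s;

-- Uncorrelated: no reference to the outer query, so the optimizer can
-- compute the scalar once and reuse it for every row.
SELECT s.SalesPersonID, s.SaleAmt,
       GrandTotal = (SELECT SUM(s2.SaleAmt) FROM ##Sales s2)
FROM ##Sales s;
```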
I would normally store the count in a variable first
SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)
Is my understanding incorrect?
This will be equally efficient and is a valid alternative. I did it without a variable for people that would like to use this logic inside a view or inline table valued function (neither of which support variables).
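For example, here is a sketch of the inline table-valued-function version. The object names are hypothetical, and a permanent dbo.Sales table is assumed because functions cannot reference temp tables like ##Sales:

```sql
-- Sketch only: wraps the article's percentile-rank logic in an iTVF,
-- where a variable would not be allowed. dbo.Sales is hypothetical.
CREATE FUNCTION dbo.SalesPercentileRank()
RETURNS TABLE
AS RETURN
SELECT SalesPersonID, SaleYear, SaleAmt,
       PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)
                        /(SELECT COUNT(*)-1 FROM dbo.Sales)
FROM dbo.Sales;
```

It can then be queried like any table: SELECT * FROM dbo.SalesPercentileRank();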
Here are two performance tests against 1,000,000 rows. First, an I/O test.
SET NOCOUNT ON;
SET STATISTICS IO ON;
PRINT CHAR(10)+'Pre-aggregation using a variable:'+CHAR(10)+REPLICATE('-',50);
DECLARE @ROWS int;
SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)
SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk - 1) / (nr - 1) calculation:
PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/@ROWS
FROM ##Sales;
PRINT 'Method from the article:'+CHAR(10)+REPLICATE('-',50);
SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk - 1) / (nr - 1) calculation:
PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
FROM ##Sales;
SET STATISTICS IO OFF;
Results:
Pre-aggregation using a variable:
--------------------------------------------------
Table '##Sales'. Scan count 1, logical reads 2728, physical reads 1, read-ahead reads 48, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '##Sales'. Scan count 1, logical reads 2731, physical reads 1, read-ahead reads 38, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Method from the article:
--------------------------------------------------
Table '##Sales'. Scan count 6, logical reads 5489, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Notice that doing it both ways produces relatively the same number of reads. The technique I wrote about just produces the reads all in one batch.
And a 1M row performance test:
EXEC dbo.GenerateSalesData 1, 1000000, 100, 2000; --1M rows, performance test
PRINT CHAR(10)+'Pre-aggregation using a variable:'+CHAR(10)+REPLICATE('-',50);
GO
DECLARE @st datetime = getdate(), @pr numeric(32,11);
DECLARE @ROWS int;
SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)
SELECT @pr = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/@ROWS
FROM ##Sales;
PRINT DATEDIFF(MS,@st,getdate());
GO 5
PRINT 'Method from the article:'+CHAR(10)+REPLICATE('-',50);
GO
DECLARE @st datetime = getdate(), @pr numeric(32,11);
SELECT @pr = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
FROM ##Sales;
PRINT DATEDIFF(MS,@st,getdate());
GO 5
Results (in MS):
Pre-aggregation using a variable:
--------------------------------------------------
Beginning execution loop
560
566
564
573
550
Batch execution completed 5 times.
Method from the article:
--------------------------------------------------
Beginning execution loop
590
623
587
610
593
Batch execution completed 5 times.
Relatively the same.
-- Itzik Ben-Gan 2001