December 16, 2018 at 8:09 pm
Jeff Moden - Sunday, December 16, 2018 7:56 PM
valeryk2000 - Sunday, December 16, 2018 7:00 PM
Jeff Moden - Sunday, December 16, 2018 5:37 PM
One more BTW.... I was going to test the performance of the PERCENTILE_CONT function against some other methods. With the disclaimer that I've not verified the testing (and it DOES need to be verified because of possible "Devils in the Data"), it would appear that PERCENTILE_CONT is the worst performing of many different methods for calculating the MEDIAN, according to the following article...
https://sqlperformance.com/2012/08/t-sql-queries/median
And, on that note, I'll say that, yet again... "Change is inevitable... change for the better is not". 😉
One question what about NULL values
PERCENTILE_CONT and PERCENTILE_DISC both ignore null values.
But, to be honest, I wouldn't use either. I was testing some of the code in the performance article, got to the last contender, and had an extreme bout of déjà vu... I'd seen the code before. I "dialed back" and found the code that Peter Larsson wrote using ORDER BY/OFFSET/FETCH, for which the author gave him credit. Peter wrote it way back in 2009 and it absolutely blows the doors off the PERCENTILE_* functions. PERCENTILE_CONT took 1 minute and 23 seconds to find the median on 10 million rows. While that sounds pretty good, Peter's method finds it in sub-second times. No... that's not a misprint. Peter's method takes less than a second!!! His formulas are absolutely brilliant in their simplicity.
I tried to find Peter's original article on the subject... unfortunately, the link I had no longer works and the link I found doesn't either. However, it IS the last contender in the article that Aaron wrote and you can get the code from there.
https://weblogs.sqlteam.com/peterl/2009/09/16/median-and-weighted-median/
Larsson on median - is it the one?
December 16, 2018 at 9:28 pm
valeryk2000 - Sunday, December 16, 2018 8:09 PM: "Larsson on median - is it the one?"
That's the one I was thinking of but, no, that's not the right article. I've gotten things a little mixed up. The file I had saved from one of his blogs carried a 2014 date, and I thought it referred back to his 2009 article, which you've cited. Of course, I also flubbed on when OFFSET first became available... that was in 2012, not 2008 as I originally mis-remembered. There was another article he wrote about using the offset method but, like I said, his method is covered as the last contender in Aaron's article.
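For anyone following along, here's a rough sketch of the OFFSET/FETCH median technique as I recall it from Aaron's article. The table and column names (dbo.SomeTable, recordvalue) are placeholders; adjust to your schema. The offset/fetch arithmetic lands on the middle row for an odd count and the middle pair for an even count.

```sql
-- Sketch only: count the non-NULL values, then seek to the middle row(s).
DECLARE @Cnt BIGINT =
        (SELECT COUNT(*) FROM dbo.SomeTable WHERE recordvalue IS NOT NULL);

SELECT Median = AVG(1.0 * recordvalue)   -- averages 1 row (odd count) or 2 rows (even count)
  FROM (SELECT recordvalue
          FROM dbo.SomeTable
         WHERE recordvalue IS NOT NULL
         ORDER BY recordvalue
        OFFSET (@Cnt - 1) / 2 ROWS              -- skip to just before the middle
         FETCH NEXT 1 + (1 - @Cnt % 2) ROWS ONLY -- 1 row if odd, 2 rows if even
       ) v;
```

With a supporting index on recordvalue, the ORDER BY becomes an index seek rather than a full sort, which is where the speed comes from.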
--Jeff Moden
Change is inevitable... Change for the better is not.
December 17, 2018 at 8:24 am
Naïve question. In the original line from my associate (she's testing our SS 2008 code on SS 2016, where we are migrating soon)
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY [recordvalue]) OVER (PARTITION BY null) from ....
the OVER has (PARTITION BY null) and in your codes - OVER (). Can you explain?
Thanks
December 17, 2018 at 8:46 am
valeryk2000 - Monday, December 17, 2018 8:24 AM: "...the OVER has (PARTITION BY null) and in your codes - OVER (). Can you explain?"
The function requires the use of OVER. Normally, OVER indicates how to partition the data into grouped subsets, each of which would end up having its own median calculated. We're not looking for such groupings in the current problem, so there's no need for a partition even though the OVER() clause is still required. Not all functions allow a required OVER clause to be empty, but these do. Since we want the whole table included in the median calculation, the required OVER() clause can simply be left empty rather than resorting to a trick like OVER (PARTITION BY null).
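To illustrate, both forms below compute the same whole-table median (table and column names are placeholders for your associate's actual schema):

```sql
-- Empty OVER(): one partition covering the whole result set.
SELECT DISTINCT
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY recordvalue)
           OVER () AS MedianValue
  FROM dbo.SomeTable;

-- PARTITION BY null: every row gets the same (null) partition key,
-- so this also yields a single whole-table partition. Same answer, noisier syntax.
SELECT DISTINCT
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY recordvalue)
           OVER (PARTITION BY NULL) AS MedianValue
  FROM dbo.SomeTable;
```

The DISTINCT is there only because these are window functions that repeat the result on every row.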
--Jeff Moden
Change is inevitable... Change for the better is not.
December 17, 2018 at 8:50 am
Shifting gears back to my previous thoughts, I'm not sure why we're even talking about the PERCENTILE_* functions. They're a hundred times slower than Peter Larsson's method of using OFFSET/FETCH. I know I wouldn't be using them, just like you'll not catch me using the FORMAT function until they fix the performance issues there.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 17, 2018 at 12:28 pm
Tried Offset/Fetch - works slower than Percentile ...
December 17, 2018 at 3:48 pm
valeryk2000 - Monday, December 17, 2018 12:28 PMTried Offset/Fetch - works slower than Percentile ...
Do you have the correct index in place to support Offset/Fetch?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2018 at 7:13 am
Sorry .... which index should it be?
BTW, I can send you a chunk of data in Excel ... if you have time and want to continue the research ...
December 18, 2018 at 7:26 am
valeryk2000 - Tuesday, December 18, 2018 7:13 AM: "Sorry .... which index should it be?"
The index has to be on the column(s) involved in finding the median. If you're trying to find the median for the whole table, then the index only needs to be on the value column that you're doing the median on. If you're trying to do medians on groups within the table, the index should key on the column(s) that form the group key and the value column that you're doing the median on.
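As a sketch (the table, column, and index names here are hypothetical placeholders):

```sql
-- Whole-table median: index just the value column.
CREATE INDEX IX_SomeTable_recordvalue
    ON dbo.SomeTable (recordvalue);

-- Per-group medians: key on the grouping column(s) first, then the value column,
-- so each group's values are stored in sorted order together.
CREATE INDEX IX_SomeTable_Group_recordvalue
    ON dbo.SomeTable (GroupID, recordvalue);
```

With such an index in place, the OFFSET/FETCH approach can seek straight to the middle row(s) instead of sorting the whole table every time.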
As for the spreadsheet, it would be nearly useless to me because I don't have the time to convert it or import it. I'd need the data in a readily consumable format. Please see the article at the first link in my signature line below under "Helpful Links" for how to do that and attach the "files" that you create using that method. Of course, no sample data posted to a forum should ever contain PII, proprietary, or otherwise sensitive data.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2018 at 10:43 am
Jeff, I created the script - it is 17 MB - 45,000 records, a table with 5 fields. Not sure we need to continue at this point - we have the answer to the original question. We will test the different codes. And I'm not sure you have the time or intention to drill further.
A?
December 18, 2018 at 11:00 am
Jeff
what does 'v' mean in this code?
... SELECT SomeInt,SomeDecimal
FROM (VALUES (1,1),(2,2),(3,3),(4,4)) v (SomeInt,SomeDecimal)
December 18, 2018 at 7:55 pm
valeryk2000 - Tuesday, December 18, 2018 10:43 AM: "Jeff I created the script - it is 17 Mb - 45 000 records... Not sure that we need to continue at this point..."
That's a bit large for this forum. Glad you didn't attach it as a spreadsheet.
If you post the table schema (the CREATE TABLE code for the table) and also identify which column you're trying to get the median for and whether it is a median for the whole table or you want more than one median based on some grouping, I can generate some fake data pretty quickly. It won't take long to do a comparison after that to see what's going on.
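For the record, the quick fake-data trick I'd use is a cross join against a couple of system views to mass-produce rows. A minimal sketch (table and column names are made up for illustration):

```sql
-- Generate 1,000,000 rows of pseudo-random integers in 0..49999.
-- The cross join of two system views supplies far more rows than TOP needs.
SELECT TOP (1000000)
       recordvalue = ABS(CHECKSUM(NEWID())) % 50000
  INTO dbo.TestMedian
  FROM sys.all_columns ac1
 CROSS JOIN sys.all_columns ac2;
```

NEWID() gives a different value per row, so CHECKSUM over it makes a cheap row-by-row random number generator.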
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2018 at 7:59 pm
valeryk2000 - Tuesday, December 18, 2018 11:00 AM: "what does 'v' means in this code? ... FROM (VALUES (1,1),(2,2),(3,3),(4,4)) v (SomeInt,SomeDecimal)"
It's the alias name for the derived table of data formed by the VALUES clause. If I were to use that alias explicitly in the code above, it would look like this...
SELECT v.SomeInt, v.SomeDecimal
FROM (VALUES (1,1),(2,2),(3,3),(4,4)) v (SomeInt,SomeDecimal)
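And as for the parentheses after the alias: a VALUES row constructor produces columns with no names of their own, so the parenthesized list is what assigns them. The names themselves are arbitrary; a variation with different (made-up) names works just the same:

```sql
-- The column list after the alias names the otherwise-nameless VALUES columns.
-- 'ID' and 'Amount' are arbitrary choices here.
SELECT v.ID, v.Amount
  FROM (VALUES (1,1),(2,2),(3,3),(4,4)) v (ID, Amount);
```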
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2018 at 8:09 pm
Jeff Moden - Tuesday, December 18, 2018 7:59 PM: "It's the alias-name for the derived table of data formed by the VALUES clause..."
Then why are the field names in parentheses after v?
December 18, 2018 at 8:12 pm
Jeff Moden - Tuesday, December 18, 2018 7:55 PM: "If you post the table schema (the CREATE TABLE code for the table) and also identify which column you're trying to get the median for... I can generate some fake data pretty quickly."
Ok. Tomorrow I'll send you a much smaller script with table schema and data