December 16, 2018 at 9:06 am
Question: is PERCENTILE_DISC for median calculation works for both odd and even number of records?
In SQL Server 2008 to calculate median we used this code based on modulo (@modulo was calculated before this)
case @modulo
when 1 then
(SELECT MAX([TimeInDepartmentMinutes]) FROM
(SELECT TOP 50 PERCENT [TimeInDepartmentMinutes] FROM dbo.vw_ED_Time_Calc
where ArrivalDateTime >= @StartDate and ArrivalDateTime <= @EndDate and Psyc ='yes' and HospitalName = @Hospital
ORDER BY [TimeInDepartmentMinutes]) AS Median)
else -- even
(SELECT
( ( SELECT MAX([TimeInDepartmentMinutes]) FROM
( SELECT TOP 50 PERCENT [TimeInDepartmentMinutes]
FROM dbo.vw_ED_Time_Calc where ArrivalDateTime >= @StartDate and ArrivalDateTime <= @EndDate and Psyc ='yes' and HospitalName = @Hospital
ORDER BY 1--==>
) AS m1 )
+ (
SELECT MIN([TimeInDepartmentMinutes]) FROM
( SELECT TOP 50 PERCENT [TimeInDepartmentMinutes] FROM dbo.vw_ED_Time_Calc
where ArrivalDateTime >= @StartDate and ArrivalDateTime <= @EndDate and Psyc ='yes' and HospitalName = @Hospital
ORDER BY 1 DESC --==>
) AS m2)
) / 2 AS Me)
end AS Median_LOS_All_Psyc_Patients
December 16, 2018 at 9:34 am
Run a test and find out. 😉 When you're done testing it, tell us what you found. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2018 at 10:44 am
Test result (table has 4 records: 1,2,3,4):
Select distinct PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY [recordvalue]) OVER (PARTITION BY null) FROM __testForMedian
Result: 2
Correct answer: 2.5 (in Excel =MEDIAN(1,2,3,4))
Hence
1. PERCENTILE_DISC(0.5) is the same as SELECT MAX([recordvalue]) FROM (SELECT TOP 50 PERCENT recordvalue FROM ... etc
2. PERCENTILE_DISC(0.5) can be used for ODD number of records. For EVEN nuber of records we need to apply two step strategy listed in my original code after ELSE ...
Thanks
December 16, 2018 at 11:57 am
So, in reference to your original question of...
Question: is PERCENTILE_DISC for median calculation works for both odd and even number of records?
...what is your conclusion?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2018 at 12:50 pm
PERCENTILE_DISC (0.5) does not provide the median for EVEN number of records. Only for ODD number of records. Actually this is the same as MAX (Select 50 percent ... - see my original code above
Thanks Jeff. You are always very helpful.
December 16, 2018 at 1:28 pm
valeryk2000 - Sunday, December 16, 2018 12:50 PMPERCENTILE_DISC (0.5) does not provide the median for EVEN number of records. Only for ODD number of records. Actually this is the same as MAX (Select 50 percent ... - see my original code above
Thanks Jeff. You are always very helpful.
Interesting. Thanks for taking the time to test. I'll have to look into that a bit more because BOL makes no such stipulation (or at least no one that I've seen). Might just be that BOL needs a correction.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2018 at 1:35 pm
Ah... I think I found it. From BOL...
Kind of makes sense. So, with Integer values of 1,2,3,4, it returns the number closest to the median. I don't have 2012 or above to play with just now but if you run the same test as you did but with a DECIMAL(9,5) for a datatype, what do you get for a result there?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2018 at 2:57 pm
Same result:
2.00000
December 16, 2018 at 3:59 pm
valeryk2000 - Sunday, December 16, 2018 2:57 PMSame result:
2.00000
Yowch. That's just not right. Thank you very much for testing. I should be able to log into my 2016 work box in a couple of minutes from now. I'll do the same test.
Actually, for Percentile_Disc, it's absolutely correct. See my explanation and test code below.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2018 at 4:47 pm
Heh.... lordy. I just reread BOL for PERCENTILE_DISC ("disc" means "discrete").. It INTENTIONALLY works the way your tests came out. Here's the description found at the official MS documentation at the following URL. So, no... BOL is NOT incorrect. It's spot on.
Computes a specific percentile for sorted values in an entire rowset or within distinct partitions of a rowset in SQL Server. For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P. For example, PERCENTILE_DISC (0.5) will compute the 50th percentile (that is, the median) of an expression. PERCENTILE_DISC calculates the percentile based on a discrete distribution of the column values; the result is equal to a specific value in the column.
It turns out that PERCENTILE_CONT ("cont" means "continuous") will calculate the values that you'd normally expect from a "median". The two functions are designed to return different things. PERCENTILE_DISC returns an existing value from the column that is the maximum value less than or equal to the median. PERCENTILE_CONT returns the calculated median (interpolated value of the two "middle" values for even numbers of rows.
With all of that in mind, please run the following test code to see what I'm talking about.
--===== Create the test table with INT and DECIMAL datatypes
DROP TABLE IF EXISTS #MyHead;
CREATE TABLE #MyHead
(
SomeInt INT
,SomeDecimal DECIMAL(9,5)
)
;
--===== Populate the both columns with identical values (1 thru 4) for an EVEN number of rows.
INSERT INTO #MyHead
(SomeInt,SomeDecimal)
SELECT SomeInt,SomeDecimal
FROM (VALUES (1,1),(2,2),(3,3),(4,4))v(SomeInt,SomeDecimal)
;
--===== Run the test for both PERCENTILE_* functions against both columns
SELECT DISTINCT
PrecentileContINT = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SomeInt) OVER ()
,PercentileDiscINT = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY SomeInt) OVER ()
,PrecentileContDEC = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SomeDecimal) OVER ()
,PercentileDiscDEC = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY SomeDecimal) OVER ()
FROM #MyHead
;
--===== Add a 5th row to the test table to make the number of rows and ODD number.
INSERT INTO #MyHead
(SomeInt,SomeDecimal)
SELECT SomeInt,SomeDecimal
FROM (VALUES (10,10))v(SomeInt,SomeDecimal)
;
--===== Run the exact same test again.
SELECT DISTINCT
PrecentileContINT = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SomeInt) OVER ()
,PercentileDiscINT = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY SomeInt) OVER ()
,PrecentileContDEC = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SomeDecimal) OVER ()
,PercentileDiscDEC = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY SomeDecimal) OVER ()
FROM #MyHead
;
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2018 at 5:27 pm
BTW... the way to get the code to display without all the extra blank lines in it is to copy'n'paste the code into NOTEPAD first. Then, copy'n'paste from NOTEPAD to in between the SQL Code tags that appear when you hit the "SQL Code" button at the bottom of the edit window.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2018 at 5:37 pm
One more BTW.... I was going to test performance of the PERCENTILE_CONT function against some other methods. With the disclaimer that I've not verified the testing (and it DOES need to be verified because of possible "Devils in the Data"), it would appear that PERCENTILE_CONT is the worst performing of many different methods for calculating the MEDIAN according to the following article...
https://sqlperformance.com/2012/08/t-sql-queries/median
And, on that note, I'll say that, yet again... "Change is inevitable... change for the better is not". 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2018 at 6:41 pm
Yes, it is PERCENTILE_CONT(0.5) that returns median. I'll play with it tomorrow.
Thanks
December 16, 2018 at 7:00 pm
Jeff Moden - Sunday, December 16, 2018 5:37 PMOne more BTW.... I was going to test performance of the PERCENTILE_CONT function against some other methods. With the disclaimer that I've not verified the testing (and it DOES need to be verified because of possible "Devils in the Data"), it would appear that PERCENTILE_CONT is the worst performing of many different methods for calculating the MEDIAN according to the following article...
https://sqlperformance.com/2012/08/t-sql-queries/medianAnd, on that note, I'll say that, yet again... "Change is inevitable... change for the better is not". 😉
One question what about NULL values
December 16, 2018 at 7:56 pm
valeryk2000 - Sunday, December 16, 2018 7:00 PMJeff Moden - Sunday, December 16, 2018 5:37 PMOne more BTW.... I was going to test performance of the PERCENTILE_CONT function against some other methods. With the disclaimer that I've not verified the testing (and it DOES need to be verified because of possible "Devils in the Data"), it would appear that PERCENTILE_CONT is the worst performing of many different methods for calculating the MEDIAN according to the following article...
https://sqlperformance.com/2012/08/t-sql-queries/medianAnd, on that note, I'll say that, yet again... "Change is inevitable... change for the better is not". 😉
One question what about NULL values
PERCENTILE_CONT and PERCENTILE_DISC both ignore null values.
But, to be honest, I wouldn't use either. I was testing some the code in the performance article and got to the last contender and had an extreme bout with deja vue... I'd seen the code before. I "dialed back" and found the code that Peter Larsson wrote that uses ORDER BY/OFFSET/FETCH that he was given credit for by the author. Peter wrote it way back in 2009 and it absolutely blows the doors off the PERCENTILE_* functions. PERCENTILE_CONT took 1 minutes an 23 seconds to find the median on 10 million rows. While that sounds pretty good, Peter's method finds it in sub-second times. No... that's not a misprint. Peter's method takes less than a second!!! His formulas are absolutely brilliant in their simplicity.
I tried to find Peter's original article on the subject... unfortunately, the link I had no longer works and the link I found doesn't either. However, it IS the last contender in the article that Aaron wrote and you can get the code from there.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 42 total)
You must be logged in to reply to this topic. Login to reply