December 16, 2018 at 8:09 pm
Jeff Moden - Sunday, December 16, 2018 7:56 PM
valeryk2000 - Sunday, December 16, 2018 7:00 PM
Jeff Moden - Sunday, December 16, 2018 5:37 PM
One more BTW.... I was going to test the performance of the PERCENTILE_CONT function against some other methods. With the disclaimer that I've not verified the testing (and it DOES need to be verified because of possible "Devils in the Data"), it would appear that PERCENTILE_CONT is the worst performing of many different methods for calculating the MEDIAN, according to the following article...
https://sqlperformance.com/2012/08/t-sql-queries/median
And, on that note, I'll say that, yet again... "Change is inevitable... change for the better is not". 😉
One question what about NULL values
PERCENTILE_CONT and PERCENTILE_DISC both ignore null values.
But, to be honest, I wouldn't use either. I was testing some of the code in the performance article, got to the last contender, and had an extreme bout of déjà vu... I'd seen the code before. I "dialed back" and found the code that Peter Larsson wrote using ORDER BY/OFFSET/FETCH, for which the author gave him credit. Peter wrote it way back in 2009 and it absolutely blows the doors off the PERCENTILE_* functions. PERCENTILE_CONT took 1 minute and 23 seconds to find the median on 10 million rows. While that sounds pretty good, Peter's method finds it in sub-second times. No... that's not a misprint. Peter's method takes less than a second!!! His formulas are absolutely brilliant in their simplicity.
I tried to find Peter's original article on the subject... unfortunately, the link I had no longer works and the link I found doesn't either. However, it IS the last contender in the article that Aaron wrote and you can get the code from there.
https://weblogs.sqlteam.com/peterl/2009/09/16/median-and-weighted-median/
Larsson on median - is it the one?
December 16, 2018 at 9:28 pm
valeryk2000 - Sunday, December 16, 2018 8:09 PM: "Larsson on median - is it the one?"
That's the one I was thinking of but, no, that's not the right article. I've gotten things a little mixed up. The file I had saved from one of his blogs carried a 2014 date, and I thought it referred back to his 2009 article, which you've cited. Of course, I also flubbed on when OFFSET first became available... that was in 2012, not 2008 as I originally mis-remembered. There was another article he wrote about using the offset method but, like I said, his method is covered as the last contender in Aaron's article.
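For anyone following along, here's a rough sketch of the OFFSET/FETCH median technique as I recall it from Aaron's article. The table and column names (dbo.SomeTable, recordvalue) are placeholders; adjust to your schema. The offset/fetch arithmetic lands on the middle row for an odd count and the middle pair for an even count.

```sql
-- Sketch only: count the non-NULL values, then seek to the middle row(s).
DECLARE @Cnt BIGINT =
        (SELECT COUNT(*) FROM dbo.SomeTable WHERE recordvalue IS NOT NULL);

SELECT Median = AVG(1.0 * recordvalue)   -- averages 1 row (odd count) or 2 rows (even count)
  FROM (SELECT recordvalue
          FROM dbo.SomeTable
         WHERE recordvalue IS NOT NULL
         ORDER BY recordvalue
        OFFSET (@Cnt - 1) / 2 ROWS              -- skip to just before the middle
         FETCH NEXT 1 + (1 - @Cnt % 2) ROWS ONLY -- 1 row if odd, 2 rows if even
       ) v;
```

With a supporting index on recordvalue, the ORDER BY becomes an index seek rather than a full sort, which is where the speed comes from.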
--Jeff Moden
Change is inevitable... Change for the better is not.
December 17, 2018 at 8:24 am
Naïve question. In the original line from my associate (she's testing our SS 2008 code on SS 2016, where we are migrating soon)
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY [recordvalue]) OVER (PARTITION BY null) from ....
the OVER has (PARTITION BY null) and in your codes - OVER (). Can you explain?
Thanks
December 17, 2018 at 8:46 am
valeryk2000 - Monday, December 17, 2018 8:24 AM: "...the OVER has (PARTITION BY null) and in your codes - OVER (). Can you explain?"
The function requires the use of OVER. Normally, OVER indicates how to partition the data into grouped subsets, each of which would end up having its own median calculated. We're not looking for such groupings in the current problem, so there's no need for a partition even though the OVER() clause is still required. Not all functions allow a required OVER clause to be empty, but these do. Since we want the whole table included in the median calculation, the required OVER() clause can simply be left empty rather than resorting to a trick like OVER (PARTITION BY null).
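To illustrate, both forms below compute the same whole-table median (table and column names are placeholders for your associate's actual schema):

```sql
-- Empty OVER(): one partition covering the whole result set.
SELECT DISTINCT
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY recordvalue)
           OVER () AS MedianValue
  FROM dbo.SomeTable;

-- PARTITION BY null: every row gets the same (null) partition key,
-- so this also yields a single whole-table partition. Same answer, noisier syntax.
SELECT DISTINCT
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY recordvalue)
           OVER (PARTITION BY NULL) AS MedianValue
  FROM dbo.SomeTable;
```

The DISTINCT is there only because these are window functions that repeat the result on every row.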
--Jeff Moden
Change is inevitable... Change for the better is not.
December 17, 2018 at 8:50 am
Shifting gears back to my previous thoughts, I'm not sure why we're even talking about the PERCENTILE_* functions. They're a hundred times slower than Peter Larsson's method of using OFFSET/FETCH. I know I wouldn't be using them, just like you'll not catch me using the FORMAT function until they fix the performance issues there.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 17, 2018 at 12:28 pm
Tried Offset/Fetch - works slower than Percentile ...
December 17, 2018 at 3:48 pm
valeryk2000 - Monday, December 17, 2018 12:28 PMTried Offset/Fetch - works slower than Percentile ...
Do you have the correct index in place to support Offset/Fetch?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2018 at 7:13 am
Sorry .... which index should it be?
BTW, I can send you a chunk of data in Excel ... if you have time and want to continue the research ...
December 18, 2018 at 7:26 am
valeryk2000 - Tuesday, December 18, 2018 7:13 AM: "Sorry .... which index should it be?"
The index has to be on the column(s) involved in finding the median. If you're trying to find the median for the whole table, then the index only needs to be on the value column that you're doing the median on. If you're trying to do medians on groups within the table, the index should key on the column(s) that form the group key and the value column that you're doing the median on.
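As a sketch (the table, column, and index names here are hypothetical placeholders):

```sql
-- Whole-table median: index just the value column.
CREATE INDEX IX_SomeTable_recordvalue
    ON dbo.SomeTable (recordvalue);

-- Per-group medians: key on the grouping column(s) first, then the value column,
-- so each group's values are stored in sorted order together.
CREATE INDEX IX_SomeTable_Group_recordvalue
    ON dbo.SomeTable (GroupID, recordvalue);
```

With such an index in place, the OFFSET/FETCH approach can seek straight to the middle row(s) instead of sorting the whole table every time.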
As for the spreadsheet, it would be nearly useless to me because I don't have the time to convert it or import it. I'd need the data in a readily consumable format. Please see the article at the first link in my signature line below under "Helpful Links" for how to do that and attach the "files" that you create using that method. Of course, no sample data posted to a forum should ever contain PII, proprietary, or otherwise sensitive data.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2018 at 10:43 am
Jeff, I created the script - it is 17 MB - 45,000 records, a table with 5 fields. Not sure we need to continue at this point - we have the answer to the original question. We will test the different codes. And I'm not sure you have the time or intention to drill further.
A?
December 18, 2018 at 11:00 am
Jeff
what does 'v' mean in this code?
... SELECT SomeInt,SomeDecimal
FROM (VALUES (1,1),(2,2),(3,3),(4,4)) v (SomeInt,SomeDecimal)
December 18, 2018 at 7:55 pm
valeryk2000 - Tuesday, December 18, 2018 10:43 AM: "Jeff I created the script - it is 17 Mb - 45 000 records... Not sure that we need to continue at this point..."
That's a bit large for this forum. Glad you didn't attach it as a spreadsheet.
If you post the table schema (the CREATE TABLE code for the table) and also identify which column you're trying to get the median for and whether it is a median for the whole table or you want more than one median based on some grouping, I can generate some fake data pretty quickly. It won't take long to do a comparison after that to see what's going on.
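For the record, the quick fake-data trick I'd use is a cross join against a couple of system views to mass-produce rows. A minimal sketch (table and column names are made up for illustration):

```sql
-- Generate 1,000,000 rows of pseudo-random integers in 0..49999.
-- The cross join of two system views supplies far more rows than TOP needs.
SELECT TOP (1000000)
       recordvalue = ABS(CHECKSUM(NEWID())) % 50000
  INTO dbo.TestMedian
  FROM sys.all_columns ac1
 CROSS JOIN sys.all_columns ac2;
```

NEWID() gives a different value per row, so CHECKSUM over it makes a cheap row-by-row random number generator.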
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2018 at 7:59 pm
valeryk2000 - Tuesday, December 18, 2018 11:00 AM: "what does 'v' means in this code? ... FROM (VALUES (1,1),(2,2),(3,3),(4,4)) v (SomeInt,SomeDecimal)"
It's the alias name for the derived table of data formed by the VALUES clause. If I were to use that alias explicitly in the code above, it would look like this...
SELECT v.SomeInt, v.SomeDecimal
FROM (VALUES (1,1),(2,2),(3,3),(4,4)) v (SomeInt,SomeDecimal)
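And as for the parentheses after the alias: a VALUES row constructor produces columns with no names of their own, so the parenthesized list is what assigns them. The names themselves are arbitrary; a variation with different (made-up) names works just the same:

```sql
-- The column list after the alias names the otherwise-nameless VALUES columns.
-- 'ID' and 'Amount' are arbitrary choices here.
SELECT v.ID, v.Amount
  FROM (VALUES (1,1),(2,2),(3,3),(4,4)) v (ID, Amount);
```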
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2018 at 8:09 pm
Jeff Moden - Tuesday, December 18, 2018 7:59 PM: "It's the alias-name for the derived table of data formed by the VALUES clause..."
Then why are the field names in parentheses after v?
December 18, 2018 at 8:12 pm
Jeff Moden - Tuesday, December 18, 2018 7:55 PM: "If you post the table schema (the CREATE TABLE code for the table) and also identify which column you're trying to get the median for... I can generate some fake data pretty quickly."
Ok. Tomorrow I'll send you a much smaller script with table schema and data