April 9, 2009 at 3:12 am
hi,
i wanted to calculate median value.
have a table fact_abc which contains cust_key, fscore
have a dimension dim_cust, cust_key and indicator
i wrote a query like follows
select cust_key,min(fscore),max(fscore),avg(fscore) from fact_abc a, dim_cust b where a.cust_key=b.cust_key and b.indicator=1
group by cust_key
the above query is woring fine.
i want to calculate median(fscore) along with the above selected fields.
please let me know how can i achieve it as we dont have median function in mssql. one more thing is my fact contains 15 million and dim contains 500000 records
April 9, 2009 at 3:44 am
Hi
Maybe my English is too bad, but could you explain me the difference between "median" and "average"?
Edited: Or I just don't get it... 😉
Greets
Flo
April 9, 2009 at 5:03 am
WITH CTE
AS
(
SELECT a.cust_key,
a.fscore,
ROW_NUMBER() OVER(PARTITION BY a.cust_key ORDER BY a.fscore ASC) AS rn,
COUNT(*) OVER(PARTITION BY a.cust_key) AS cn
FROM fact_abc a
INNER JOIN dim_cust b ON a.cust_key=b.cust_key AND b.indicator=1
)
SELECT cust_key,
MIN(fscore) AS [Minimum],
MAX(fscore) AS [Maximum],
AVG(fscore) AS [Average],
AVG(CASE WHEN rn BETWEEN cn-rn AND cn-rn+2 THEN fscore END) AS Median
FROM CTE
GROUP BY cust_key
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537April 9, 2009 at 6:08 am
I would make a couple of minor changes to Mark's query:
1) Use COUNT(a.fscore) rather than COUNT(*) when calculating cn in order to avoid counting rows with NULL fscore values.
2) Use following expression to calculate the median:
AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN fscore END)
When the number of non-NULL fscore values is odd, this expression will return the value of the single central value.
When the number of non-NULL fscore values is even, this expression will return the average (i.e. mean) of the 2 central values.
WITH CTE
AS
(
SELECT a.cust_key,
a.fscore,
ROW_NUMBER() OVER (PARTITION BY a.cust_key ORDER BY a.fscore ASC) AS rn,
COUNT(a.fscore) OVER(PARTITION BY a.cust_key) AS cn
FROM fact_abc a
INNER JOIN dim_cust b ON (a.cust_key = b.cust_key)
WHERE (b.indicator = 1)
)
SELECT cust_key,
MIN(fscore) AS [Minimum],
MAX(fscore) AS [Maximum],
AVG(fscore) AS [Average],
AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN fscore END) AS Median
FROM CTE
GROUP BY cust_key
April 9, 2009 at 6:16 am
andrewd.smith (4/9/2009)
I would make a couple of minor changes to Mark's query:1) Use COUNT(a.fscore) rather than COUNT(*) when calculating cn in order to avoid counting rows with NULL fscore values.
2) Use following expression to calculate the median:
AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN fscore END)
When the number of non-NULL fscore values is odd, this expression will return the value of the single central value.
When the number of non-NULL fscore values is even, this expression will return the average (i.e. mean) of the 2 central values.
WITH CTE
AS
(
SELECT a.cust_key,
a.fscore,
ROW_NUMBER() OVER (PARTITION BY a.cust_key ORDER BY a.fscore ASC) AS rn,
COUNT(a.fscore) OVER(PARTITION BY a.cust_key) AS cn
FROM fact_abc a
INNER JOIN dim_cust b ON (a.cust_key = b.cust_key)
WHERE (b.indicator = 1)
)
SELECT cust_key,
MIN(fscore) AS [Minimum],
MAX(fscore) AS [Maximum],
AVG(fscore) AS [Average],
AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN fscore END) AS Median
FROM CTE
GROUP BY cust_key
I believe this is the "financial median", to get the "statistical median" do this
AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN fscore END) AS FinancialMedian,
MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN fscore END) AS StatisticalMedian
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537April 9, 2009 at 12:27 pm
Flo: An average is reached by summing the values and dividing by the number of values. A median is the value at the middle of an ordered set of values. A clumsy illustration follows:
declare @temp table (value numeric (5,2) not null) ;
--
with Tally (N) AS
(SELECT TOP 5 ROW_NUMBER() over (order by sc1.id)
FROM Master.dbo.SysColumns sc1 )
insert into @temp
select N from tally ;
--
insert into @temp
select top 2 value from @temp ;
--
select * from @temp order by value ;
--
with cteSummary as
(select sum(value) as totalValue,count(value) as rowsCount,avg(value) as avgValue from @temp)
,cteTemp as
(select value, ROW_NUMBER() over (order by value) as rowID from @temp)
select *
,(ceiling(rowsCount/2.0)) as medianPosition
,(select value from cteTemp where rowID = ceiling(rowsCount/2.0)) as medianValue
from cteSummary
Obviously it's easy when there are an odd number of values. There are different policies which can be applied when there are an even number of values.
http://en.wikipedia.org/wiki/Median
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 9, 2009 at 12:33 pm
Hi Bob!
Thanks a lot for the English education! 🙂
Greets
Flo
April 9, 2009 at 12:36 pm
Always happy to help, Flo. 🙂
By the way, Mark is right. NULLS shouldn't count. My code above would give false results if null values were allowed into @temp.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply