November 20, 2015 at 10:13 am
Hello everyone,
I have exhausted all options by now! I have research the entire net to try to find some different ways to accomplish this task, and every solution out there doesn't work when my data set contains duplicates. I have a several sets of records where the cost is the same for different items and when trying to calculate the median, it almost always gives me the wrong one. All these solutions work fine when there are no duplicates. I have used the 4 different solutions below and none seems to give me what i need.
Can anyone see if there are any issues as to why they would be not working? The Database server where these queries reside is a 2008 R2, and I can't use the PERCENTILE function from 2012 unfortunately.
Dataset:
335.00
335.00
325.00
325.00
375.00
375.00
375.00
375.00
375.00
375.00
345.00
345.00
325.00
325.00
370.00
370.00
345.00
345.00
370.00
370.00
375.00
375.00
335.00
335.00
335.00
335.00
Solutions Used:
---------- MEDIAN OPTION #1 -------------
SELECT OPTN_NUM,
AVG(BALANCE) AS MedianValue
FROM
(
SELECT
OPTN_NUM,
BALANCE,
ROW_NUMBER() OVER (
PARTITION BY OPTN_NUM
ORDER BY BALANCE) AS RowNum,
COUNT(*) OVER (
PARTITION BY OPTN_NUM) AS RowCnt
FROM #TempTable
) x
WHERE
RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)
GROUP BY OPTN_NUM
---------- MEDIAN OPTION #2 -------------
WITH Counts AS
(
SELECT OPTN_NUM, c=COUNT(*)
FROM #TempTable
GROUP BY OPTN_NUM
)
SELECT a.OPTN_NUM, Median=AVG(0.+BALANCE)
FROM Counts a
CROSS APPLY
(
SELECT TOP(((a.c - 1) / 2) + (1 + (1 - a.c % 2)))
BALANCE, r=ROW_NUMBER() OVER (ORDER BY BALANCE)
FROM #TempTable b
WHERE a.OPTN_NUM = b.OPTN_NUM
ORDER BY BALANCE
) p
WHERE r BETWEEN ((a.c - 1) / 2) + 1 AND (((a.c - 1) / 2) + (1 + (1 - a.c % 2)))
GROUP BY a.OPTN_NUM;
---------- MEDIAN OPTION #3 -------------
SELECT
OPTN_NUM,
AVG(BALANCE)
FROM
(
SELECT
OPTN_NUM,
BALANCE,
RowDesc,
ROW_NUMBER() OVER (
PARTITION BY OPTN_NUM
ORDER BY BALANCE ASC, ID ASC, RowDesc DESC) AS RowAsc
FROM (
SELECT
OPTN_NUM,
BALANCE,
ID,
ROW_NUMBER() OVER (
PARTITION BY OPTN_NUM
ORDER BY BALANCE DESC, ID DESC) AS RowDesc
FROM #TempTable) SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY OPTN_NUM
ORDER BY OPTN_NUM;
------------ MEDIAN OPTION #4 ---------------
SELECT
OPTN_NUM,
AVG(BALANCE)
FROM
(
SELECT
OPTN_NUM,
BALANCE,
ROW_NUMBER() OVER (
PARTITION BY OPTN_NUM
ORDER BY BALANCE ASC, Id ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY OPTN_NUM
ORDER BY BALANCE DESC, Id DESC) AS RowDesc
FROM #TempTable SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY OPTN_NUM;
ORDER BY CustomerId;
OPTN_NUM is my grouping, ID is a unique identifier that i add when populating the temp table and the balance is the actual cost of the product. For the example above I always get 370 as the median from every SQL query option I listed above.
Any help is very much appreciated.
Thanks!
R.
November 20, 2015 at 11:03 am
Can you post DDL and sample data that we can actually use? Your solutions use 3 columns and you only showed one.
November 20, 2015 at 11:08 am
I feel like there must be more at work than what we've seen so far, because a quick test with your sample data returns 345.00 for just the first method (I haven't tested the others).
SELECT *
INTO #TempTable
FROM ( VALUES ( 1, 335.00), ( 1, 335.00), ( 1, 325.00), ( 1, 325.00),
( 1, 375.00), ( 1, 375.00), ( 1, 375.00), ( 1, 375.00), ( 1, 375.00),
( 1, 375.00), ( 1, 345.00), ( 1, 345.00), ( 1, 325.00), ( 1, 325.00),
( 1, 370.00), ( 1, 370.00), ( 1, 345.00), ( 1, 345.00), ( 1, 370.00),
( 1, 370.00), ( 1, 375.00), ( 1, 375.00), ( 1, 335.00), ( 1, 335.00),
( 1, 335.00), ( 1, 335.00) ) Temp ( OPT_NUM, BALANCE );
SELECT OPT_NUM ,
AVG(BALANCE) AS MedianValue
FROM ( SELECT BALANCE ,
OPT_NUM ,
ROW_NUMBER() OVER ( PARTITION BY OPT_NUM ORDER BY BALANCE ) AS RowNum ,
COUNT(*) OVER ( PARTITION BY OPT_NUM ) AS RowCnt
FROM #TempTable
) x
WHERE RowNum IN ( ( RowCnt + 1 ) / 2, ( RowCnt + 2 ) / 2 )
GROUP BY OPT_NUM;
DROP TABLE #TempTable;
As Luis mentioned, giving us the exact data you're working with will give us a better chance of figuring this out.
Cheers!
November 20, 2015 at 11:47 am
First of all, tHanks for taking the time to look at this. It has been driving me nuts as to why this is not working. I actually just like your sample have created a smaller set just using those 2 fieds with a unique Identifier and it works, however my original dataset is larger with other fields that are not relevant as I am using the ones I indicated in the query to find the Median.
First I have a temptable with other values as noted below:
CREATE TABLE #TempTable
(ID INT IDENTITY PRIMARY KEY,
OPTN_NUM INT,
OPTN_NUM_DESC_DSPLY VARCHAR(100),
CMNTY_DESC_NUM_DSPLY VARCHAR(100),
PLAN_DESC_ID_DSPLY VARCHAR(100),
ACT_STATU_DSPLY VARCHAR(100),
VENDOR_VNAME_NUM VARCHAR(100),
ACCT_CATEGORY INT,
BALANCE MONEY,
ACT_phjob_type INT,
BUDGET_NBR INT,
MEDIAN_VALUE MONEY,
)
Then i do my queries to populate all other fields, except the Median, which its set to NULL. Then i run my query to find the query for each group and then do a MERGE into the table as follows:
MERGE INTO #TempTable Temp
USING (
SELECT
OPTN_NUM,
AVG(BALANCE) AS MedianValue
FROM
(
SELECT
OPTN_NUM,
BALANCE,
ROW_NUMBER() OVER (
PARTITION BY OPTN_NUM
ORDER BY BALANCE ASC, Id ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY OPTN_NUM
ORDER BY BALANCE DESC, Id DESC) AS RowDesc
FROM #TempTable SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY OPTN_NUM
) Median
ON Temp.OPTN_NUM = Median.OPTN_NUM
WHEN MATCHED THEN
UPDATE
SET MEDIAN_VALUE = Median.MedianValue;
I also put just the MEDIAN query prior to the MERGE to make sure it's giving me the correct values, but nothing! Its still shows 370 for the dataset I gave you above.
I am lost as to why it's doing that. Here is some origsample data:
IDOPTN_NUMOPTN_NUM_DESC_DSPLYCMNTY_DESC_NUM_DSPLYPLAN_DESC_ID_DSPLYACT_STATU_DSPLYVENDOR_VNAME_NUMACCT_CATEGORYBALANCEACT_phjob_typeBUDGET_NBRMEDIAN_VALUE
17182071820 - Bath Ctop - Master Bath - Corian - Level 1Paseo Pte1 Maravilla 3500s - 304235146 - Nuvola - 35146A-ActiveL & M LAMINATES & MARBLE, INC. - 15000122722314375.0081NULL
Sorry, Idont know how to format better.
THanks a million.
November 20, 2015 at 1:59 pm
Gentlemen,
I have found the issue with my query. I was making the mistake of calculating the Median on a NON UNIQUE dataset which I later displayed with DISTINCT, showing me the incorrect value! It was a dumb mistake, but having your input as far as supplying a data sample made me see this, so i thank you so much for your help!
Really much appreciated as always!
R.
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply