Query to find Median isn't working with duplicates?

  • Hello everyone,

    I have exhausted all options by now! I have research the entire net to try to find some different ways to accomplish this task, and every solution out there doesn't work when my data set contains duplicates. I have a several sets of records where the cost is the same for different items and when trying to calculate the median, it almost always gives me the wrong one. All these solutions work fine when there are no duplicates. I have used the 4 different solutions below and none seems to give me what i need.

    Can anyone see if there are any issues as to why they would be not working? The Database server where these queries reside is a 2008 R2, and I can't use the PERCENTILE function from 2012 unfortunately.

    Dataset:

    335.00

    335.00

    325.00

    325.00

    375.00

    375.00

    375.00

    375.00

    375.00

    375.00

    345.00

    345.00

    325.00

    325.00

    370.00

    370.00

    345.00

    345.00

    370.00

    370.00

    375.00

    375.00

    335.00

    335.00

    335.00

    335.00

    Solutions Used:

    ---------- MEDIAN OPTION #1 -------------

    SELECT OPTN_NUM,

    AVG(BALANCE) AS MedianValue

    FROM

    (

    SELECT

    OPTN_NUM,

    BALANCE,

    ROW_NUMBER() OVER (

    PARTITION BY OPTN_NUM

    ORDER BY BALANCE) AS RowNum,

    COUNT(*) OVER (

    PARTITION BY OPTN_NUM) AS RowCnt

    FROM #TempTable

    ) x

    WHERE

    RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)

    GROUP BY OPTN_NUM

    ---------- MEDIAN OPTION #2 -------------

    WITH Counts AS

    (

    SELECT OPTN_NUM, c=COUNT(*)

    FROM #TempTable

    GROUP BY OPTN_NUM

    )

    SELECT a.OPTN_NUM, Median=AVG(0.+BALANCE)

    FROM Counts a

    CROSS APPLY

    (

    SELECT TOP(((a.c - 1) / 2) + (1 + (1 - a.c % 2)))

    BALANCE, r=ROW_NUMBER() OVER (ORDER BY BALANCE)

    FROM #TempTable b

    WHERE a.OPTN_NUM = b.OPTN_NUM

    ORDER BY BALANCE

    ) p

    WHERE r BETWEEN ((a.c - 1) / 2) + 1 AND (((a.c - 1) / 2) + (1 + (1 - a.c % 2)))

    GROUP BY a.OPTN_NUM;

    ---------- MEDIAN OPTION #3 -------------

    SELECT

    OPTN_NUM,

    AVG(BALANCE)

    FROM

    (

    SELECT

    OPTN_NUM,

    BALANCE,

    RowDesc,

    ROW_NUMBER() OVER (

    PARTITION BY OPTN_NUM

    ORDER BY BALANCE ASC, ID ASC, RowDesc DESC) AS RowAsc

    FROM (

    SELECT

    OPTN_NUM,

    BALANCE,

    ID,

    ROW_NUMBER() OVER (

    PARTITION BY OPTN_NUM

    ORDER BY BALANCE DESC, ID DESC) AS RowDesc

    FROM #TempTable) SOH

    ) x

    WHERE

    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

    GROUP BY OPTN_NUM

    ORDER BY OPTN_NUM;

    ------------ MEDIAN OPTION #4 ---------------

    SELECT

    OPTN_NUM,

    AVG(BALANCE)

    FROM

    (

    SELECT

    OPTN_NUM,

    BALANCE,

    ROW_NUMBER() OVER (

    PARTITION BY OPTN_NUM

    ORDER BY BALANCE ASC, Id ASC) AS RowAsc,

    ROW_NUMBER() OVER (

    PARTITION BY OPTN_NUM

    ORDER BY BALANCE DESC, Id DESC) AS RowDesc

    FROM #TempTable SOH

    ) x

    WHERE

    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

    GROUP BY OPTN_NUM;

    ORDER BY CustomerId;

    OPTN_NUM is my grouping, ID is a unique identifier that i add when populating the temp table and the balance is the actual cost of the product. For the example above I always get 370 as the median from every SQL query option I listed above.

    Any help is very much appreciated.

    Thanks!

    R.

  • Can you post DDL and sample data that we can actually use? Your solutions use 3 columns and you only showed one.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I feel like there must be more at work than what we've seen so far, because a quick test with your sample data returns 345.00 for just the first method (I haven't tested the others).

    SELECT *

    INTO #TempTable

    FROM ( VALUES ( 1, 335.00), ( 1, 335.00), ( 1, 325.00), ( 1, 325.00),

    ( 1, 375.00), ( 1, 375.00), ( 1, 375.00), ( 1, 375.00), ( 1, 375.00),

    ( 1, 375.00), ( 1, 345.00), ( 1, 345.00), ( 1, 325.00), ( 1, 325.00),

    ( 1, 370.00), ( 1, 370.00), ( 1, 345.00), ( 1, 345.00), ( 1, 370.00),

    ( 1, 370.00), ( 1, 375.00), ( 1, 375.00), ( 1, 335.00), ( 1, 335.00),

    ( 1, 335.00), ( 1, 335.00) ) Temp ( OPT_NUM, BALANCE );

    SELECT OPT_NUM ,

    AVG(BALANCE) AS MedianValue

    FROM ( SELECT BALANCE ,

    OPT_NUM ,

    ROW_NUMBER() OVER ( PARTITION BY OPT_NUM ORDER BY BALANCE ) AS RowNum ,

    COUNT(*) OVER ( PARTITION BY OPT_NUM ) AS RowCnt

    FROM #TempTable

    ) x

    WHERE RowNum IN ( ( RowCnt + 1 ) / 2, ( RowCnt + 2 ) / 2 )

    GROUP BY OPT_NUM;

    DROP TABLE #TempTable;

    As Luis mentioned, giving us the exact data you're working with will give us a better chance of figuring this out.

    Cheers!

  • First of all, tHanks for taking the time to look at this. It has been driving me nuts as to why this is not working. I actually just like your sample have created a smaller set just using those 2 fieds with a unique Identifier and it works, however my original dataset is larger with other fields that are not relevant as I am using the ones I indicated in the query to find the Median.

    First I have a temptable with other values as noted below:

    CREATE TABLE #TempTable

    (ID INT IDENTITY PRIMARY KEY,

    OPTN_NUM INT,

    OPTN_NUM_DESC_DSPLY VARCHAR(100),

    CMNTY_DESC_NUM_DSPLY VARCHAR(100),

    PLAN_DESC_ID_DSPLY VARCHAR(100),

    ACT_STATU_DSPLY VARCHAR(100),

    VENDOR_VNAME_NUM VARCHAR(100),

    ACCT_CATEGORY INT,

    BALANCE MONEY,

    ACT_phjob_type INT,

    BUDGET_NBR INT,

    MEDIAN_VALUE MONEY,

    )

    Then i do my queries to populate all other fields, except the Median, which its set to NULL. Then i run my query to find the query for each group and then do a MERGE into the table as follows:

    MERGE INTO #TempTable Temp

    USING (

    SELECT

    OPTN_NUM,

    AVG(BALANCE) AS MedianValue

    FROM

    (

    SELECT

    OPTN_NUM,

    BALANCE,

    ROW_NUMBER() OVER (

    PARTITION BY OPTN_NUM

    ORDER BY BALANCE ASC, Id ASC) AS RowAsc,

    ROW_NUMBER() OVER (

    PARTITION BY OPTN_NUM

    ORDER BY BALANCE DESC, Id DESC) AS RowDesc

    FROM #TempTable SOH

    ) x

    WHERE

    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

    GROUP BY OPTN_NUM

    ) Median

    ON Temp.OPTN_NUM = Median.OPTN_NUM

    WHEN MATCHED THEN

    UPDATE

    SET MEDIAN_VALUE = Median.MedianValue;

    I also put just the MEDIAN query prior to the MERGE to make sure it's giving me the correct values, but nothing! Its still shows 370 for the dataset I gave you above.

    I am lost as to why it's doing that. Here is some origsample data:

    IDOPTN_NUMOPTN_NUM_DESC_DSPLYCMNTY_DESC_NUM_DSPLYPLAN_DESC_ID_DSPLYACT_STATU_DSPLYVENDOR_VNAME_NUMACCT_CATEGORYBALANCEACT_phjob_typeBUDGET_NBRMEDIAN_VALUE

    17182071820 - Bath Ctop - Master Bath - Corian - Level 1Paseo Pte1 Maravilla 3500s - 304235146 - Nuvola - 35146A-ActiveL & M LAMINATES & MARBLE, INC. - 15000122722314375.0081NULL

    Sorry, Idont know how to format better.

    THanks a million.

  • Gentlemen,

    I have found the issue with my query. I was making the mistake of calculating the Median on a NON UNIQUE dataset which I later displayed with DISTINCT, showing me the incorrect value! It was a dumb mistake, but having your input as far as supplying a data sample made me see this, so i thank you so much for your help!

    Really much appreciated as always!

    R.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply