October 4, 2012 at 9:27 am
Hi All,
I have a large dataset containing distributions of a variety of measures. I need to extract what I'd call smoothed percentiles - find the appropriate record for the percentile in question, then average N values either side of it. I also need to get these results for multiple percentile points at once, eg 0.3, 0.5 , 0.7. I'm not a whiz with SQL so please tolerate ugly code in my attempt so far!
A table can be created using
CREATE TABLE PercTest
(Trial smallint, Measure char(12), Value float)
INSERT INTO PercTest
SELECT '1', 'Investments', '100.5' UNION ALL
SELECT '2', 'Investments', '104.8' UNION ALL
SELECT '3', 'Investments', '92.5' UNION ALL
SELECT '4', 'Investments', '107.1' UNION ALL
SELECT '5', 'Investments', '100.4' UNION ALL
SELECT '6', 'Investments', '93.5' UNION ALL
SELECT '7', 'Investments', '110.4' UNION ALL
SELECT '8', 'Investments', '101.3' UNION ALL
SELECT '9', 'Investments', '109.9' UNION ALL
SELECT '10', 'Investments', '98.1' UNION ALL
SELECT '1', 'Premium', '20.6' UNION ALL
SELECT '2', 'Premium', '43.3' UNION ALL
SELECT '3', 'Premium', '62.4' UNION ALL
SELECT '4', 'Premium', '10.5' UNION ALL
SELECT '5', 'Premium', '13.6' UNION ALL
SELECT '6', 'Premium', '53.8' UNION ALL
SELECT '7', 'Premium', '70.2' UNION ALL
SELECT '8', 'Premium', '39.5' UNION ALL
SELECT '9', 'Premium', '12.7' UNION ALL
SELECT '10', 'Premium', '18.5'
For these inputs, with a window size of one trial either side, what I'd hope for as an output is:
Measure | Value| Percentile
Investments | 97.33| 0.3
Premium | 14.93| 0.3
Investments |100.73| 0.5
Premium | 26.2 | 0.5
Investments |103.05| 0.7
Premium | 41.4 | 0.7
So far I have a fairly painstaking way of doing this by creating an intermediate table which adds the percentiles like so:
SELECT
Trial,
Measure,
Value,
CONVERT(float,
(RANK()OVER(
PARTITION BY Measure
ORDER BY Value ASC)))
/(MAX(Trial) OVER(PARTITION BY Measure)) AS Percentile
INTO PercTestIntermediate
FROM PercTest
And then I have this little bit of code which extracts values for all measures but only the one given percentile.
DECLARE @target float
DECLARE @window_size float
DECLARE @window_offset float
--Change the desired percentile here
SET @target = 0.7
--Change the size of the averaging window - note this is the number of trials
--which will be considered from each side. This is float to force floating
--point division
SET @window_size = 1.0
SELECT @window_offset=@window_size/MAX(Trial) FROM [ICMDataMartDev].[HISCOX\LawrencA].[PercTest]
SELECT Measure, Avg(Value) AS Value, @target AS Percentile
FROM [ICMDataMartDev].[HISCOX\LawrencA].[PercTestIntermediate]
WHERE Percentile between @target-@window_offset and @target+@window_offset
GROUP BY Measure
I'd appreciate input on the value of the intermediate table (I don't know if that could be used as one of these CTEs I've seen mentioned) but the thing that's giving me the biggest headache at the moment is trying to get multiple percentile 'windows' calculated in one query. I can run my kludgy one with different values and UNION ALL the results, but that strikes me as A Bad Way To Do Things.
What Would SQLServerCentral Folks Do?
October 16, 2013 at 1:59 pm
Try the following out. It uses Jeff Moden's DelimitedSplit8K function which I didn't include. I highly recommend that you read the (absolutely awesome!) article on it (http://www.sqlservercentral.com/articles/Tally+Table/72993/). My method uses Excel's method of calculating the percentile, which is just one of many. I selected that one because it can easily be verified. I think that this code doesn't work properly for the 0 and 100 percentiles but those are just the min and max of the data.
DECLARE @Percentiles AS VARCHAR(100) = '0.3, 0.5, 0.7'; --Desired percentiles
WITH Data
AS ( SELECT Measure ,
Value ,
ROW_NUMBER() OVER ( PARTITION BY Measure ORDER BY Value ) AS rn
FROM dbo.PercTest AS pt
),
Percentiles
AS ( SELECT CAST(Item AS DECIMAL(9, 5)) AS Percentile
FROM dbo.DelimitedSplit8k(@Percentiles,',') AS DS8K --Search SQLServerCentral for Jeff Moden's DelimitedSplit8K for code
),
Calc
AS ( SELECT Data.Measure ,
Percentiles.Percentile ,
CAST(( COUNT(*) - 1 ) * Percentiles.Percentile AS INT) --Uses Excel's version of calculating percentile
+ 1 AS k ,
( COUNT(*) - 1 ) * Percentiles.Percentile
- CAST(( COUNT(*) - 1 ) * Percentiles.Percentile AS INT) AS d ,
COUNT(*) AS totalCount
FROM Data
CROSS JOIN Percentiles
GROUP BY Data.Measure ,
Percentiles.Percentile
)
SELECT Data.Measure ,
Calc.Percentile ,
SUM(CASE WHEN Data.rn = Calc.k THEN ( 1 - Calc.d ) * Data.Value
WHEN Data.rn = Calc.k + 1 THEN Calc.d * Data.Value
ELSE NULL
END) AS Value
FROM Data
INNER JOIN Calc ON Data.Measure = Calc.Measure
AND Data.rn IN ( Calc.k, Calc.k + 1 )
GROUP BY Data.Measure ,
Calc.Percentile
ORDER BY Data.Measure ,
Calc.Percentile
Edit: Error in the SQL code
Viewing 2 posts - 1 through 1 (of 1 total)
You must be logged in to reply to this topic. Login to reply