January 17, 2013 at 12:45 pm
Hi All - My boss asked me to create a boxchart using SSRS for some of our data. I read articles all morning but honestly, I'm at a loss on how accomplish this. I think I'm growing in my SQL knowledge, but this one has me stumped. I created a sample table that looks like what I'll be working with below. If anyone could lend a hand it would really help me!
The output should be something like below with values filled in for x:
Year Median Max Min LowerQ UpperQ
2010 x x x x x
2011 x x x x x
2012 x x x x x
CREATE TABLE EStats
(
PersonIDVARCHAR(30)NOT NULL,
GradeVARCHAR(25)NOT NULL,
CourseDateDateNOT NULL
)
;
INSERT INTO EStats
(
PersonID, Grade, CourseDate
)
VALUES
('100', '91', '2010-03-01'),
('101', '96', '2010-03-01'),
('102', '88', '2010-03-01'),
('103', '92', '2010-03-01'),
('104', '81', '2010-03-01'),
('105', '85', '2010-03-01'),
('106', '91', '2010-03-01'),
('107', '89', '2010-03-01'),
('108', '99', '2010-03-01'),
('109', '88', '2010-03-01'),
('110', '81', '2011-03-02'),
('111', '77', '2011-03-02'),
('112', '88', '2011-03-02'),
('113', '76', '2011-03-02'),
('114', '69', '2011-03-02'),
('115', '70', '2011-03-02'),
('116', '75', '2011-03-02'),
('117', '88', '2011-03-02'),
('118', '76', '2011-03-02'),
('119', '95', '2012-03-01'),
('120', '96', '2012-03-01'),
('121', '90', '2012-03-01'),
('122', '80', '2012-03-01'),
('123', '85', '2012-03-01'),
('124', '94', '2012-03-01'),
('125', '89', '2012-03-01'),
('126', '97', '2012-03-01'),
('127', '94', '2012-03-01'),
('128', '72', '2012-03-01'),
('129', '88', '2012-03-01'),
('130', '91', '2012-03-01')
January 17, 2013 at 1:22 pm
Assume it is the median value that has you stumped, try reading this, it works for myself.
January 17, 2013 at 1:35 pm
I had to do this as well, so you get my supply of bookmarks to help you on your way:
http://www.sqlmag.com/article/tsql3/calculating-the-median-gets-simpler-in-sql-server-2005
http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx
The first one by Itzik Ben-Gan is the one I used.
And lastly, a link to one of my prior posts here on SSC. It includes some add'l explanatory notes I wrote to remind myself how Ben-Gan's logic actually works:
http://www.sqlservercentral.com/Forums/Topic923660-338-1.aspx#bm1105837
HTH,
Rich
January 17, 2013 at 2:27 pm
Awesome. Thanks a ton!
January 17, 2013 at 4:40 pm
Actually, I'm having some trouble converting over the example to my sample table above. I just took the grades for 03-01-2012. The median should be the average between 90 and 91 (so 90.5), correct? This gets that result set:
SELECT grade
FROM estats
WHERE CourseDate = '2012-03-01'
ORDER BY grade DESC
But when I run the below (which is an attempt to convert the example to my sample data above, I get 89.25. I'm sure I'm making a dumb mistake somewhere, but any help is appreciated.
WITH cte AS
(
SELECT EStats.PersonID, EStats.Grade,
ROW_NUMBER() OVER(PARTITION BY EStats.PersonID ORDER BY EStats.GRADE) AS RN,
COUNT(*) OVER(PARTITION BY EStats.PersonID) AS Cnt
FROM EStats
WHERE EStats.CourseDate = '2012-03-01'
)
SELECT AVG(CAST(cte.Grade AS Numeric)) AS Median
FROM cte
WHERE RN IN((cte.Cnt + 1) / 2, (Cnt + 2) / 2)
January 17, 2013 at 6:24 pm
Maybe you need to:
PARTITION BY LEFT(CourseDate, 4)
?
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
January 22, 2013 at 7:38 am
Alright, its time to confess I'm still not getting this. Would anyone mind taking my original DDL and calculating median and lower/upper quartiles? I think if I can see it with my example data it would click for me. Any help is really appreciated!
January 22, 2013 at 8:33 am
Below another attempt, but I seem to be running into the same problem. For instance, 2012-03-01 should produce 90.5, but the below gives me 89.25.
SELECT
CourseDate,
AVG(CAST(Grade AS Numeric))
FROM
(
SELECT
CourseDate,
Grade,
ROW_NUMBER() OVER (
PARTITION BY PersonID
ORDER BY Grade ASC, CourseDate ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY PersonID
ORDER BY Grade DESC, CourseDate DESC) AS RowDesc
FROM EStats
)x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CourseDate
ORDER BY CourseDate
UPDATE: I think the problem is my PARTITION BY isn't working. In other words, the inner query produces a row_number with all 1's so there is nothing to match. Any ideas on a fix?
January 22, 2013 at 8:43 am
Okay, using Dwain's suggestion I'm close to getting everything working. My only problem is the row_number in my inner query isn't working well when the numbers are the same. For instance, when you run my inner query...
SELECT
CourseDate,
Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats
...you'll notice it does this for 2010-03-01:
RowAsc
10
9
8
6
7
5
3
4
2
1
I think that is throwing off the snippet I added which Dwain suggested (THANKS, by the way!). Any suggestions on how to fix this?
For those interested, here's what I have. I'm sure its about the worst way I could do this, but its all I've got at this stage in my SQL knowledge! Once I get the row_number ordering issue worked out, I think it will work.
WITH Q3 AS
(
SELECT
CourseDate,
AVG(CAST(Grade AS Numeric)) AS Median
FROM
(
SELECT
CourseDate,
Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats
)x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CourseDate
--ORDER BY CourseDate
),
Q2 AS
(
SELECT
x.CourseDate,
AVG(CAST(Grade AS Numeric)) AS LowerQuartile
FROM
(
SELECT
Estats.CourseDate,
Estats.Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(EStats.CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(Estats.CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate
WHERE EStats.Grade < Q3.Median
)x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY x.CourseDate
),
Q4 AS
(
SELECT
x.CourseDate,
AVG(CAST(Grade AS Numeric)) AS UpperQuartile
FROM
(
SELECT
Estats.CourseDate,
Estats.Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(EStats.CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(Estats.CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate
WHERE EStats.Grade > Q3.Median
)x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY x.CourseDate
)
SELECT Q3.CourseDate, Q3.Median AS Median, Q2.LowerQuartile, Q4.UpperQuartile, MIN(EStats.Grade) AS Min, MAX(EStats.Grade) AS Max
FROM Q3
JOIN Q2 ON Q3.CourseDate = Q2.CourseDate
JOIN Q4 ON Q3.CourseDate = Q4.CourseDate
JOIN EStats ON Q3.CourseDate = EStats.CourseDate
GROUP BY Q3.CourseDate, Q3.Median, Q2.LowerQuartile, Q4.UpperQuartile
ORDER BY Q3.CourseDate
January 22, 2013 at 1:37 pm
I'm just going to bump this because I rewrote the above post and added what I've got to date...
January 22, 2013 at 6:00 pm
DataAnalyst011 (1/17/2013)
The output should be something like below with values filled in for x:
Year Median Max Min LowerQ UpperQ
2010 x x x x x
2011 x x x x x
2012 x x x x x
CREATE TABLE EStats
(
PersonIDVARCHAR(30)NOT NULL,
GradeVARCHAR(25)NOT NULL,
CourseDateDateNOT NULL
)
;
INSERT INTO EStats
(
PersonID, Grade, CourseDate
)
VALUES
('100', '91', '2010-03-01'),
('101', '96', '2010-03-01'),
('102', '88', '2010-03-01'),
('103', '92', '2010-03-01'),
('104', '81', '2010-03-01'),
('105', '85', '2010-03-01'),
('106', '91', '2010-03-01'),
('107', '89', '2010-03-01'),
('108', '99', '2010-03-01'),
('109', '88', '2010-03-01'),
('110', '81', '2011-03-02'),
('111', '77', '2011-03-02'),
('112', '88', '2011-03-02'),
('113', '76', '2011-03-02'),
('114', '69', '2011-03-02'),
('115', '70', '2011-03-02'),
('116', '75', '2011-03-02'),
('117', '88', '2011-03-02'),
('118', '76', '2011-03-02'),
('119', '95', '2012-03-01'),
('120', '96', '2012-03-01'),
('121', '90', '2012-03-01'),
('122', '80', '2012-03-01'),
('123', '85', '2012-03-01'),
('124', '94', '2012-03-01'),
('125', '89', '2012-03-01'),
('126', '97', '2012-03-01'),
('127', '94', '2012-03-01'),
('128', '72', '2012-03-01'),
('129', '88', '2012-03-01'),
('130', '91', '2012-03-01')
I would love to try and help but help me out please. Can you fill in the expected results with exactly the real results that you expect to see from the given sample data?
I initially did some fumbling around but I couldn't quite 'ken exactly what you were after so I kind of punted with the suggestion I made earlier.
If I have a clear target, I'm usually pretty good at hitting it.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
January 23, 2013 at 7:45 am
Thanks a ton, Dwain.
Just for background (may not be needed) here are a couple of links regarding the math for box-and-whisker charts (or box plots)
http://www.purplemath.com/modules/boxwhisk.htm
Essentially, I need five things: median, upper quartile, lower quartile, min, and max. Median, min, and max are pretty self explanatory. The lower quartile for a box plot takes the median of all of the numbers below the median for the entire set. The upper quartile takes the median for all the numbers above the median for the entire set.
With that in view, here should be the result set for my sample DDL:
Year Median Max Min LowerQ UpperQ
2010 90 99 81 88 92
2011 76 88 69 72.5 84.5
2012 90.5 97 72 86.5 94.5
Again, thanks alot for taking this on. If I can provide further explanation, please let me know!
January 23, 2013 at 11:30 am
EDIT: On review, the original dataset above is correct. I really apologize for the confusion. I'll put in back in when I get back to my office.
January 23, 2013 at 1:33 pm
Okay, two posts above does have the correct dataset (which was the original dataset). I was confused for a moment whether the median in a quartile included the two averaged numbers when total row count is even. It does. In the case of the values 1 through 10 the median would be 5.5, and the quartiles would be the median of everything above 5.5 (6, 7, 8, 9, 10) and below 5.5 (5, 4, 3, 2, 1). So the upper quartile would be 8 and the lower quartile would be 3.
Sorry for the momentary confusion!
January 23, 2013 at 2:50 pm
Hi
Thought I would have a go at this. I've used a CTE to order and number the results by course year.
I've also created the statistics in separate queries to try and make it a be easier to manage and read.
To determine the medians I have used rounding to determine low and high row number for each median. These can be the same number. There is probably a nicer way to do this:-)
I've left the columns in the queries that I was using to validate my median choices.
They are all joined together in the final query. I have also added a Geometry to visualize it.
;with cte as (
select ROW_NUMBER() OVER (PARTITION BY year(CourseDate) ORDER BY GRADE) RN,
COUNT(*) OVER (PARTITION BY year(CourseDate)) C,
ROUND(((COUNT(*) OVER (PARTITION BY year(CourseDate)) + 1) / 2.0) - .1, 0) HC1,
ROUND(((COUNT(*) OVER (PARTITION BY year(CourseDate)) + 1) / 2.0) + .1, 0) HC2,
year(CourseDate) CourseYear,
CAST(Grade AS NUMERIC(3)) Grade
from EStats
)
,minmax as (
select courseYear, min(Grade) minGrade, max(Grade) maxGrade, min(c) c
from cte
group by courseYear
)
,median as (
select courseYear, avg(grade) medianGrade
, cast(min(hc1) as int) hc1, cast(min(hc2) as int) hc2
from cte
where rn in (hc1, hc2)
group by courseYear
)
,lowQtr as (
select courseyear, avg(grade) lowQtrGrade
,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0)) as int) l1,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0)) as int) l2
from cte
where rn in (round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0),round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0))
group by courseyear
)
,highQtr as (
select courseyear, avg(grade) highQtrGrade
,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0) + hc1) as int) h1,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0) + hc1) as int) h2
from cte
where rn in (round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0) + hc1,round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0) + hc1)
group by courseyear
)
select mm.courseYear,
cast(md.medianGrade as numeric(5,2)) Median,
cast(mm.maxGrade as numeric(5,2)) Max,
cast(mm.minGrade as numeric(5,2)) Min,
cast(lq.lowQtrGrade as numeric(5,2)) LowerQ,
cast(hq.highQtrGrade as numeric(5,2)) UpperQ,
Geometry::STGeomFromText(
'MULTILINESTRING((' +
cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- minTick
cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- maxTick
cast(cast(medianGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(medianGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- medianTick
cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- lowerTick
cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- upperTick
cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 as varchar(5)) + ',' + cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 as varchar(5)) + '),(' + -- lowWhisker
cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 as varchar(5)) + ',' + cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 as varchar(5)) + '),(' + -- highWhisker
cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 - 2.5 as varchar(5)) + '),(' + -- box1
cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 + 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + -- box2
+ '))'
,0) graph
from minmax mm
inner join median md on md.courseyear = mm.courseyear
inner join lowQtr lq on lq.courseyear = mm.courseyear
inner join highQtr hq on hq.courseyear = mm.courseyear
Viewing 15 posts - 1 through 15 (of 20 total)
You must be logged in to reply to this topic. Login to reply