April 28, 2008 at 7:02 am
Jeff Moden (4/28/2008): It really would help if the OP would post exactly what output is expected from the input in the original post, though.
I agree with all your points, I was only trying to explain why people were still answering a question that appeared to have been answered.
The question asked was definitely answered.
This kind of goes with Steve's editorial today. I can't even guess how many times I've been asked how to do something, only to find out after the fact that my "solution" caused more problems: the person who asked me didn't understand their own problem, but succeeded in asking a question that would get the answer "they" wanted, because earlier attempts at asking the right question got them an answer they didn't want. (whew... talk about a run-on 😉 ) Easy software just makes making mistakes easier.
Obviously as an old Oracle DBA, I am used to things not being easy 🙂
April 29, 2008 at 7:49 am
Ok - in short, here's the problem.
I have the performance of various financial factors (P/E, P/B) - by country and by sector - in a SQL table. The performance is on a daily basis. I need to create a mean, stdev, median, 95th percentile, and 5th percentile for each. I have managed the mean and stdev in the pivot table (using built-in functions). I was hoping for an elegant solution where I pass a column to a function which gives me, for example, the median. The structure is simple:
I pass a column (containing n values) to a function. The function orders these, then chooses the (n/2)th value if n is even, or the ((n+1)/2)th value if n is odd, and returns this one value. Easy enough! The trouble is in passing the array.
Given the structure of the pivot table query, I can only ask it (e.g. if DATA is the name of the column) to SUM(DATA) or MEDIAN(DATA). It fails even on MEDIAN(DATA/5) (I know I can do DATA/5 while pulling out - this is just an example).
The reason I am using the pivot table is that the data is in list format, i.e. stock 1, date 1, data 1; stock 1, date 2, data 2... and so on - with a million-plus rows. The pivot table takes a lot of the hassle out of the presentation...
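For what it's worth, the aggregate being described here can be sketched in a few lines of Python before worrying about how to express it in SQL. This is purely illustrative - the function names and the nearest-rank percentile definition are my own choices, not anything from the posts:

```python
import math

def median(values):
    # Order the n values, then take the ((n+1)/2)th one when n is odd,
    # or the average of the (n/2)th and (n/2+1)th ones when n is even.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def percentile(values, p):
    # Nearest-rank percentile: the smallest value whose rank covers p% of
    # the set. This is one of several common definitions, so check which
    # one the consumers of the report actually expect.
    ordered = sorted(values)
    k = min(max(math.ceil(p / 100 * len(ordered)), 1), len(ordered))
    return ordered[k - 1]
```

The hard part in SQL is exactly what the poster ran into: there is no way to hand a whole column to a scalar function like this, which is why the answers below all push the ordering and rank-picking into the query itself.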
April 29, 2008 at 9:04 am
Can't do a whole lot for you if you don't post the "simple structure" of the data you're talking about... please click on, read, and provide the information from the URL in my signature.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 29, 2008 at 9:42 am
Median is mathematically not an aggregate function per se - it's more of a windowing function (meaning it pulls out, unchanged, a specific value in a sequence). You REALLY don't want to compute it as a Transact-SQL scalar function; this should be done as a derived table and then incorporated in.
Median looks something like this in SQL Server 2005:
select groupCol1,groupCol2, orderamount from
(select groupcol1,
groupCol2,
orderamount,
row_number() over (partition by groupcol1,groupcol2 order by orderamount) RN,
Count(*) over (partition by groupcol1,groupcol2) groupcount
from Mytable) r
where rn=ceiling((groupcount+1)*0.5)
You can pivot that if need be - but performance is going to start suffering pretty badly. Perhaps dump this to a temp table, index that, and pivot on the temp table.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 29, 2008 at 8:27 pm
Ok... just sharing... a million-row median calculation can be fast... very fast... here's a chunk of code I posted about 3 years ago... it's demo code, so you'll need to change things to suit your own tables...
--===== If the temp table already exists, drop it
IF OBJECT_ID('TempDB..#Steps') IS NOT NULL
DROP TABLE #Steps
--===== Simulate loading your real temp table with ...
-- A MILLION rows, 17576 reps (3 character "names"), a DeltaT from 0 to 1999,
-- and 4 "Steps" (Step 1, Step 2, etc)
SELECT TOP 1000000 IDENTITY(INT,1,1) AS RowNum,
CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
+ CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
+ CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)) AS Rep,
CAST(RAND(CAST(NEWID() AS VARBINARY))*2000 AS INT) AS DeltaT,
'Step '
+ CAST(CAST(RAND(CAST(NEWID() AS VARBINARY))*4 AS INT)+1 AS VARCHAR(25)) AS MileCode
INTO #Steps
FROM Master.dbo.SYSCOLUMNS sc1,
Master.dbo.SYSCOLUMNS sc2
--===== Every table, even temporary ones, should have a Primary Key...
-- Makes all the difference in the world, performance wise, on this script.
ALTER TABLE #Steps
ADD PRIMARY KEY CLUSTERED (RowNum)
--===== And, trust me, you want this index. It's a "covering" index for the
-- ORDER BYs, the JOINs, and the data that will appear in the code that follows.
CREATE INDEX IX_Steps ON #Steps
(Rep,MileCode,DeltaT)
--===== Return the Median for ALL reps by name and by step.
-- It returns all the "analysis" rows in just 32 seconds.
-- 1 MILLION rows, 17576 Reps, and 4 steps are simulated in the table
-- (note that because this is all random data, some reps may be missing a step, but most have 4)
-- for a total of about 70,200 Medians in just 32 seconds
-- (or 1 Median analysis in only 0.0004558 seconds)
SELECT DISTINCT
m.Rep,
m.MileCode,
(
(SELECT TOP 1 DeltaT FROM
(SELECT TOP 50 PERCENT DeltaT FROM #Steps t
WHERE t.Rep = m.Rep AND t.MileCode = m.MileCode
ORDER BY DeltaT ASC
) lo
ORDER BY DeltaT DESC)
+(SELECT TOP 1 DeltaT FROM
(SELECT TOP 50 PERCENT DeltaT FROM #Steps t
WHERE t.Rep = m.Rep AND t.MileCode = m.MileCode
ORDER BY DeltaT DESC
) hi
ORDER BY DeltaT ASC)
) / 2 AS MEDIAN
FROM #Steps m
ORDER BY m.Rep,m.MileCode
--Jeff Moden
Change is inevitable... Change for the better is not.
April 29, 2008 at 9:15 pm
..by the way - it's probably worth pointing out that there are multiple definitions of Median.
The solution Jeff is showing is what is called the statistical median (the average of the two middle elements of an ordered set if the set's size is even, and the middle item if it is odd), and mine was the "pure math" (from ordered sets/sequential math) definition (the middle element if the set's size is odd and the (n/2+1)th item if it's even).
Depending on who you talk to - they might expect one definition or the other.
And in terms of efficiency - if you do it as a derived table - it should be quite fast either way. The function would NOT be so lucky however.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 29, 2008 at 10:31 pm
(Jeff - psst)....
(1000000 row(s) affected)
=====================
Jeff's Statistical Median Calc
=====================
(70266 row(s) affected)
SQL Server Execution Times:
CPU time = 70031 ms, elapsed time = 42039 ms.
========================
Matt's Mathematical Median Calc
========================
(70266 row(s) affected)
SQL Server Execution Times:
CPU time = 16938 ms, elapsed time = 18221 ms.
So in case someone doesn't specify which one they want - you might get away with the "faster" one....:)
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 29, 2008 at 10:49 pm
One more - this is the "windowed function" version of the statistical median:
select r.Rep,r.MileCode, (r.deltat +r1.DeltaT)*.5 dt from
(select m.Rep,m.MileCode,
deltat,
row_number() over (partition by m.Rep,m.MileCode order by DeltaT) RN,
Count(*) over (partition by m.Rep,m.MileCode) groupcount
from #Steps m) r inner join
(select m.Rep,m.MileCode,
deltat,
row_number() over (partition by m.Rep,m.MileCode order by DeltaT) RN,
Count(*) over (partition by m.Rep,m.MileCode) groupcount
from #Steps m) r1 on r.rep=r1.rep and r.milecode=r1.milecode
where r.rn=ceiling((r.groupcount+1)*0.5)
and r1.rn=floor((r1.groupcount+1)*0.5)
And perf:
SQL Server Execution Times:
CPU time = 30969 ms, elapsed time = 32954 ms.
So - half as fast as the math median, but twice as fast as the SQL 2000-style version of the statistical median....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 29, 2008 at 11:11 pm
Matt Miller (4/29/2008)
(Jeff - psst)....
(1000000 row(s) affected)
=====================
Jeff's Statistical Median Calc
=====================
(70266 row(s) affected)
SQL Server Execution Times:
CPU time = 70031 ms, elapsed time = 42039 ms.
========================
Matt's Mathematical Median Calc
========================
(70266 row(s) affected)
SQL Server Execution Times:
CPU time = 16938 ms, elapsed time = 18221 ms.
So in case someone doesn't specify which one they want - you might get away with the "faster" one....:)
Outstanding! Heh... that's what I get for not testing both sides first.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 29, 2008 at 11:13 pm
Say... do you happen to have the full test slab for that 16 second version?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 30, 2008 at 6:30 am
Jeff Moden (4/29/2008)
Say... do you happen to have the full test slab for that 16 second version?
Sure - I stole your test setup....
--===== If the temp table already exists, drop it
IF OBJECT_ID('TempDB..#Steps') IS NOT NULL
DROP TABLE #Steps
--===== Simulate loading your real temp table with ...
-- A MILLION rows, 17576 reps (3 character "names"), a DeltaT from 0 to 1999,
-- and 4 "Steps" (Step 1, Step 2, etc)
SELECT TOP 1000000 IDENTITY(INT,1,1) AS RowNum,
CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
+ CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
+ CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)) AS Rep,
CAST(RAND(CAST(NEWID() AS VARBINARY))*2000 AS INT) AS DeltaT,
'Step '
+ CAST(CAST(RAND(CAST(NEWID() AS VARBINARY))*4 AS INT)+1 AS VARCHAR(25)) AS MileCode
INTO #Steps
FROM Master.dbo.SYSCOLUMNS sc1,
Master.dbo.SYSCOLUMNS sc2
--===== Every table, even temporary ones, should have a Primary Key...
-- Makes all the difference in the world, performance wise, on this script.
ALTER TABLE #Steps
ADD PRIMARY KEY CLUSTERED (RowNum)
--===== And, trust me, you want this index. It's a "covering" index for the
-- ORDER BYs, the JOINs, and the data that will appear in the code that follows.
CREATE INDEX IX_Steps ON #Steps
(Rep,MileCode,DeltaT)
go
--===== Return the Median for ALL reps by name and by step.
-- It returns all the "analysis" rows in just 32 seconds.
-- 1 MILLION rows, 17576 Reps, and 4 steps are simulated in the table
-- (note that because this is all random data, some reps may be missing a step, but most have 4)
-- for a total of about 70,200 Medians in just 32 seconds
-- (or 1 Median analysis in only 0.0004558 seconds)
Print replicate('=',50)
print 'Jeffs Statistical Median'
Print replicate('=',50)
Set statistics time on
SELECT DISTINCT
m.Rep,
m.MileCode,
(
(SELECT TOP 1 DeltaT FROM
(SELECT TOP 50 PERCENT DeltaT FROM #Steps t
WHERE t.Rep = m.Rep AND t.MileCode = m.MileCode
ORDER BY DeltaT ASC
) lo
ORDER BY DeltaT DESC)
+(SELECT TOP 1 DeltaT FROM
(SELECT TOP 50 PERCENT DeltaT FROM #Steps t
WHERE t.Rep = m.Rep AND t.MileCode = m.MileCode
ORDER BY DeltaT DESC
) hi
ORDER BY DeltaT ASC))
/ 2 AS MEDIAN
FROM #Steps m
ORDER BY m.Rep,m.MileCode
Set statistics time off
Print replicate('=',50)
print 'Windowed Statistical Median'
Print replicate('=',50)
Set statistics time on
select r.Rep,r.MileCode, (r.deltat +r1.DeltaT)*.5 dt from
(select m.Rep,m.MileCode,
deltat,
row_number() over (partition by m.Rep,m.MileCode order by DeltaT) RN,
Count(*) over (partition by m.Rep,m.MileCode) groupcount
from #Steps m) r inner join
(select m.Rep,m.MileCode,
deltat,
row_number() over (partition by m.Rep,m.MileCode order by DeltaT) RN,
Count(*) over (partition by m.Rep,m.MileCode) groupcount
from #Steps m) r1 on r.rep=r1.rep and r.milecode=r1.milecode
where r.rn=ceiling((r.groupcount+1)*0.5)
and r1.rn=floor((r1.groupcount+1)*0.5)
Set statistics time off
Print replicate('=',50)
print 'Windowed Math Median'
Print replicate('=',50)
Set statistics time on
select r.Rep,r.MileCode, r.deltat from
(select m.Rep,m.MileCode,
deltat,
row_number() over (partition by m.Rep,m.MileCode order by DeltaT) RN,
Count(*) over (partition by m.Rep,m.MileCode) groupcount
from #Steps m) r
where r.rn=ceiling((r.groupcount+1)*0.5)
Set statistics time off
I was being funny last night - just remember the "math" version and the "stats" version actually give you different results (based on their different definitions)....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 30, 2008 at 7:11 am
I forgot that my test machine was running a full backup and a virus scan last night...Without the impediments:
(1000000 row(s) affected)
==================================================
Jeffs Statistical Median
==================================================
(70271 row(s) affected)
SQL Server Execution Times:
CPU time = 22266 ms, elapsed time = 12282 ms.
==================================================
Windowed Statistical Median
==================================================
(70271 row(s) affected)
SQL Server Execution Times:
CPU time = 11765 ms, elapsed time = 11950 ms.
==================================================
Windowed Math Median
==================================================
(70271 row(s) affected)
SQL Server Execution Times:
CPU time = 5719 ms, elapsed time = 5727 ms.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 30, 2008 at 8:05 am
Matt Miller (4/30/2008)
I was being funny last night - just remember the "math" version and the "stats" version actually give you different results (based on their different definitions)....
Not a problem... I'm just interested in how you're doing this and wanted to play a bit. Thanks, Matt.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 30, 2008 at 8:18 am
Jeff Moden (4/30/2008)
Matt Miller (4/30/2008)
I was being funny last night - just remember the "math" version and the "stats" version actually give you different results (based on their different definitions)....
Not a problem... I'm just interested in how you're doing this and wanted to play a bit. Thanks, Matt.
By the way - once the caffeine infusion kicked in this morning - I realized that I'm taking two passes to do what I can do in just one in the "windowed Stat version".
Set statistics time off
Print replicate('=',50)
print 'Windowed Stat Median - method #2'
Print replicate('=',50)
Set statistics time on
select r.Rep,r.MileCode, Avg(r.deltat) dt
from
(select m.Rep,m.MileCode,
deltat,
row_number() over (partition by m.Rep,m.MileCode order by DeltaT) RN,
Count(*) over (partition by m.Rep,m.MileCode) groupcount
from #Steps m) r
where r.rn = ceiling((r.groupcount+1)*0.5) or
r.rn = floor((r.groupcount+1)*0.5)
group by r.Rep,r.MileCode
order by r.Rep,r.MileCode
Set statistics time off
Damn near as fast as the math median at this point....
SQL Server Execution Times:
CPU time = 6266 ms, elapsed time = 6421 ms.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?