June 12, 2007 at 1:19 pm
Hello,
First time posting, stuck on a problem. Please bear with me if what I say is not proper for i am probably not worthy compared to the rest of you.
I am stuck on the coding as well as the approach to solve this problem.
I am trying to get the following values for these fields:
1)Number of Orders - I use count(distinct orderID) , 2) Days BackOrdered - I use datediff(days,orders.orderdate,getdate())
Median -
Mode -
Mean -
so essentially, i should have the median(NumberofOrders),median(daysbackordered),mode(NumberofOrders)
,mode(daysbackordered),mean(NumberofOrders),mean(DaysBackOrdered)
This is going to be grouped by an OrderChannel and Client.
1 client has many OrderChannels
Here is the basic structure of the tables invovled (There are more tables invovled, but these are probably all that matter to the problem)
Orders OrderChannels Client
1)orderID 1)OrderChannelID 1)ClientID
2)orderDate 2)ClientID
3)OrderChannelID
Problems:
1) I don't know how to calculate for mode, median, mean(average())
*** Attempted to find the median, however most things I have read require you to know the rowcount of the return set, for even and odd there is different code, I am assuming you can combine them in a case statement.. Is there a generic way without knowing how many rows are returned?
2) I don't know how to approach this problem.
Please all help would be appreciated, I am a SQL noobie.
First time on this site but read through some threads so I decided to register
David Jackson Or David Mcpharlen please help me wise ones!
you guys rock!
June 12, 2007 at 3:12 pm
to get a median:
select top 1 [value_a]
from (select top 50 percent [value_a] from [junkit] order by [value_a] asc) a
order by [value_a] desc
to get mode
select top 1 [value_a] from (select top 100 percent [value_a],count([value_a]) as cnt from [junkit] group by [value_a] order by count([value_a]) desc) a
June 12, 2007 at 3:26 pm
I Need the median, mode, mean for 2 functions, one being a count and the other is a datediff -
SHould I Perform the count and datediff insert the values into a temp, then sum those values by orderchannel(what i Need) insert that into another temp table, then getting the mode, mean, median from there?
This should be a summary report which shows the total # of orders and total Number of Days BackOrdered
by OrderChannel
June 12, 2007 at 3:34 pm
June 12, 2007 at 3:39 pm
Thanks for the prompt reponse.....
In my original post, isn't that enough information for you
June 12, 2007 at 5:02 pm
No, mrpolecat needs to know how you want to handle ties, whether they be in the form of even number of rows for median, or equal maximum frequencies for mode. I'll demonstrate the problem, and you can tell us what should happen in these cases.
Standard median with odd number of rows
1 4 10 - Median = 4.
Now, with even number of rows:
1 4 10 15 - Median = 4 or 10, depending on how you want it handled. Both are equally in the middle.
The mode demonstration without ties
1 1 2 3 4 4 6 7 7 7 8 8 8 8 9 - Mode = 8.
Now, with a tie.
1 1 1 1 2 3 4 4 5 5 5 5 6 6 - Mode = 1 or 5, depending on how you want it handled. Both have the same frequency of occurrence.
Answer those questions, and the code can be tweaked.
June 12, 2007 at 5:41 pm
June 12, 2007 at 6:05 pm
I had versions of all three at a previous contract, but their's were overly complicated for your needs, as they had all sorts of parameters allowing you to eliminate zeros, drop top and bottom x% prior to calculating, etc., if desired. They also weren't particularly fast because of those modifications.
June 12, 2007 at 10:36 pm
There are two basic types of medians and, as David McFarland points out, there are several special mods like not counting zeros, dropping the ends, and a whole bunch of other renditions depending on how you're trying to bend the numbers.
But, the two basic types of Medians are the Statisical Median and the Mathematical Median. In their unadulterated state, they produce the same result for an odd number of rows.
Let's talk about an even number of rows... think "Integer Math" when you think of Statisical Median and, borrowing from my other post on the subject, let's say we have 12 rows numbered 1 through 12... Like MrPoleCat's algorithym, the 6th row, whatever it contains (and sorted, of course), will be the Statisical Median. A way to calculate that the 6th row is the Statisical Median is to just use Integer math...
PRINT 12/2 --Implicit Integer Math.
-----------
6
(1 row(s) affected)
The "algorithym" used can be construed to mean the largest top 1 of the first 50% (indicated by /2) just as MrPoleCat's code does.
Now, the Mathematical Median for those 12 rows is just a wee bit different... you'd need to find the values for the 6th and 7th rows (they are each nearest the middle of the 12 row ordered set) and take the average of those two values or (6+7)/2.0 or 6.5. Notice that we're not just dividing by 2 which would be implicit integer math... we're dividing by 2.0 which is implicit FLOAT math.
If you have SQL Server 2005, then Itzik Ben-Gan has some great median calculations using some of the newer functions like RowNumber and Partition OVER to solve the problem in a very elegant fashion. Those solutions can be found at http://www.sqlmag.com/Articles/Index.cfm?ArticleID=49827&DisplayTab=Article . I can't test those solutions because I don't have SQL Server 2005, so I can't vouch for their performance one way or another.
I've verified that his SQL Server 2000 solutions aren't so hot, though, because they use aggragates to solve for the medians and THOSE take a very long time.
But first, we need a pant load of test data A million rows should do just fine... since you're a "newbie", you need to lookup CROSS JOINS, the IDENTITY function, SELECT/INTO, and a whole wad of other things to understand how the following snippet of code works to build a million rows of randomized test data...
--===== Create and populate a 1,000,000 row test table.
-- Column RowNum has a range of 1 to 1,000,000 unique numbers
-- Column "SomeInt" has a range of 1 to 1,000 non-unique numbers
-- Column "SomeString" has a range of "A" to "Z" non-unique 1 character strings
-- Column "SomeNumber has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Takes about 42 seconds to execute.
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
SomeInt = CAST(RAND(CAST(NEWID() AS VARBINARY))*1000+1 AS INT),
SomeString = CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)),
SomeCSV = 'Part1,Part2,Part3,Part4,Part5,Part6,Part7,Part8,Part9,Part10',
SomeNumber = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY),
SomeDate = CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2 --Lack of join criteria makes this a CROSS-JOIN
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.JBMTest
ADD PRIMARY KEY CLUSTERED (RowNum)
Now, we can "play"...
If we want the Statisical Mean of ALL of the rows in the RowNum column, we could use MrPoleCat's code...
SELECT TOP 1 RowNum
FROM (--Derived table "d" finds the top half of all the RowNum's
SELECT TOP 50 PERCENT RowNum
FROM dbo.JBMTest
ORDER BY RowNum ASC
)d
ORDER BY d.RowNum DESC
And we get the expected 500,000 answer (just to prove his code works absolutely correctly... we're not actually going to use RowNum for anything else)
We can do the same with the "SomeNumber" column, as well...
--===== Find Statisical median of the SomeNumber column
SELECT TOP 1 SomeNumber
FROM (--Derived table "d" finds the top half of all the SomeNumber's
SELECT TOP 50 PERCENT SomeNumber
FROM dbo.JBMTest
ORDER BY SomeNumber ASC
)d
ORDER BY d.SomeNumber DESC
Because of the randomness of the numbers, you won't get exactly 50.0000 (about half of 99.9999) but it'll be close and gives us a marker just for a sanity check.
Again, that's the Statisical Median... how on Earth are we going to get the Mathematical Median which requires us to find the two numbers closest to the center? Again, we'll use MrPoleCat's code with a twist from the other side of the data (Note that the RowNum column is of absolutely no help here because the SomeNumber data is NOT in order!)...
--===== Find Mathematical Median of SomeNumber
SELECT (
(
SELECT TOP 1 lo.SomeNumber
FROM (--Derived table "lo" finds the lowest half of all the SomeNumber's
SELECT TOP 50 PERCENT SomeNumber
FROM dbo.JBMTest
ORDER BY SomeNumber ASC
)lo
ORDER BY lo.SomeNumber DESC
)
+ --Lets add the two numbers, whatever they are, together (same # for odd number of rows)
(
SELECT TOP 1 hi.SomeNumber
FROM (--Derived table "hi" finds the highest half of all the SomeNumber's
SELECT TOP 50 PERCENT SomeNumber
FROM dbo.JBMTest
ORDER BY SomeNumber DESC
)hi
ORDER BY hi.SomeNumber ASC
)
) / 2.0
... that's IF you want the median for the WHOLE set of rows... and, IF that's all you need, we're all done...
But, your original post stated that you have multiple "OrderChannels"... let's just pretend that the "SomeString" column (A through Z or 26 combinations of 1 letter each) is your "OrderChannelID". How are we going to calculate the Mathematical Median for each? Cursor? WHILE LOOP? Again, as MrPoleCat suggested, "No", theres another way... all we have to do is relate the calculations we do for Median to each "OrderChannelID"... we can do that with a handy little tool known as a "Correlated Subquery" in the SELECT list.
A "Correlated Subquery" is nothing more than a query that "correlates" to something outside the subquery... in this case, "OrderChannelID"... (Don't forget... we're pretending that SomeString is the "OrderChannelID")... (I've bolded all the new code added to the previous code below).
Here's the code... uh, wait... we're gonna need another index first... run this to create the index we need to make this finish with a bit of performance in mind...
CREATE INDEX IX_JBMTest_SomeString_SomeNumber
ON dbo.JBMTest (SomeString, SomeNumber DESC)
Now, here's the code to provide the Mathematical Median for each "OrderChannelID" (SomeString).
--===== Find Mathematical Median of SomeNumber for each "OrderChannelID" (SomeString)
SELECT DISTINCT
m.SomeString,
(
(
SELECT TOP 1 lo.SomeNumber
FROM (--Derived table "lo" finds the lowest half of all the SomeNumber's
SELECT TOP 50 PERCENT tlo.SomeNumber
FROM dbo.JBMTest tlo
WHERE tlo.SomeString = m.SomeString
ORDER BY tlo.SomeNumber ASC
)lo
ORDER BY lo.SomeNumber DESC
)
+ --Lets add the two numbers, whatever they are, together (same # for odd number of rows)
(
SELECT TOP 1 hi.SomeNumber
FROM (--Derived table "hi" finds the highest half of all the SomeNumber's
SELECT TOP 50 PERCENT thi.SomeNumber
FROM dbo.JBMTest thi
WHERE thi.SomeString = m.SomeString
ORDER BY thi.SomeNumber DESC
)hi
ORDER BY hi.SomeNumber ASC
)
) / 2.0 AS Median
FROM dbo.JBMTest m
ORDER BY m.SomeString
Imagine that... 26 median calculations on a million rows of data in about 19 seconds...
Still not done, though, are we? You also said you wanted it by ClientID by OrderChannelID... as before... let's make another index for a bit of speed (you can drop the other one, if you want)... (note: We'll use the "SomeINT" column to simulate 1,000 ClientID's)...
CREATE INDEX IX_JBMTest_SomeInt_SomeString_SomeNumber
ON dbo.JBMTest (SomeInt, SomeString, SomeNumber DESC)
By the way... without the index above, the code below takes almost 5 minutes to run ... with the index, it only takes about 21 seconds to return 26,000 median results from a million rows... Again, I've bolded the code that was added...
--===== Find Mathematical Median of SomeNumber for each ClientID (SomeINT) and "OrderChannelID" (SomeString)
SELECT DISTINCT
m.SomeInt,
m.SomeString,
(
(
SELECT TOP 1 lo.SomeNumber
FROM (--Derived table "lo" finds the lowest half of all the SomeNumber's
SELECT TOP 50 PERCENT tlo.SomeNumber
FROM dbo.JBMTest tlo
WHERE tlo.SomeINT = m.SomeInt AND tlo.SomeString = m.SomeString
ORDER BY tlo.SomeNumber ASC
)lo
ORDER BY lo.SomeNumber DESC
)
+ --Lets add the two numbers, whatever they are, together (same # for odd number of rows)
(
SELECT TOP 1 hi.SomeNumber
FROM (--Derived table "hi" finds the highest half of all the SomeNumber's
SELECT TOP 50 PERCENT thi.SomeNumber
FROM dbo.JBMTest thi
WHERE thi.SomeINT = m.SomeInt AND thi.SomeString = m.SomeString
ORDER BY thi.SomeNumber DESC
)hi
ORDER BY hi.SomeNumber ASC
)
) / 2.0 AS Median
FROM dbo.JBMTest m
ORDER BY m.SomeInt, m.SomeString
I'm not going to do all the examples you asked for... but similar methods can be created for dates, etc. And, for the MODE, MrPoleCat's code works just fine... but you still need to tell folks how you want to handle "ties" before anyone can help you with that.
The rest of it is all just "normal" SQL... lookup COUNT and AVERAGE in Books OnLine to figure that stuff out.
Oh, yeah... why did Itzik Ben-Gan use the slower aggragate method? There's a nasty rumor (I call it a "rumor" only because I can't test it... no SQL Server 2005 for me, yet) that "Nested Tops" don't work especially in SQL Server 2005... dunno if that's true for 2k5 but that rumor is a wad of hooie for SQL Server 2000... ya just gotta hold your mouth right when you're doing it
Lemme know if all that helps, eh?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 13, 2007 at 6:52 am
Jeff: Now that is a heck of an answer. Awsome stuff. I've paid for answers that were not half that good Thanks I learned a few things today, and copied the article/code to my "keep for future" reference script directory.
James.
June 13, 2007 at 7:26 am
June 13, 2007 at 8:25 am
Can somebody please use the tables and columns that I have provided in the original post?
I need to sum orderID grouped by order channel to know the number of orders per orderchannel
I also need the life of the order, which I use datediff(days,orders.orderdate,getdate())
There are many Orderchannels To a Client
Then I want to show the breakdown of the client by orderchannels the mode, mean, median for Number of Orders and Days ordered.
I do not know how many rows are in the result set, even or odd.
Please use the tables I offered in the original post, I am sure I provided sufficient information.
June 13, 2007 at 8:55 am
Jeff did as close to that as possible unless you post your create table statements so we can have the tablenames, column names, datatypes and indexes.
Take Jeff's example and sub in your field names and table names just where he showed you.
somenumber = the value you want the median for
somestring = order channel id
someint = client id
He didn't just teach you to fish, he handed you a top of the line rod and a fish finder.
June 13, 2007 at 9:34 am
Besides, Wutang... you don't want us to have ALL the fun, do ya?
MrPoleCat and JLK... thank you guys for the awesome compliments. Great way to start the day!
--Jeff Moden
Change is inevitable... Change for the better is not.
June 13, 2007 at 9:55 am
I am going to take in all the information and try to soak up my little brain.
I am very pleased at the over all responses, response times, as well as the kindness I recieved from my post.
This was my first time and def not my last. I see myself being much more active here so hopefully I can learn a thing or two.
Thanks again all, you made my first time a great experience, hope to see you guys around...... Now I know if i put the batsignal out you guys will come to my rescue!
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply