Really Challenging Query: Break out average value across months

Question

Really Challenging Query: Break out average value across months

clary.david

Old Hand

Points: 303
More actions
April 25, 2013 at 11:47 am

#271871

Racking my brain on this one.
I have the following table
AccountID DataDate DataValue
1 1/1/2012 100
1 2/5/2012 190
1 3/18/2012 118
1 6/5/2012 490
1 6/29/2012 89
2 1/8/2012 120
2 3/5/2012 218
etc
The datadata is the end of a period -- so the datavalue for a record refers to a value that has been accumulated -- not a forward projection of value owed, if that makes sense.
I need to be able to figure out and generate a by-month average of the datavalue.
So, for instance, for the second row: 190 units, spread across a date range from 1/1/2012 to 2/5/2012 would be 36 days -- 5 of them in February. This I can do -- when the gap between reads is less than 2 months, it's easy. # of days in first month divided by total number of days in date range * value, then # of days in second month divided by total number of days in date range * value.
What I need is to be able to do, though, is handle the instances where there's a multi-month gap.
How can I take rows 4 and 5, and return a prorated average across march, may, and june?
Looking at my dataset (55 million rows from 1/1/2011 to current), I've got 2000+ instances where the data gap is 13 months. So it would have to be able to handle multi-year scenarios as well.
Anyone know a good math formula to calculate this? It's like statistical analysis or something, but I'm just a DBA, not a mathemagician.
How can I solve this easily, without a really ugly, ugly cursor and other scary methods?

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Jim Mackenzie Hall of Fame Points: 3728 More actions · Answer 1

Can you firstly provide us with the set up of your queries, some test data like the below and anything you're already done?

SELECT 1 [DataId], 1 [AccountId], '2012-01-01' [DataDate], 100 [DataValue] UNION ALL

SELECT 2,1,'2012-05-02',190 UNION ALL

SELECT 3,1,'2012-03-18',118 UNION ALL

SELECT 4,1,'2012-06-05',490 UNION ALL

SELECT 5,1,'2012-06-29',89 UNION ALL

SELECT 6,2,'2012-01-08',120 UNION ALL

SELECT 7,2,'2012-03-05'

Secondly, you're talking about rows 4 and 5 like they are in different months? You've told us that each day is the end of a period, but what are the periods? Because some of the days are Sundays and some are Fridays.

Jim..

SQL SERVER Central Forum Etiquette[/url]

Geoff A SSChampion Points: 11511 More actions · Answer 2

you would need another table for a join that contained the first date of every month.

I don't think i have your answer, (its a bit tough to understand what you need) but here is a sample of code that might get your creative juices flowing;

declare @table table (AccountID int, DataDate datetime, DataValue int)

insert into @table VALUES (1, '01/01/2012',100),

(1, '02/05/2012', 190),

(1, '03/18/2012', 118),

(1, '06/5/2012', 490),

(1, '06/29/2012', 89),

(2, '01/8/2012', 120),

(2, '03/5/2012', 218)

declare @CalendarMonths table ([date] DATETIME)

DECLARE

@basedate DATETIME,

@offset INT

SELECT

@basedate = '01/01/2000',

@offset = 1

WHILE (@offset < 200)

BEGIN

INSERT INTO @CalendarMonths

SELECT DATEADD(MONTH, @offset, @basedate)

set @offset = @offset + 1

END

select * from @table

select T1.AccountID, T1.DataDate, datediff(day, T2.date, T1.DataDate) as [Diff], T1.DataValue

from @table T1

join @CalendarMonths T2 on

datepart(month,T1.DataDate) = datepart(month,T2.date)

and datepart(year,T1.DataDate) = datepart(year,T2.date)

order by T1.AccountID, T1.DataDate

clary.david Old Hand Points: 303 More actions · Answer 3

Thanks Jimbo.

Here's the basic table: it's for calculating a meter's usage over time.

meterid (varchar(12)) read date (datetime) usage (smallint)

472641501 2011-02-15 00:00:00.00077

472641501 2011-03-18 00:00:00.00079

472641501 2012-02-15 00:00:00.000190

472641501 2012-04-18 00:00:00.000120

472641501 2012-05-17 00:00:00.00020

472641501 2012-06-19 00:00:00.0000

472641501 2013-02-14 00:00:00.000164

472641501 2013-03-15 00:00:00.000160

I'm tasked with delivering a result set that is something like this:

meterid usagemonth usageyear avgusagepermonth

472641501 2 2011 78

472641501 3 2011 ??

472641501 4 2011 ??

474641501 5....

etc.

end of a period indicates end of read cycle. usually every 4 weeks, so there can be 2 inside the same month, but other times, for reasons I don't understand, it won't be read for months at a time.

clary.david Old Hand Points: 303 More actions · Answer 4

Thanks, Geoff. That's what I'm exploring now -- setting up a months table to join against. I had wanted to avoid that, but I'm not certain that I had a good reason why.

Eddie Wuerch SSChampion Points: 12535 More actions · Answer 5

Using a self-join, you can establish what a daily average rate for each entry is. Once you've gotten that, each entry will have a start and an end date.

From there, you can fold in an auxialliary table of values (a.k.a. a Nums table or tally table - just a list of numbers - Mine's called dbo.Nums, which has only one integer column, [n]) so there will be a row for each day that an original entry represents.

Once you have that, just roll it up by AcctID, Year, and Month. In the code below, a excluded the start date from the number of days used to calculate the average (only count any day as being part of one set of data). The below query doesn't need as many nested queries as shown, but I wrote it that way to show the steps.

-- prep sample data

CREATE TABLE #Ranges(AcctID int NOT NULL, DataDate date NOT NULL,

DataValue int NOT NULL,

PRIMARY KEY(AcctID, DataDate));

INSERT #Ranges(AcctID, DataDate, DataValue)

VALUES (1, '1/1/2012', 100),

(1, '2/5/2012', 190),

(1, '3/18/2012', 118),

(1, '6/5/2012', 490),

(1, '6/29/2012', 89),

(2, '1/8/2012', 120),

(2, '3/5/2012', 218);

GO

-- Actual query. See notes from inner to outer queries.

-- Could be collapsed into fewer queries,

-- but I split it out to show the steps

WITH Ranges AS

(SELECT ROW_NUMBER() OVER (PARTITION BY AcctID ORDER BY DataDate) AS rn,

AcctID, DataDate, DataValue

FROM #Ranges)

-- 3. Roll it all up by month and year

SELECT AcctID, [DataYear], [DataMonth], sum(DailyValue) AS [MonthTotal]

FROM (

-- 2. Add in the aux table of values to get a separate row for each

-- specific day with the value for that day

SELECT c.*, dateadd(d, 0, n) AS CalcDate,

year(dateadd(d, 0, n)) AS [DataYear],

month(dateadd(d, 0, n)) AS [DataMonth]

FROM (

-- 1. This query establishes the start and end date of each

-- range, along with the daily average

SELECT a.AcctID, a.DataDate AS [StartDate], b.DataDate AS [EndDate],

datediff(d, a.DataDate, b.DataDate) AS [DateRange],

b.DataValue,

b.DataValue/(1.0 * datediff(d, a.DataDate, b.DataDate)) [DailyValue],

datediff(d, 0, a.DataDate) AS StartDayNum,

datediff(d, 0, b.DataDate) AS [EndDayNum]

FROM Ranges a INNER JOIN

Ranges b ON a.AcctID = b.AcctID AND a.rn = b.rn - 1

) c INNER JOIN

dbo.Nums n ON n BETWEEN c.StartDayNum + 1 AND c.EndDayNum

) d

GROUP BY AcctID, DataYear, DataMonth

ORDER BY AcctID, DataYear, DataMonth

Eddie Wuerch
MCM: SQL

clary.david Old Hand Points: 303 More actions · Answer 6

Eddie, this looks to be exactly what I'm looking for -- when I run your query as written, though, I get no results returned. Are you missing a step in there somewhere? Am I?

clary.david Old Hand Points: 303 More actions · Answer 7

Never mind. Noob mistake. My tally table was too small.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

-- Expanding the date range out by months rather than days might be 30x cheaper

;WITH OrderedData AS (

SELECT *,

rn = ROW_NUMBER() OVER(PARTITION BY AccountID ORDER BY DataDate DESC)

FROM #Sample

)

SELECT

a.AccountID,

[Year]= YEAR(x.SliceStart),

[Month]= MONTH(x.SliceStart),

MonthTotal = SUM((z.DaysCovered*1.0/z.TotalDaysCovered) * b.DataValue)

FROM OrderedData a

INNER JOIN OrderedData b ON b.AccountID = a.AccountID AND b.rn = a.rn-1

CROSS APPLY ( -- row generator will cope with a maximum gap of 12 months, add rows for bigger gaps

SELECT TOP(1+DATEDIFF(MONTH, a.DataDate, b.DataDate)) n

FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)) d (n)

ORDER BY n

) t

CROSS APPLY (

SELECT SliceStart = CASE WHEN t.n = 0 THEN DATEADD(DAY,1,a.DataDate)

ELSE DATEADD(MONTH,t.n+DATEDIFF(MONTH,0,a.DataDate),0) END

) x

CROSS APPLY (

SELECT SliceEnd = CASE WHEN DATEDIFF(MONTH, SliceStart, b.DataDate) > 0

THEN DATEADD(DAY,-1,DATEADD(MONTH,t.n+1+DATEDIFF(MONTH,0,a.DataDate),0))

ELSE b.DataDate END

) y

CROSS APPLY (

SELECT DaysCovered = 1+DATEDIFF(DAY,SliceStart,SliceEnd),

TotalDaysCovered = DATEDIFF(DAY,a.DataDate,b.DataDate)

) z

GROUP BY a.AccountID, YEAR(x.SliceStart), MONTH(x.SliceStart)

ORDER BY a.AccountID, YEAR(x.SliceStart), MONTH(x.SliceStart)

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden