Really Challenging Query: Break out average value across months

  • Racking my brain on this one.

    I have the following table

    AccountID DataDate DataValue

    1 1/1/2012 100

    1 2/5/2012 190

    1 3/18/2012 118

    1 6/5/2012 490

    1 6/29/2012 89

    2 1/8/2012 120

    2 3/5/2012 218

    etc

    The datadata is the end of a period -- so the datavalue for a record refers to a value that has been accumulated -- not a forward projection of value owed, if that makes sense.

    I need to be able to figure out and generate a by-month average of the datavalue.

    So, for instance, for the second row: 190 units, spread across a date range from 1/1/2012 to 2/5/2012 would be 36 days -- 5 of them in February. This I can do -- when the gap between reads is less than 2 months, it's easy. # of days in first month divided by total number of days in date range * value, then # of days in second month divided by total number of days in date range * value.

    What I need is to be able to do, though, is handle the instances where there's a multi-month gap.

    How can I take rows 4 and 5, and return a prorated average across march, may, and june?

    Looking at my dataset (55 million rows from 1/1/2011 to current), I've got 2000+ instances where the data gap is 13 months. So it would have to be able to handle multi-year scenarios as well.

    Anyone know a good math formula to calculate this? It's like statistical analysis or something, but I'm just a DBA, not a mathemagician.

    How can I solve this easily, without a really ugly, ugly cursor and other scary methods?

  • Can you firstly provide us with the set up of your queries, some test data like the below and anything you're already done?

    SELECT 1 [DataId], 1 [AccountId], '2012-01-01' [DataDate], 100 [DataValue] UNION ALL

    SELECT 2,1,'2012-05-02',190 UNION ALL

    SELECT 3,1,'2012-03-18',118 UNION ALL

    SELECT 4,1,'2012-06-05',490 UNION ALL

    SELECT 5,1,'2012-06-29',89 UNION ALL

    SELECT 6,2,'2012-01-08',120 UNION ALL

    SELECT 7,2,'2012-03-05'

    Secondly, you're talking about rows 4 and 5 like they are in different months? You've told us that each day is the end of a period, but what are the periods? Because some of the days are Sundays and some are Fridays.

    Jim..

    SQL SERVER Central Forum Etiquette[/url]

  • you would need another table for a join that contained the first date of every month.

    I don't think i have your answer, (its a bit tough to understand what you need) but here is a sample of code that might get your creative juices flowing;

    declare @table table (AccountID int, DataDate datetime, DataValue int)

    insert into @table VALUES (1, '01/01/2012',100),

    (1, '02/05/2012', 190),

    (1, '03/18/2012', 118),

    (1, '06/5/2012', 490),

    (1, '06/29/2012', 89),

    (2, '01/8/2012', 120),

    (2, '03/5/2012', 218)

    declare @CalendarMonths table ([date] DATETIME)

    DECLARE

    @basedate DATETIME,

    @offset INT

    SELECT

    @basedate = '01/01/2000',

    @offset = 1

    WHILE (@offset < 200)

    BEGIN

    INSERT INTO @CalendarMonths

    SELECT DATEADD(MONTH, @offset, @basedate)

    set @offset = @offset + 1

    END

    select * from @table

    select T1.AccountID, T1.DataDate, datediff(day, T2.date, T1.DataDate) as [Diff], T1.DataValue

    from @table T1

    join @CalendarMonths T2 on

    datepart(month,T1.DataDate) = datepart(month,T2.date)

    and datepart(year,T1.DataDate) = datepart(year,T2.date)

    order by T1.AccountID, T1.DataDate

  • Thanks Jimbo.

    Here's the basic table: it's for calculating a meter's usage over time.

    meterid (varchar(12)) read date (datetime) usage (smallint)

    472641501 2011-02-15 00:00:00.00077

    472641501 2011-03-18 00:00:00.00079

    472641501 2012-02-15 00:00:00.000190

    472641501 2012-04-18 00:00:00.000120

    472641501 2012-05-17 00:00:00.00020

    472641501 2012-06-19 00:00:00.0000

    472641501 2013-02-14 00:00:00.000164

    472641501 2013-03-15 00:00:00.000160

    I'm tasked with delivering a result set that is something like this:

    meterid usagemonth usageyear avgusagepermonth

    472641501 2 2011 78

    472641501 3 2011 ??

    472641501 4 2011 ??

    474641501 5....

    etc.

    end of a period indicates end of read cycle. usually every 4 weeks, so there can be 2 inside the same month, but other times, for reasons I don't understand, it won't be read for months at a time.

  • Thanks, Geoff. That's what I'm exploring now -- setting up a months table to join against. I had wanted to avoid that, but I'm not certain that I had a good reason why.

  • Using a self-join, you can establish what a daily average rate for each entry is. Once you've gotten that, each entry will have a start and an end date.

    From there, you can fold in an auxialliary table of values (a.k.a. a Nums table or tally table - just a list of numbers - Mine's called dbo.Nums, which has only one integer column, [n]) so there will be a row for each day that an original entry represents.

    Once you have that, just roll it up by AcctID, Year, and Month. In the code below, a excluded the start date from the number of days used to calculate the average (only count any day as being part of one set of data). The below query doesn't need as many nested queries as shown, but I wrote it that way to show the steps.

    -- prep sample data

    CREATE TABLE #Ranges(AcctID int NOT NULL, DataDate date NOT NULL,

    DataValue int NOT NULL,

    PRIMARY KEY(AcctID, DataDate));

    INSERT #Ranges(AcctID, DataDate, DataValue)

    VALUES (1, '1/1/2012', 100),

    (1, '2/5/2012', 190),

    (1, '3/18/2012', 118),

    (1, '6/5/2012', 490),

    (1, '6/29/2012', 89),

    (2, '1/8/2012', 120),

    (2, '3/5/2012', 218);

    GO

    -- Actual query. See notes from inner to outer queries.

    -- Could be collapsed into fewer queries,

    -- but I split it out to show the steps

    WITH Ranges AS

    (SELECT ROW_NUMBER() OVER (PARTITION BY AcctID ORDER BY DataDate) AS rn,

    AcctID, DataDate, DataValue

    FROM #Ranges)

    -- 3. Roll it all up by month and year

    SELECT AcctID, [DataYear], [DataMonth], sum(DailyValue) AS [MonthTotal]

    FROM (

    -- 2. Add in the aux table of values to get a separate row for each

    -- specific day with the value for that day

    SELECT c.*, dateadd(d, 0, n) AS CalcDate,

    year(dateadd(d, 0, n)) AS [DataYear],

    month(dateadd(d, 0, n)) AS [DataMonth]

    FROM (

    -- 1. This query establishes the start and end date of each

    -- range, along with the daily average

    SELECT a.AcctID, a.DataDate AS [StartDate], b.DataDate AS [EndDate],

    datediff(d, a.DataDate, b.DataDate) AS [DateRange],

    b.DataValue,

    b.DataValue/(1.0 * datediff(d, a.DataDate, b.DataDate)) [DailyValue],

    datediff(d, 0, a.DataDate) AS StartDayNum,

    datediff(d, 0, b.DataDate) AS [EndDayNum]

    FROM Ranges a INNER JOIN

    Ranges b ON a.AcctID = b.AcctID AND a.rn = b.rn - 1

    ) c INNER JOIN

    dbo.Nums n ON n BETWEEN c.StartDayNum + 1 AND c.EndDayNum

    ) d

    GROUP BY AcctID, DataYear, DataMonth

    ORDER BY AcctID, DataYear, DataMonth

    Eddie Wuerch
    MCM: SQL

  • Eddie, this looks to be exactly what I'm looking for -- when I run your query as written, though, I get no results returned. Are you missing a step in there somewhere? Am I?

  • Never mind. Noob mistake. My tally table was too small.

  • -- Expanding the date range out by months rather than days might be 30x cheaper

    ;WITH OrderedData AS (

    SELECT *,

    rn = ROW_NUMBER() OVER(PARTITION BY AccountID ORDER BY DataDate DESC)

    FROM #Sample

    )

    SELECT

    a.AccountID,

    [Year]= YEAR(x.SliceStart),

    [Month]= MONTH(x.SliceStart),

    MonthTotal = SUM((z.DaysCovered*1.0/z.TotalDaysCovered) * b.DataValue)

    FROM OrderedData a

    INNER JOIN OrderedData b ON b.AccountID = a.AccountID AND b.rn = a.rn-1

    CROSS APPLY ( -- row generator will cope with a maximum gap of 12 months, add rows for bigger gaps

    SELECT TOP(1+DATEDIFF(MONTH, a.DataDate, b.DataDate)) n

    FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)) d (n)

    ORDER BY n

    ) t

    CROSS APPLY (

    SELECT SliceStart = CASE WHEN t.n = 0 THEN DATEADD(DAY,1,a.DataDate)

    ELSE DATEADD(MONTH,t.n+DATEDIFF(MONTH,0,a.DataDate),0) END

    ) x

    CROSS APPLY (

    SELECT SliceEnd = CASE WHEN DATEDIFF(MONTH, SliceStart, b.DataDate) > 0

    THEN DATEADD(DAY,-1,DATEADD(MONTH,t.n+1+DATEDIFF(MONTH,0,a.DataDate),0))

    ELSE b.DataDate END

    ) y

    CROSS APPLY (

    SELECT DaysCovered = 1+DATEDIFF(DAY,SliceStart,SliceEnd),

    TotalDaysCovered = DATEDIFF(DAY,a.DataDate,b.DataDate)

    ) z

    GROUP BY a.AccountID, YEAR(x.SliceStart), MONTH(x.SliceStart)

    ORDER BY a.AccountID, YEAR(x.SliceStart), MONTH(x.SliceStart)

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply