Populate new record data based on previous month's record data

  • Hi SQL gurus,

    I have a fact table that looks like the one below. Currently it has two quarters of data loaded, June (period 06) and September (period 09). In some cases there is data captured in period 06 but not in period 09 for a particular account. Please see the example below:

    [Original Fact table]

    Account | Organization | Period | Year | Amount

    -------------------------------------------------

    CMP1000 | PM1 | 06 | 2011 | 100

    CMP1001 | PM1 | 06 | 2011 | 250

    CMP1000 | PM1 | 09 | 2011 | 400

    CMP1002 | PM1 | 09 | 2011 | 500

    If there is no data in period 09 but there is data in period 06 for a particular account, I would like to insert a new record so that period 09 takes on the period 06 data. Please see my desired output below:

    [Derived Fact table]

    Account | Organization | Period | Year | Amount

    -------------------------------------------------

    CMP1000 | PM1 | 06 | 2011 | 100

    CMP1001 | PM1 | 06 | 2011 | 250

    CMP1000 | PM1 | 09 | 2011 | 400

    CMP1002 | PM1 | 09 | 2011 | 500

    CMP1001 | PM1 | 09 | 2011 | 250

    I would appreciate it if any gurus out there could assist me with this problem. Thanks!

  • Hi,

    This is possible using the SQL MERGE statement. You can do something like this:

    --Test Data
    declare @SampleTable as table(
        Account varchar(10)
        ,Organization varchar(20)
        ,Period varchar(2)
        ,[Year] int
        ,Amount money
    )

    insert into @SampleTable
    (
        Account,
        Organization,
        Period,
        [Year],
        Amount
    )
    select 'CMP1000','PM1','06',2011,100 union
    select 'CMP1001','PM1','06',2011,250 union
    select 'CMP1000','PM1','09',2011,400 union
    select 'CMP1002','PM1','09',2011,500

    -----------------------------------------------------

    ;with cte as(
        select
            Account,
            Organization,
            Period,
            [Year],
            Amount
        from @SampleTable
        where Period = '06'
        --<required_condition>
    )
    merge @SampleTable as T
    using (
        select
            Account,
            Organization,
            Period,
            [Year],
            Amount
        from cte
    ) as S on (T.Account = S.Account and T.Period = '09')
    when not matched then
        insert(Account, Organization, Period, [Year], Amount)
        values(S.Account, S.Organization, '09', S.[Year], S.Amount); -- the new row goes into the target period (09), not 06

    select * from @SampleTable order by Period asc

    Though I have used static values for the periods, you can pass them in as parameters. Hope this is what you want. 🙂

    ** MERGE is only available on SQL Server 2008 and higher. If you are using an earlier version (2000 or 2005), this will not work; you would have to use a join-based INSERT instead.
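    For anyone on an earlier version, a minimal sketch of that join-based alternative against the @SampleTable above might look like this (the '06'/'09' literals are just the example periods, as in the MERGE):

    -- Pre-2008 alternative: copy period 06 rows into period 09 for accounts
    -- that have no period 09 row yet, using a plain INSERT ... SELECT with a join.
    insert into @SampleTable (Account, Organization, Period, [Year], Amount)
    select s.Account, s.Organization, '09', s.[Year], s.Amount
    from @SampleTable as s
    left join @SampleTable as t
        on t.Account = s.Account and t.Period = '09'
    where s.Period = '06'
        and t.Account is null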

    --------
    Manjuke
    http://www.manjuke.com

  • Hi manjuke,

    Thanks for the reply, but your solution doesn't seem very flexible, as it only compares periods 06 and 09. My actual fact table contains periods 01 to 12. It would be ideal if your script could compare each period to period-1 and then populate a new record.

    Thanks.

  • Let's make sure I have this right. You have an "OriginalFact" table and you want to populate data into a "DerivedFact" table, using the "prior" month's data if "this" month's data is missing, without affecting the data in the "OriginalFact" table. Is that right?

    Does the data wrap years, meaning that the "prior" month to January is December?

  • Hi fahey.jonathan,

    Let's make sure I have this right. You have an "OriginalFact" table and you want to populate data into a "DerivedFact" table, using the "prior" month's data if "this" month's data is missing, without affecting the data in the "OriginalFact" table. Is that right?

    Yes

    Does the data wrap years, meaning that the "prior" month to January is December?

    Yes. For example, the prior month of January 2011 is December 2010.

    CREATE -- DROP
    TABLE OriginalFact
    (
        Account      VARCHAR(10) NOT NULL,
        Organization VARCHAR(10) NOT NULL,
        [Year]       INTEGER     NOT NULL,
        Period       SMALLINT    NOT NULL,
        Amount       INTEGER     NOT NULL
    )

    CREATE -- DROP
    TABLE DerivedFact
    (
        Account      VARCHAR(10) NOT NULL,
        Organization VARCHAR(10) NOT NULL,
        [Year]       INTEGER     NOT NULL,
        Period       SMALLINT    NOT NULL,
        Amount       INTEGER     NOT NULL,
        TypeCode     CHAR(1)     NOT NULL CHECK (TypeCode IN ('C', 'P'))
    )

    -- Sample data for the OriginalFact table.
    INSERT
    INTO   OriginalFact
           (Account, Organization, [Year], Period, Amount)
    SELECT Account, Organization, [Year], Period, Amount
    FROM   (VALUES
           ('CMP1000', 'PM1', 2011, 11, 100),
           ('CMP1001', 'PM1', 2011, 11, 101),
           ('CMP1002', 'PM1', 2011, 11, 102),
           ('CMP1003', 'PM1', 2011, 11, 103),
           ('CMP1001', 'PM1', 2011, 12, 111),
           ('CMP1002', 'PM1', 2011, 12, 112),
           ('CMP1003', 'PM1', 2011, 12, 113),
           ('CMP1002', 'PM1', 2012, 01, 122),
           ('CMP1003', 'PM1', 2012, 01, 123)
           ) x (Account, Organization, [Year], Period, Amount)

    -- Set the variables for the year and period desired.
    -- Run this section for each month.
    DECLARE @Year   INTEGER  = 2011
    DECLARE @Period SMALLINT = 11

    INSERT
    INTO   DerivedFact
           (Account, Organization, [Year], Period, Amount, TypeCode)
    SELECT ISNULL(c.Account, p.Account)           AS Account,
           ISNULL(c.Organization, p.Organization) AS Organization,
           @Year                                  AS [Year],
           @Period                                AS Period,
           ISNULL(c.Amount, p.Amount)             AS Amount,
           CASE WHEN c.Account IS NULL THEN 'P' ELSE 'C' END -- Either "C"urrent value or "P"rior value.
    FROM   (-- Get records from the Current month from the OriginalFact table.
            SELECT Account, Organization, [Year], Period, Amount
            FROM   OriginalFact
            WHERE  [Year] = @Year
            AND    Period = @Period
           ) c -- Current
           FULL JOIN
           (-- Get records from the Prior month.
            SELECT Account, Organization, [Year], Period, Amount
            --FROM OriginalFact -- Use OriginalFact if values missing last month should NOT be included this month. See Notes **
            FROM   DerivedFact  -- Use DerivedFact if values missing last month SHOULD be included this month.
            WHERE  [Year] = CASE WHEN @Period = 1 THEN @Year - 1 ELSE @Year END
            AND    Period = CASE WHEN @Period = 1 THEN 12 ELSE @Period - 1 END
           ) p -- Prior
           ON  p.Account = c.Account
           AND p.Organization = c.Organization

    SELECT *
    FROM   DerivedFact
    ORDER BY Account,
             Organization,
             [Year],
             Period

    /* Notes
    ** CMP1000 was missing for both 2011/12 and 2012/01. If the values from 2011/11 should be
       rolled forward into 2011/12 and again into 2012/01, then use DerivedFact because it will
       contain records for CMP1000 - CMP1003. If CMP1000 should not be included in 2012/01 because
       it was not in 2011/12 or in 2012/01, then use OriginalFact because it will contain only
       CMP1001 - CMP1003 for the prior month.
    */

  • Hi fahey.jonathan,

    Based on your sample data here:

    ('CMP1000', 'PM1', 2011, 11, 100),
    ('CMP1001', 'PM1', 2011, 11, 101),
    ('CMP1002', 'PM1', 2011, 11, 102),
    ('CMP1003', 'PM1', 2011, 11, 103),
    ('CMP1001', 'PM1', 2011, 12, 111),
    ('CMP1002', 'PM1', 2011, 12, 112),
    ('CMP1003', 'PM1', 2011, 12, 113),
    ('CMP1002', 'PM1', 2012, 01, 122),
    ('CMP1003', 'PM1', 2012, 01, 123)

    The desired output should be:

    ('CMP1000', 'PM1', 2011, 11, 100), -- Row A
    ('CMP1001', 'PM1', 2011, 11, 101), -- Row B
    ('CMP1002', 'PM1', 2011, 11, 102),
    ('CMP1003', 'PM1', 2011, 11, 103),
    ('CMP1001', 'PM1', 2011, 12, 111),
    ('CMP1002', 'PM1', 2011, 12, 112),
    ('CMP1003', 'PM1', 2011, 12, 113),
    ('CMP1000', 'PM1', 2011, 12, 100), -- Copy from Row A
    ('CMP1002', 'PM1', 2012, 01, 122),
    ('CMP1003', 'PM1', 2012, 01, 123),
    ('CMP1000', 'PM1', 2012, 01, 100), -- Copy from Row A
    ('CMP1001', 'PM1', 2012, 01, 101)  -- Copy from Row B

    Originally, the DerivedFact table does not exist in the database. I am only translating the original fact table into a derived view, which is what the DerivedFact table represents here. Sorry for the confusion.

    Thanks.

  • Two questions:

    1 - Are you saying that DerivedFact is a view, not a table?

    2 - Why is CMP1001 copied from two months prior (2011/11) rather than one month prior (2011/12)? You have the 2011/11 row labeled as Row B; I expected the 2011/12 version to be labeled as Row B. I thought the goal was to bring the prior month forward if it existed, which for 2012/01 would be 2011/12.

  • Hi,

    Apologies for my mistake. The output of the view should look like this:

    ('CMP1000', 'PM1', 2011, 11, 100),
    ('CMP1001', 'PM1', 2011, 11, 101),
    ('CMP1002', 'PM1', 2011, 11, 102),
    ('CMP1003', 'PM1', 2011, 11, 103),
    ('CMP1001', 'PM1', 2011, 12, 111),
    ('CMP1002', 'PM1', 2011, 12, 112),
    ('CMP1003', 'PM1', 2011, 12, 113),
    ('CMP1000', 'PM1', 2011, 12, 100),
    ('CMP1002', 'PM1', 2012, 01, 122),
    ('CMP1003', 'PM1', 2012, 01, 123),
    ('CMP1000', 'PM1', 2012, 01, 100),
    ('CMP1001', 'PM1', 2012, 01, 111)

    The goal was to always bring the prior month forward if it existed. Sorry for my mistake.

    DerivedFact is a view, not a table, because DerivedFact does not originally exist among the database tables.

    Thanks.

    I'm going to suggest a different strategy because of the complexity of the requirement. You need to bring forward each row from the prior month, meaning that if a row is missing for three months, you either need a recursive query to roll it forward one month at a time, or a cross join to get the last prior available record. Both of those are going to be fairly expensive, especially with a large data set, and if you will be using this data with some frequency, running those queries every time will probably be taxing on the server.
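    For illustration only, a recursive roll-forward over an OriginalFact-style table might look like the sketch below (the CTE and column names are my own assumptions); the intermediate row counts grow quickly, which is exactly the cost concern above:

    -- Sketch: roll every fact forward one month at a time up to the latest
    -- loaded period, then keep the most recently sourced row per month.
    DECLARE @MaxPeriodStart DATETIME

    SELECT @MaxPeriodStart = MAX(DATEADD(MONTH, (([Year] - 1900) * 12) + Period - 1, 0))
    FROM   OriginalFact

    ;WITH Facts AS
    (
        SELECT Account, Organization, Amount,
               DATEADD(MONTH, (([Year] - 1900) * 12) + Period - 1, 0) AS PeriodStart
        FROM   OriginalFact
    ),
    RolledForward AS
    (
        -- Anchor: every original fact, remembering which month it came from.
        SELECT Account, Organization, Amount, PeriodStart,
               PeriodStart AS SourcePeriodStart
        FROM   Facts
        UNION ALL
        -- Recursive step: copy each row into the following month, up to the latest period.
        SELECT Account, Organization, Amount,
               DATEADD(MONTH, 1, PeriodStart), SourcePeriodStart
        FROM   RolledForward
        WHERE  DATEADD(MONTH, 1, PeriodStart) <= @MaxPeriodStart
    )
    SELECT Account, Organization,
           YEAR(PeriodStart) AS [Year], MONTH(PeriodStart) AS Period, Amount
    FROM   (-- An original row always beats a rolled-forward copy for the same month.
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY Account, Organization, PeriodStart
                                      ORDER BY SourcePeriodStart DESC) AS rn
            FROM   RolledForward
           ) AS x
    WHERE  rn = 1
    ORDER BY Account, Organization, PeriodStart
    OPTION (MAXRECURSION 0) -- allow spans longer than the default 100 months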

    I recommend that you add a column to your fact table that says whether the record is a base fact or a derived fact. Once a month when you add data, insert the base facts marked with a B ("base") or F ("Fact"), etc. Then insert the derived facts from the prior month where applicable, marked with a D ("Derived"), etc. This way, when using the fact table, there are no calculations that need to be done and your derived facts are indexed along with the base facts, so queries should be fast.

    If you need a version of the table that contains only base facts, create a view for it by selecting only records with a B (or F, etc).
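    A rough sketch of that monthly load, assuming a fact table with a TypeCode column ('B' = base, 'D' = derived) and @Year/@Period variables for the month just loaded (all names here are illustrative, not an actual schema):

    -- Assumed monthly load. Base facts for @Year/@Period are loaded first with
    -- TypeCode = 'B'; this step then rolls forward last month's rows (base or
    -- derived) that have no row in the current month, marking them 'D'.
    DECLARE @Year   INTEGER  = 2012
    DECLARE @Period SMALLINT = 1

    INSERT INTO FactTable (Account, Organization, [Year], Period, Amount, TypeCode)
    SELECT p.Account, p.Organization, @Year, @Period, p.Amount, 'D'
    FROM   FactTable AS p
    WHERE  p.[Year] = CASE WHEN @Period = 1 THEN @Year - 1 ELSE @Year END
    AND    p.Period = CASE WHEN @Period = 1 THEN 12 ELSE @Period - 1 END
    AND    NOT EXISTS (SELECT 1
                       FROM   FactTable AS c
                       WHERE  c.[Year] = @Year
                       AND    c.Period = @Period
                       AND    c.Account = p.Account
                       AND    c.Organization = p.Organization)
    GO

    -- Base-facts-only view, if one is needed.
    CREATE VIEW BaseFacts
    AS
    SELECT Account, Organization, [Year], Period, Amount
    FROM   FactTable
    WHERE  TypeCode = 'B'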

  • How many rows are in the base fact table? What will the growth rate be? What is the maximum number of periods that will be stored in the fact table?

  • Hi JonFox,

    Based on the two periods I have loaded into the fact table, I estimate that it has around 150,000+ records. I would say that the database grows quite substantially every month.

    Thanks.

  • How wedded are you to the schema of your base fact table? Is this a work in progress that you can make some changes to, or is it already in production? The reason I ask is because I think there are some changes you might want to consider. When designing a dimensional schema, data such as the period and year would normally be a reference to a date dimension's surrogate key, rather than being stored directly in the fact table. This would apply to account/organization info as well. I would definitely recommend reading one of the Kimball Group books such as The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling in order to really get a handle on dimensional modeling techniques.
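    As a rough illustration of that point (the table and column names below are assumptions, not anything from this thread), the year/period and the account/organization would each become keys into dimension tables:

    -- Illustrative dimensional layout only.
    CREATE TABLE DimDate
    (
        DateKey     INT      NOT NULL PRIMARY KEY, -- surrogate key for the period, e.g. 20111101
        PeriodStart DATE     NOT NULL,
        [Year]      INT      NOT NULL,
        Period      SMALLINT NOT NULL
    )

    CREATE TABLE FactAmount
    (
        DateKey         INT NOT NULL REFERENCES DimDate (DateKey),
        AccountKey      INT NOT NULL, -- would reference a DimAccount dimension
        OrganizationKey INT NOT NULL, -- would reference a DimOrganization dimension
        Amount          INT NOT NULL
    )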

    While playing around with the sample data, I ended up adding a new DATE column to the OriginalFact table, to store the year/period in a way that is easier to index and compare against. Again, I would really recommend that this field actually be an attribute on a Date dimension, but this is a start. Since we have to sort and filter by year/period in order to get the most recent Amount for each Account/Date, this gives us a good value that can be indexed.

    Here's a potential solution using CROSS APPLY:

    -- If the temp table with sample data already exists, drop it
    IF OBJECT_ID('tempdb..#OriginalFact') IS NOT NULL DROP TABLE #OriginalFact;

    -- Create temp table to hold sample data
    CREATE TABLE #OriginalFact
    (
        Account      VARCHAR(10) NOT NULL,
        Organization VARCHAR(10) NOT NULL,
        [Year]       INTEGER     NOT NULL,
        Period       SMALLINT    NOT NULL,
        Amount       INTEGER     NOT NULL,
        PeriodStart  DATE
    )

    -- In order for the technique below to be anywhere close to efficient, we must have a good index in place!
    CREATE INDEX Account_Organization_PeriodStart ON #OriginalFact (Account, Organization, PeriodStart DESC) INCLUDE (Amount)

    -- Load sample data
    INSERT
    INTO   #OriginalFact
           (Account, Organization, [Year], Period, Amount)
    SELECT Account, Organization, [Year], Period, Amount
    FROM   (VALUES
           ('CMP1000', 'PM1', 2011, 11, 100),
           ('CMP1001', 'PM1', 2011, 11, 101),
           ('CMP1002', 'PM1', 2011, 11, 102),
           ('CMP1003', 'PM1', 2011, 11, 103),
           ('CMP1001', 'PM1', 2011, 12, 111),
           ('CMP1002', 'PM1', 2011, 12, 112),
           ('CMP1003', 'PM1', 2011, 12, 113),
           ('CMP1002', 'PM1', 2012, 01, 122),
           ('CMP1003', 'PM1', 2012, 01, 123)
           ) x (Account, Organization, [Year], Period, Amount)

    -- Set the value for the PeriodStart date
    UPDATE #OriginalFact SET PeriodStart = DATEADD(month, (([Year] - 1900) * 12) + Period - 1, 0)

    -- Common Table Expression using a CROSS JOIN to generate a set containing all
    -- Accounts/Organizations for each Year/Period. Without doing this, we would
    -- wind up with "holes" in the end result.
    ;WITH PeriodsAccounts_CTE
    AS
    (
        SELECT [Year], Period, PeriodStart, Account, Organization
        FROM
        (-- I think doing the distincts before the cross join is more efficient
         -- than doing it after; not 100% sure on this, bears testing with a
         -- larger data set.
         SELECT DISTINCT [Year], Period, PeriodStart
         FROM   #OriginalFact
        ) AS D CROSS JOIN
        (
         SELECT DISTINCT Account, Organization
         FROM   #OriginalFact
        ) AS A
    )
    -- Main SELECT statement
    SELECT PA.[Year], PA.Period, PA.Account, PA.Organization, F.Amount,
           CASE WHEN PA.PeriodStart = F.PeriodStart THEN 'Original'
                ELSE 'Derived'
           END AS FactType -- Flag to indicate if the row was original or derived, in case that's helpful
    FROM   PeriodsAccounts_CTE AS PA CROSS APPLY
           (-- For each row in PA, get all fact rows with the same account/org
            -- that occurred on or before the current row's date.
            -- Sort by date descending and grab the TOP 1 to get most recent.
            SELECT TOP 1 Amount, PeriodStart
            FROM   #OriginalFact AS O
            WHERE  O.Account = PA.Account
            AND    O.Organization = PA.Organization
            AND    O.PeriodStart <= PA.PeriodStart
            ORDER BY O.PeriodStart DESC
           ) AS F

    Note the index; if this isn't built, performance of the CROSS APPLY will be absolutely abysmal. Even with the index in place, I suspect that this approach won't scale very well at the volumes you are working with. Still, it's worth a shot.

    I do suspect, however, that (as suggested by fahey.jonathan) you will ultimately need to pre-calculate these values as part of your ETL process. That's the good thing about warehoused data; lengthy, complex calculations can be run overnight and persisted for fast retrieval during the day. In this type of environment, implementing costly views is not likely to be in your best interest.

    In any case, I'm interested in hearing how this performs with your larger data set.

  • Hi JonFox,

    I tried executing your SQL against my fact table, which contains 123,591 records.

    The performance is actually quite good... it took around 2 seconds to complete... but the results don't seem quite right, as only 72,854 rows are returned... I was expecting more than 123,591 rows.

    I will update you with my findings later.

    Thanks.

  • JonFox (1/21/2012)

    * Double post - please ignore this reply.
