January 16, 2012 at 11:35 pm
Hi SQL gurus,
I have a fact table which looks like the one below. Currently it has two quarters of data loaded, June (period 06) and September (period 09). For a particular account, data was captured in period 06 but not in period 09. Please see the example below:
[Original Fact table]
Account | Organization | Period | Year | Amount
-------------------------------------------------
CMP1000 | PM1 | 06 | 2011 | 100
CMP1001 | PM1 | 06 | 2011 | 250
CMP1000 | PM1 | 09 | 2011 | 400
CMP1002 | PM1 | 09 | 2011 | 500
If there is no data in period 09 but there is data in period 06 for a particular account, I would like to insert a new record so that period 09 takes over the period 06 data. Please see my desired output below:
[Derived Fact table]
Account | Organization | Period | Year | Amount
-------------------------------------------------
CMP1000 | PM1 | 06 | 2011 | 100
CMP1001 | PM1 | 06 | 2011 | 250
CMP1000 | PM1 | 09 | 2011 | 400
CMP1002 | PM1 | 09 | 2011 | 500
CMP1001 | PM1 | 09 | 2011 | 250
I would appreciate it if any gurus out there could assist me with this problem. Thanks!
January 17, 2012 at 1:00 am
Hi,
This is possible using the SQL MERGE statement. You can do something like this:
--Test Data
declare @SampleTable as table(
Account varchar(10)
,Organization varchar(20)
,Period varchar(2)
,[Year] int
,Amount money
)
insert into @SampleTable
(
Account,
Organization,
Period,
[Year],
Amount
)
select 'CMP1000','PM1','06',2011,100 union
select 'CMP1001','PM1','06',2011,250 union
select 'CMP1000','PM1','09',2011,400 union
select 'CMP1002','PM1','09',2011,500
-----------------------------------------------------
;with cte as(
select
Account,
Organization,
Period,
[Year],
Amount
from
@SampleTable
where Period = '06'
--<required_condition>
)
merge @SampleTable as T
using (select
Account,
Organization,
Period,
[Year],
Amount
from cte
) as S on (T.Account = S.Account and T.Period = '09')
when not matched then
insert(Account,Organization,Period,[Year],Amount)
values(S.Account,S.Organization,'09',S.[Year],S.Amount); -- the new row gets period 09, carrying the period 06 amount
select * from @SampleTable order by Period asc
Though I have used static values for the period, you can pass them in as parameters. Hope this is what you want. 🙂
** MERGE is only available in SQL Server 2008 and higher. If you are on an earlier version (2000 or 2005), this will not work; you would have to use a join-based INSERT instead.
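For pre-2008 versions, an untested sketch of that alternative, written against the same @SampleTable and the same static periods as above:
--Pre-2008 alternative: INSERT ... SELECT with a NOT EXISTS check instead of MERGE
insert into @SampleTable (Account, Organization, Period, [Year], Amount)
select s.Account, s.Organization, '09', s.[Year], s.Amount
from @SampleTable as s
where s.Period = '06'
and not exists (select 1
                from @SampleTable as t
                where t.Account = s.Account
                and t.Period = '09')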
--------
Manjuke
http://www.manjuke.com
January 17, 2012 at 1:58 am
Hi manjuke,
Thanks for the reply, but your solution does not seem flexible enough, as it only compares periods 06 and 09. My actual fact table contains periods 01 to 12. It would be ideal if your script could compare each period to period-1 and then populate a new record where needed.
Thanks.
January 17, 2012 at 1:28 pm
Let's make sure I have this right. You have an "OriginalFact" table and you want to populate data into a "DerivedFact" table, using the "prior" month's data if "this" month's data is missing, without affecting the data in the "OriginalFact" table. Is that right?
Does the data wrap years, meaning that the "prior" month to January is December?
January 17, 2012 at 7:41 pm
Hi fahey.jonathan,
Let's make sure I have this right. You have an "OriginalFact" table and you want to populate data into a "DerivedFact" table, using the "prior" month's data if "this" month's data is missing, without affecting the data in the "OriginalFact" table. Is that right?
Yes
Does the data wrap years, meaning that the "prior" month to January is December?
Yes. For example, the prior month of January 2011 is December 2010.
January 17, 2012 at 8:53 pm
CREATE -- DROP
TABLE OriginalFact
(
    Account      VARCHAR(10) NOT NULL,
    Organization VARCHAR(10) NOT NULL,
    [Year]       INTEGER     NOT NULL,
    Period       SMALLINT    NOT NULL,
    Amount       INTEGER     NOT NULL
)

CREATE -- DROP
TABLE DerivedFact
(
    Account      VARCHAR(10) NOT NULL,
    Organization VARCHAR(10) NOT NULL,
    [Year]       INTEGER     NOT NULL,
    Period       SMALLINT    NOT NULL,
    Amount       INTEGER     NOT NULL,
    TypeCode     CHAR(1)     NOT NULL CHECK (TypeCode IN ('C', 'P'))
)

-- Sample data for the OriginalFact table.
INSERT
INTO    OriginalFact
        (Account, Organization, [Year], Period, Amount)
SELECT  Account, Organization, [Year], Period, Amount
FROM    (VALUES
        ('CMP1000', 'PM1', 2011, 11, 100),
        ('CMP1001', 'PM1', 2011, 11, 101),
        ('CMP1002', 'PM1', 2011, 11, 102),
        ('CMP1003', 'PM1', 2011, 11, 103),
        ('CMP1001', 'PM1', 2011, 12, 111),
        ('CMP1002', 'PM1', 2011, 12, 112),
        ('CMP1003', 'PM1', 2011, 12, 113),
        ('CMP1002', 'PM1', 2012, 01, 122),
        ('CMP1003', 'PM1', 2012, 01, 123)
        ) x (Account, Organization, [Year], Period, Amount)

-- Set the variables for the year and period desired.
-- Run this section for each month.
DECLARE @Year   INTEGER  = 2011
DECLARE @Period SMALLINT = 11

INSERT
INTO    DerivedFact
        (Account, Organization, [Year], Period, Amount, TypeCode)
SELECT  ISNULL(c.Account, p.Account) AS Account,
        ISNULL(c.Organization, p.Organization) AS Organization,
        @Year   AS [Year],
        @Period AS Period,
        ISNULL(c.Amount, p.Amount) AS Amount,
        CASE WHEN c.Account IS NULL THEN 'P' ELSE 'C' END -- Either "C"urrent value or "P"rior value.
FROM    ( -- Get records from the Current month from the OriginalFact table.
        SELECT  Account, Organization, [Year], Period, Amount
        FROM    OriginalFact
        WHERE   [Year] = @Year
        AND     Period = @Period
        ) c -- Current
FULL JOIN
        ( -- Get records from the Prior month.
        SELECT  Account, Organization, [Year], Period, Amount
        --FROM  OriginalFact -- Use OriginalFact if values missing last month should NOT be included this month. See Notes **
        FROM    DerivedFact  -- Use DerivedFact if values missing last month SHOULD be included this month.
        WHERE   [Year] = CASE WHEN @Period = 1 THEN @Year - 1 ELSE @Year END
        AND     Period = CASE WHEN @Period = 1 THEN 12 ELSE @Period - 1 END
        ) p -- Prior
ON      p.Account = c.Account
AND     p.Organization = c.Organization

SELECT  *
FROM    DerivedFact
ORDER BY Account,
        Organization,
        [Year],
        Period
/* Notes
** CMP1000 was missing for both 2011/12 and 2012/01. If the values from 2011/11 should be
   rolled forward into 2011/12 and again into 2012/01, then use DerivedFact because it will
   contain records for CMP1000 - CMP1003. If CMP1000 should not be included in 2012/01 because
   it was not in 2011/12 or in 2012/01, then use OriginalFact because it will contain only
   CMP1001 - CMP1003 for the prior month.
*/
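If several months need to be loaded in sequence, here is one way to drive that insert, sketched on the assumption that the INSERT above has been wrapped in a (hypothetical) stored procedure usp_LoadDerivedFact taking @Year and @Period; the date range is just for illustration:
-- Load 2011/11 through 2012/01 in order, one period at a time.
DECLARE @LoadYear   INTEGER  = 2011
DECLARE @LoadPeriod SMALLINT = 11

WHILE @LoadYear < 2012 OR (@LoadYear = 2012 AND @LoadPeriod <= 1)
BEGIN
    EXEC usp_LoadDerivedFact @Year = @LoadYear, @Period = @LoadPeriod

    -- Advance to the next period, wrapping December into January of the next year.
    IF @LoadPeriod = 12
    BEGIN
        SET @LoadYear   = @LoadYear + 1
        SET @LoadPeriod = 1
    END
    ELSE
        SET @LoadPeriod = @LoadPeriod + 1
END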
January 18, 2012 at 12:42 am
Hi fahey.jonathan,
Based on your sample data here:
('CMP1000', 'PM1',2011,11,100),
('CMP1001', 'PM1',2011,11,101),
('CMP1002', 'PM1',2011,11,102),
('CMP1003', 'PM1',2011,11,103),
('CMP1001', 'PM1',2011,12,111),
('CMP1002', 'PM1',2011,12,112),
('CMP1003', 'PM1',2011,12,113),
('CMP1002', 'PM1',2012,01,122),
('CMP1003', 'PM1',2012,01,123)
The desired output should be:
('CMP1000', 'PM1',2011,11,100),--Row A
('CMP1001', 'PM1',2011,11,101),--Row B
('CMP1002', 'PM1',2011,11,102),
('CMP1003', 'PM1',2011,11,103),
('CMP1001', 'PM1',2011,12,111),
('CMP1002', 'PM1',2011,12,112),
('CMP1003', 'PM1',2011,12,113),
('CMP1000', 'PM1',2011,12,100),--Copy from Row A
('CMP1002', 'PM1',2012,01,122),
('CMP1003', 'PM1',2012,01,123),
('CMP1000', 'PM1',2012,01,100),--Copy from Row A
('CMP1001', 'PM1',2012,01,101),--Copy from Row B
Originally, the DerivedFact table does not exist in the database; I only want to translate the original fact table into a derived view, which is what the DerivedFact table represents here. Sorry for the confusion.
Thanks.
January 18, 2012 at 7:33 am
Two questions:
1 - Are you saying that DerivedFact is a view, not a table?
2 - Why is CMP1001 copied from two months prior (2011/11) rather than one month prior (2011/12)? You have the 2011/11 row labeled as Row B; I expected the 2011/12 version to be labeled as Row B. I thought the goal was to bring the prior month forward if it existed, which for 2012/01 would be 2011/12.
January 18, 2012 at 8:15 am
Hi,
Apologies for my mistake. The output of the view should look like this:
('CMP1000', 'PM1',2011,11,100),
('CMP1001', 'PM1',2011,11,101),
('CMP1002', 'PM1',2011,11,102),
('CMP1003', 'PM1',2011,11,103),
('CMP1001', 'PM1',2011,12,111),
('CMP1002', 'PM1',2011,12,112),
('CMP1003', 'PM1',2011,12,113),
('CMP1000', 'PM1',2011,12,100),
('CMP1002', 'PM1',2012,01,122),
('CMP1003', 'PM1',2012,01,123),
('CMP1000', 'PM1',2012,01,100),
('CMP1001', 'PM1',2012,01,111),
The goal was to always bring the prior month forward if it existed. Sorry for my mistake.
DerivedFact is a view, not a table, because DerivedFact does not originally exist as a table in the database.
Thanks.
January 18, 2012 at 9:00 am
I'm going to suggest a different strategy because of the complexity of the need. You need to bring forward each row from the prior month, meaning that if a row is missing for three months, you either need a recursive call to roll it forward one month at a time, or a cross join to get the last prior available record. Both of those are going to be fairly expensive, especially with a large data set. If you will be using this data with some frequency, that will probably be taxing on the server to run those queries every time.
I recommend that you add a column to your fact table that says whether the record is a base fact or a derived fact. Once a month when you add data, insert the base facts marked with a B ("base") or F ("Fact"), etc. Then insert the derived facts from the prior month where applicable, marked with a D ("Derived"), etc. This way, when using the fact table, there are no calculations that need to be done and your derived facts are indexed along with the base facts, so queries should be fast.
If you need a version of the table that contains only base facts, create a view for it by selecting only records with a B (or F, etc).
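A rough sketch of that monthly load, where the FactType column, the BaseFactOnly view, and the @Year/@Period values are illustrative assumptions rather than your actual schema:
-- Add a flag column to the existing fact table ('B' = base fact, 'D' = derived fact).
ALTER TABLE OriginalFact ADD FactType CHAR(1) NOT NULL DEFAULT 'B'
GO

-- Monthly load: after inserting the base facts for @Year/@Period, roll forward any
-- prior-month rows that have no current-month row, flagged as derived.
DECLARE @Year   INTEGER  = 2012
DECLARE @Period SMALLINT = 1

INSERT INTO OriginalFact (Account, Organization, [Year], Period, Amount, FactType)
SELECT  p.Account, p.Organization, @Year, @Period, p.Amount, 'D'
FROM    OriginalFact AS p
WHERE   p.[Year] = CASE WHEN @Period = 1 THEN @Year - 1 ELSE @Year END
AND     p.Period = CASE WHEN @Period = 1 THEN 12 ELSE @Period - 1 END
AND     NOT EXISTS (SELECT 1
                    FROM   OriginalFact AS c
                    WHERE  c.Account      = p.Account
                    AND    c.Organization = p.Organization
                    AND    c.[Year]       = @Year
                    AND    c.Period       = @Period)
GO

-- View exposing only the base facts, if that version is needed.
CREATE VIEW BaseFactOnly AS
SELECT Account, Organization, [Year], Period, Amount
FROM   OriginalFact
WHERE  FactType = 'B'
GO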
January 19, 2012 at 5:12 pm
How many rows are in the base fact table? What will the growth rate be? What is the maximum number of periods that will be stored in the fact table?
January 19, 2012 at 9:10 pm
Hi JonFox,
Based on the two periods that I have loaded into the fact table, I estimate that it has around 150,000+ records. I would say the database grows quite substantially every month.
Thanks.
January 21, 2012 at 9:10 am
How wedded are you to the schema of your base fact table? Is this a work in progress that you can make some changes to, or is it already in production? The reason I ask is because I think there are some changes you might want to consider. When designing a dimensional schema, data such as the period and year would normally be a reference to a date dimension's surrogate key, rather than being stored directly in the fact table. This would apply to account/organization info as well. I would definitely recommend reading one of the Kimball Group books such as The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling in order to really get a handle on dimensional modeling techniques.
While playing around with the sample data, I ended up adding a new DATE column to the OriginalFact table, to store the year/period in a way that is easier to index and compare against. Again, I would really recommend that this field actually be an attribute on a Date dimension, but this is a start. Since we have to sort and filter by year/period in order to get the most recent Amount for each Account/Date, this gives us a good value that can be indexed.
Here's a potential solution using CROSS APPLY:
-- If the temp table with sample data already exists, drop it
IF OBJECT_ID('tempdb..#OriginalFact') IS NOT NULL DROP TABLE #OriginalFact;
-- Create temp table to hold sample data
CREATE TABLE #OriginalFact
(
    Account      VARCHAR(10) NOT NULL,
    Organization VARCHAR(10) NOT NULL,
    [Year]       INTEGER     NOT NULL,
    Period       SMALLINT    NOT NULL,
    Amount       INTEGER     NOT NULL,
    PeriodStart  DATE
)
-- In order for the technique below to be anywhere close to efficient, we must have a good index in place!
CREATE INDEX Account_Organization_PeriodStart ON #OriginalFact (Account, Organization, PeriodStart DESC) INCLUDE (Amount)
-- Load sample data
INSERT
INTO    #OriginalFact
        (Account, Organization, [Year], Period, Amount)
SELECT  Account, Organization, [Year], Period, Amount
FROM    (VALUES
        ('CMP1000', 'PM1', 2011, 11, 100),
        ('CMP1001', 'PM1', 2011, 11, 101),
        ('CMP1002', 'PM1', 2011, 11, 102),
        ('CMP1003', 'PM1', 2011, 11, 103),
        ('CMP1001', 'PM1', 2011, 12, 111),
        ('CMP1002', 'PM1', 2011, 12, 112),
        ('CMP1003', 'PM1', 2011, 12, 113),
        ('CMP1002', 'PM1', 2012, 01, 122),
        ('CMP1003', 'PM1', 2012, 01, 123)
        ) x (Account, Organization, [Year], Period, Amount)
-- Set the value for the PeriodStart date
UPDATE #OriginalFact SET PeriodStart = dateadd(month,((Year-1900)*12)+Period-1,0)
-- Common Table Expression using a CROSS JOIN to generate a set containing all
-- Accounts/Organizations for each Year/Period. Without doing this, we would
-- wind up with "holes" in the end result.
;WITH PeriodsAccounts_CTE
AS
(
SELECT [Year], Period, PeriodStart, Account, Organization
FROM
(-- I think doing the distincts before the cross join is more efficient
-- than doing it after; not 100% sure on this, bears testing with a
-- larger data set.
SELECT DISTINCT [Year], Period, PeriodStart
FROM #OriginalFact
) AS D CROSS JOIN
(
SELECT DISTINCT Account, Organization
FROM #OriginalFact
) AS A
)
-- Main SELECT statement
SELECT PA.[Year], PA.Period, PA.Account, PA.Organization, F.Amount,
CASE WHEN PA.PeriodStart = F.PeriodStart THEN 'Original'
ELSE 'Derived'
END AS FactType -- Flag to indicate if the row was original or derived, in case that's helpful
FROM PeriodsAccounts_CTE AS PA CROSS APPLY
(-- For each row in PA, get all fact rows with the same account/org
-- that occurred on or before the current row's date.
-- Sort by date descending and grab the TOP 1 to get most recent.
SELECT TOP 1 Amount, PeriodStart
FROM #OriginalFact AS O
WHERE O.Account = PA.Account
AND O.Organization = PA.Organization
AND O.PeriodStart <= PA.PeriodStart
ORDER BY O.PeriodStart DESC
) AS F
Note the index; if this isn't built, performance of the CROSS APPLY will be absolutely abysmal. Even with the index in place, I suspect that this approach won't scale very well at the volumes you are working with. Still, it's worth a shot.
I do suspect, however, that (as suggested by fahey.jonathan) you will ultimately need to pre-calculate these values as part of your ETL process. That's the good thing about warehoused data: lengthy, complex calculations can be run overnight and persisted for fast retrieval during the day. In this type of environment, implementing costly views is not likely to be in your best interest.
In any case, I'm interested in hearing how this performs with your larger data set.
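If you do go the pre-calculation route, persisting the output could look something like the sketch below. DerivedFact is only an illustrative target table here, and because a CTE is visible only to the single statement that follows it, the CTE from above is repeated in front of the INSERT:
-- Persist the derived rows during ETL (illustrative target table and column names).
;WITH PeriodsAccounts_CTE AS
(
    SELECT D.[Year], D.Period, D.PeriodStart, A.Account, A.Organization
    FROM   (SELECT DISTINCT [Year], Period, PeriodStart FROM #OriginalFact) AS D
    CROSS JOIN
           (SELECT DISTINCT Account, Organization FROM #OriginalFact) AS A
)
INSERT INTO DerivedFact (Account, Organization, [Year], Period, Amount, FactType)
SELECT PA.Account, PA.Organization, PA.[Year], PA.Period, F.Amount,
       CASE WHEN PA.PeriodStart = F.PeriodStart THEN 'Original' ELSE 'Derived' END
FROM   PeriodsAccounts_CTE AS PA
CROSS APPLY
       (SELECT TOP 1 Amount, PeriodStart
        FROM   #OriginalFact AS O
        WHERE  O.Account      = PA.Account
        AND    O.Organization = PA.Organization
        AND    O.PeriodStart  <= PA.PeriodStart
        ORDER BY O.PeriodStart DESC
       ) AS F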
February 8, 2012 at 9:12 pm
Hi JonFox,
I tried executing your SQL against my fact table, which contains 123,591 records.
The performance is actually quite good: it took around 2 seconds to complete. However, the results do not seem quite right, as only 72,854 rows were returned; I was expecting more than 123,591 rows.
I will update you with my findings later.
Thanks.
February 9, 2012 at 1:24 am
* Double post - pls ignore this reply.