Recently I was asked to see if I could optimize a debt collection report system for a medium sized company (100k customers). The problem was that the daily overnight data processing procedure sometimes ran for hours, and it would not finish before the next morning. Attempts to get the data online usually caused even more problems, sometimes even invalid data records being inserted.
The system worked perfect, and fast, when it was first implemented, but that was when there was no, or nearly no, data. Now, after being in use for nearly two years, the main table contained 200 million rows. We quickly found some essential indexes were missing. Nobody had been checking on the database. What was more disturbing was that the system contained much more data than one would expect. Apparently the data processing procedure had created many useless, duplicate rows.
We decided to redesign the whole database, and reconsider the data processing procedure. In this article I will describe one of the essential elements of this new design: the implementation of a Type 2 Slowly Changing Fact (SCF) table. Although information on implementing such a table is available on the Internet, we ran into problems and had to combine several solutions. Therefore, I decided to write this article. I hope it helps you when you have to tackle similar problems.
The samples are written for SQL Server 2008 R2, but it should also work for other versions supporting the MERGE statement.
Requirements
Our debt collection system is required to store all unpaid invoices that are overdue and order them based on the number of days the invoice is overdue. The system should store the data of the individual invoices, provide a source for consolidated summary data, and supply data to generate reminder letters. Invoice data is provided by an ETL system overnight, where we have a limited time window to process the data.
As stated before we will discuss the implementation of a Slowly Changing Fact table in this article. Although interesting, the details of the debt processing procedure won't be discussed. The essential requirements for this part of the process are:
- The system should provide daily overdue invoice data for all customers.
- For each invoice there should be one record (one day state) defining it as paid.
- There is no need to further store information about paid invoices
- However, since errors in the ETL data frequently occur, it should be possible to 'reopen' an invoice once recorded as paid.
- Customers should be classified according to the maximum number of days their invoices are overdue.
- Data loads for the same day should be allowed to repeat, overwriting records inserted/updated for that day.
The existing implementation simply stores a copy of all overdue invoices for every day data processed in one large table. This table contains now 200 million, mostly identical, rows. In addition, our existing system contains too many records for paid invoices, which were incorrectly inserted by our data processing algorithm. This typically happens when the process was killed and restarted.
Implement the Invoices table as a Slowly Changing Fact
A SCF stores, similar to a Slowly Changing Dimension, a new record in the database when the data changes. In addition validity data is stored, indicating when a certain set of data is valid. Data for any given date can be retrieved as follows:
SELECT * FROM SCF_Invoices WHERE OnDate BETWEEN ValidFrom AND ValidTo
An excellent article discussing the different possible implementations of Slowly Changing Dimensions using the MERGE statement has been written by Adam Aspin in 2011. I will use his implementation of SCD type 2 as a starting point. But before that, I will introduce the tables used in this article
DI_Customer:
Contains Customer definitions
CREATE TABLE DI_Customer ( CustomerID int IDENTITY(1,1) NOT NULL, CustomerIDO int NOT NULL, -- Customer ID from the ETL system CustomerName varchar(50) NOT NULL, CustomerAddress varchar(500) OT NULL )
DI_Bucket:
This dimension is used to classify Customers based on the maximum number of days their invoices are overdue
CREATE TABLE DI_Bucket ( BucketID int IDENTITY(1,1) NOT NULL, BucketName varchar(50) NOT NULL, BucketFromDays int NOT NULL, BucketToDays int NOT NULL )
Some sample data
INSERT INTO [dbo].[DI_Bucket] ([BucketName], [BucketFromDays], [BucketToDays]) VALUES ('In time', 0, 30), ('OverDue', 31, 45), ('OverDue2', 46, 60), ('OverDue', 61, 70), ('OverDue4', 71, 89), ('Enforcement', 81, 999999)
SCF_Invoices:
The Slowly Changing Fact table.
CREATE TABLE dbo.SCF_Invoices( ( InvoiceID int IDENTITY(1,1) NOT NULL, InvoiceIDO int NOT NULL, -- Invoice ID from the ETL system CustomerID int NOT NULL, DueDate date NOT NULL, IsOpen bit NOT NULL DEFAULT 1, --Indicates that the Invoice was not (fully) paid yet BucketID int NOT NULL, OrigAmnt numeric(16,2) NOT NULL, --The original Invoice amount RemAmnt numeric(16,2) NOT NULL, --The remaining, unpaid, Invoice amount ValidFrom date NOT NULL, ValidTo DATE NOT NULL DEFAULT '99991231', IsCurrent BIT NOT NULL DEFAULT 0, CONSTRAINT [FK_SCF_Invoices_Customer] FOREIGN KEY (CustomerID) REFERENCES DI_Customer(CustomerID), CONSTRAINT [FK_SCF_Invoices_Bucket] FOREIGN KEY (BucketID) REFERENCES DI_Bucket(BucketID), )
In addition we will define a table valued function returning fact data for any given day. This TVF replaces the original fact table for retrieving daily data.
CREATE FUNCTION tvf_F_Invoices(@OnDate date) RETURNS TABLE AS RETURN ( SELECT InvoiceID, InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, @OnDate AS OnDate, ValidFrom, ValidTo, IsCurrent FROM SCF_Invoices WHERE @OnDate BETWEEN ValidFrom AND ValidTo )
Note that we did not define a primary key for the SCF table yet. This will be discussed later. I did, however, define two foreign key constraints on the Dimension tables. We will later see why.
Daily Invoice data will be supplied in a staging table. This table supplies information about all outstanding Invoices on a given day as well as Invoices paid during the previous month. This will make sure the procedure will get at least once a closing record for every paid invoice.
CREATE TABLE dbo.ST_Invoices( ( OnDate date NOT NULL, InvoiceIDO int NOT NULL, -- Invoice ID from the ETL system CustomerID int NOT NULL, DueDate date NOT NULL, OrigAmnt numeric(16,2) NOT NULL, --The original Invoice amount RemAmnt numeric(16,2) NOT NULL, --The remaining, unpaid, Invoice amount )
Define the SCF processing statement
Now we have all the information and can write our SCF processing procedure. First of all we have to define the select statement returning the source data as it should be merged in the SCF table.
SELECT st.InvoiceIDO, ISNULL(cus.CustomerID,-1) AS CustomerID, -- Return Unknown Customer (-1) when not matched st.DueDate, CASE WHEN st.RemAmnt > 0 THEN 1 ELSE 0 END AS IsOpen, ISNULL(buc.BucketID,-1) as BucketID, --Return Unknown (-1) when not matched st.OrigAmnt, st.RemAmnt FROM ST_Invoices st LEFT JOIN DI_Customer cus ON cus.CustomerIDO = st.CustomerIDO LEFT JOIN DI_Bucket buc ON DATEDIFF(dd,st.DueDate,st.OnDate) BETWEEN buc.BucketFromDays AND buc.BucketToDays WHERE st.OnDate=@OnDate
Implementing the SCF MERGE statement seems now straightforward with the instructions given by Adam Aspin:
CREATE PROCEDURE LOAD_SCF_Invoices AS BEGIN SET NOCOUNT ON DECLARE @Today date, @Yesterday date SELECT @Today = MAX(OnDate) FROM ST_Invoices SELECT @Yesterday = DATEADD(dd, -1, @Today) --The OUTER insert inserts a new valid record for all records flagged as no longer active INSERT INTO SCF_Invoices (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent) SELECT InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, @Today AS ValidFrom, CASE WHEN IsOpen=0 THEN @Today ELSE '99991231' END AS ValidTo, 1 FROM ( -- Merge statement MERGE INTO SCF_Invoices AS DST USING ( SELECT st.InvoiceIDO, ISNULL(cus.CustomerID,-1) AS CustomerID, -- Return Unknown Customer (-1) when not matched st.DueDate, CASE WHEN st.RemAmnt > 0 THEN 1 ELSE 0 END AS IsOpen, ISNULL(buc.BucketID,-1) as BucketID, --Return Unknown (-1) when not matched st.OrigAmnt, st.RemAmnt FROM ST_Invoices st LEFT JOIN DI_Customer cus ON cus.CustomerIDO = st.CustomerIDO LEFT JOIN DI_Bucket buc ON DATEDIFF(dd,st.DueDate,st.OnDate) BETWEEN buc.BucketFromDays AND buc.BucketToDays WHERE st.OnDate = @Today ) AS SRC ON (SRC.InvoiceIDO = DST.InvoiceIDO) AND (DST.IsCurrent = 1) -- This conditrion limits the MERGE to Current invoice records only, and can make use of a conditional index -- Insert a new record if the Invoice is not already present WHEN NOT MATCHED THEN INSERT (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent) VALUES (SRC.InvoiceIDO, SRC.CustomerID, SRC.DueDate, SRC.IsOpen, SRC.BucketID, SRC.OrigAmnt, SRC.RemAmnt, @Today, '99991231', 1) -- Check if an existing Invoice record has changed data WHEN MATCHED AND (SRC.IsOpen = 1 OR DST.IsOpen = 1) --Don't process changes on closed Invoices AND (DST.CustomerID <> SRC.CustomerID OR DST.DueDate <> SRC.DueDate OR DST.IsOpen <> SRC.IsOpen OR DST.BucketID <> SRC.BucketID OR DST.OrigAmnt <> SRC.OrigAmnt OR DST.RemAmnt <> SRC.RemAmnt ) -- When Invoice data changed, set the existing record as no longer active, and the validity period to end Yesterday THEN UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday OUTPUT SRC.InvoiceIDO, SRC.CustomerID, SRC.DueDate, SRC.IsOpen, SRC.BucketID, SRC.OrigAmnt, SRC.RemAmnt, $Action AS MergeAction ) AS MRG WHERE MRG.MergeAction = 'UPDATE'; END
Note the small differences compared with the procedure given by Adam Asppn
- The (IsCurrent = 1) condition should be moved from the IS MATCHED condition to the main merge search condition, this will enable the merge to use a conditional index
- Since we require that closed Invoices are not visible after the date they are recorded as IsOpen=0, ValidTo should be set to @Today, and not 9999-12-31 (infinite) as is usual for a Slowly Changing Dimension
- Because we are interested in Open documents mostly, there is no need to check for changes when both source and destination records have IsOpen=0, hence the additional IS MATCHED condition: AND (SRC.IsOpen = 1 OR DST.IsOpen = 1)
However, running this query will result in an error:
The target table 'dbo.SCF_Invoices' of the INSERT statement cannot be on either side of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. Found reference constraint 'FK_SCF_Invoices_Customer'.
The error is caused by is a known bug in SQL Server. The problem is using the OUTPUT clause of a DML statement with the INSERT... SELECT syntax. Workarounds are either to not have foreign keys referencing the dimension table or to INSERT the results of the MERGE... OUTPUT into a temp table then INSERT into the actual dimension table.
I decided to use the latter workaround and create a temporary table. Temporarily dropping the Foreign Key constraints is not acceptable. The final load procedure is as follows:
CREATE PROCEDURE LOAD_SCF_Invoices AS BEGIN SET NOCOUNT ON DECLARE @Today date, @Yesterday date SELECT @Today = MAX(OnDate) FROM ST_Invoices SELECT @Yesterday = DATEADD(dd, -1, @Today) -- To avoid the error: The target table 'dbo.SCF_CASE_DETAILS' of the INSERT statement cannot be on either side of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. Found reference constraint 'FK_SCF_CASE_DETAILS_CUSTOMER'. -- Insert in two steps. The outer insert will firs insert in a temporary table, then the temporary data is insert in [dbo].[SCF_CASE_DETAILS] -- Create the temporyry table CREATE TABLE #SCF_Invoices ( InvoiceIDO int NOT NULL, -- Invoice ID from the ETL system CustomerID int NOT NULL, DueDate date NOT NULL, IsOpen bit NOT NULL DEFAULT 1, --Indicates that the Invoice was not (fully) paid yet BucketID int NOT NULL, OrigAmnt numeric(16,2) NOT NULL, --The original Invoice amount RemAmnt numeric(16,2) NOT NULL, --The remaining, unpaid, Invoice amount ValidFrom date NOT NULL, ValidTo DATE NOT NULL DEFAULT '99991231', IsCurrent BIT NOT NULL DEFAULT 0, ) -- The OUTER insert inserts a new valid record for all records flagged as no longer active -- INSERT in the temporary table INSERT INTO #SCF_Invoices (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent) SELECT InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, @Today AS ValidFrom, CASE WHEN IsOpen=0 THEN @Today ELSE '99991231' END AS ValidTo, 1 FROM ( -- Merge statement MERGE INTO SCF_Invoices AS DST USING ( SELECT st.InvoiceIDO, ISNULL(cus.CustomerID,-1) AS CustomerID, -- Return Unknown Customer (-1) when not matched st.DueDate, CASE WHEN st.RemAmnt > 0 THEN 1 ELSE 0 END AS IsOpen, ISNULL(buc.BucketID,-1) as BucketID, --Return Unknown (-1) when not matched st.OrigAmnt, st.RemAmnt FROM ST_Invoices st LEFT JOIN DI_Customer cus ON cus.CustomerIDO = st.CustomerIDO LEFT JOIN DI_Bucket buc ON DATEDIFF(dd,st.DueDate,st.OnDate) BETWEEN buc.BucketFromDays AND buc.BucketToDays WHERE st.OnDate = @Today ) AS SRC ON (SRC.InvoiceIDO = DST.InvoiceIDO) AND (DST.IsCurrent = 1) -- This condition limits the MERGE to Current invoice records only, and can make use of a conditional index -- Insert a new record if the Invoice is not already present WHEN NOT MATCHED THEN INSERT (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent) VALUES (SRC.InvoiceIDO, SRC.CustomerID, SRC.DueDate, SRC.IsOpen, SRC.BucketID, SRC.OrigAmnt, SRC.RemAmnt, @Today, '99991231', 1) -- Check if an existing Invoice record has changed data WHEN MATCHED AND (SRC.IsOpen = 1 OR DST.IsOpen = 1) --Don't process changes on closed Invoices AND (DST.CustomerID <> SRC.CustomerID OR DST.DueDate <> SRC.DueDate OR DST.IsOpen <> SRC.IsOpen OR DST.BucketID <> SRC.BucketID OR DST.OrigAmnt <> SRC.OrigAmnt OR DST.RemAmnt <> SRC.RemAmnt ) -- When Invoice data changed, set the existing record as no longer active, and set the validity period to end Yesterday THEN UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday OUTPUT SRC.InvoiceIDO, SRC.CustomerID, SRC.DueDate, SRC.IsOpen, SRC.BucketID, SRC.OrigAmnt, SRC.RemAmnt, $Action AS MergeAction ) AS MRG WHERE MRG.MergeAction = 'UPDATE'; -- Finally insert the data from the temporary table INSERT INTO SCF_Invoices (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent) SELECT InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent FROM #SCF_Invoices END
Deleting data from the SCF table
Another requirement of the system is that data loads for the same day should be allowed to repeat. That means we have to delete, or invalidate, all changes made by the previous load. Of course, we cannot simply delete records from our SCF table. We also have to update the previous valid document and set IsCurrent=1, as well as change ValidTo.
To make things even more complicated the newly generated InvoiceID's could be referenced in another table. Deleting the newly inserted records would in that case result in a REFERENTIAL ERROR. To avoid referential errors I decided not to delete the records, but set their validity interval to a negative value, i.e. ValidTo < ValidFrom using the following code:
--DON'T DELETE, to avoid REFERENCE constraint conflicts, instead set the Validity of the record to an impossible value UPDATE SCF_Invoices SET ValidTo = @Yesterday, IsCurrent =0 WHERE ValidFrom >= @Today --Correct SCF records UPDATE SCF_Invoices SET ValidTo = CASE WHEN IsOpen = 1 THEN '99991231' ELSE ValidFrom END, IsCurrent =1 WHERE ValidFrom < @Today AND ValidTo >= @Yesterday
This piece of code should be inserted in the procedure LOAD_SCF_Invoices before the MERGE statement.
Indexes
The next thing to consider is which indexes to create. The standard approach is to create a PRIMARY KEY on the surrogate key InvoiceID. This is an ideal PRIMARY KEY, inserted records are always appended at the end of the table. It does, however, not provide and special advantages when querying the Slowly Changing Fact table.
In our case, we are usually querying the old Invoice table by Date:
SELECT * FROM F_Invoices WHERE OnDate=@OnDate
These references will typically be replaced by querying the table valued function we created before
SELECT * FROM tvf_F_Invoices(@OnDate)
The SCF condition - WHERE OnDate BETWEEN ValidFrom AND ValidTo
- strongly benefits from a covering index on ValidFrom. Especially when ValidFrom is close to the last Date for which data is available. This corresponds also with our business interest. We will usually query the latest Invoice data. Therefore I decided to create the following Index:
ALTER TABLE SCF_Invoices ADD CONSTRAINT PK_SCF_Invoices PRIMARY KEY CLUSTERED ( ValidFrom ASC, InvoiceIDO ASC )
Since data is added for consecutive days, this PRIMARY KEY also will ensure new records are always appended at the end of the table.
I also wanted to create an index to aid the MERGE statement. Since every Invoice in the SCF table always has one record with (IsCurrent = 1) and this record is theone which should be updated, it is enough to consider only records which fulfill this condition. Furthermore, since the Source and Destination are joined on InvoiceIDO the following conditional covering index strongly optimizes the MERGE since it will evaluate only Current records.
CREATE NONCLUSTERED INDEX IX_IsCurrent_InvoiceIDO_InclData ON SCF_Invoices ( ValidFrom ASC, InvoiceID ASC ) INCLUDE ( IsCurrent, ValidTo, InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt) WHERE(IsCurrent = 1)
Probably these indexes are not ideal. The optimal indexes also depend strongly on how the database is queried. I would be happy to hear if any of you has any better suggestions.
This concludes my story about the implementation of a Slowly Changing Fact table. And the result? I was able to reduce the Invoice table from the original 200 million records to 5 million. A substantial reduction; which also had its effects on the responsiveness of the database. The loading process for example, was reduced from 1.5 hours, originally, or exceptionally even several hours, to a mere 5 minutes in average!
A nice result I would say. I hope my story will help you to tackle similar problems.