A key component of the extract, transform, and load (ETL) process is the cleanup, or transformation, of data in transit from one place to another. In most cases there are restrictions, whether regulatory, company policy, or otherwise, that require any data modified in the ETL pipeline to be audited as such. Even if you are not bound by auditing requirements, it is a wise practice to build an audit mechanism for any data you update or exclude as part of the ETL process. Years down the road, if you or your successor is quizzed about the reasoning behind certain transformations, an audit file or table will be an invaluable tool in answering those questions. In addition, I have found that users will occasionally request access to the unmodified data as it was originally extracted from the source system, and an audit table can be useful in providing that information as well.
In this article, I'm going to demonstrate how you can include a simple auditing mechanism in your SQL Server Integration Services (SSIS) packages. We're going to extract some data from a source table and check one of the columns against a lookup table; if no match is found, we will substitute a default value in the output and write that row to an audit table to indicate that we updated the column as part of the ETL process.
To get started, let's create our tables. The following code creates and populates the source table, and creates the empty destination table.
-- Create the source table
CREATE TABLE OldCustomerDemographics
(
CustomerID INT IDENTITY(1,1)
, CustFirstName VARCHAR(50) NOT NULL
, CustLastName VARCHAR(50) NOT NULL
, CustSalesRegion VARCHAR(20) NULL
, CustStatus VARCHAR(20) NOT NULL
)
-- Create the destination table
CREATE TABLE NewCustomerDemographics
(
CustomerID INT
, CustFirstName VARCHAR(50) NOT NULL
, CustLastName VARCHAR(50) NOT NULL
, CustSalesRegion VARCHAR(20) NULL
, CustStatus VARCHAR(20) NOT NULL
)
-- Populate the source table
-- (a column list is required here so the IDENTITY column is skipped)
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Jim', 'Halpert', 'SOUTHWEST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Michael', 'Scott', 'NORTHEAST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Dwight', 'Schrute', 'WEST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Jan', 'Levinson', NULL, 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Pam', 'Beesly', 'EAST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Andy', 'Bernard', 'CENTRAL', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Angela', 'Martin', 'SOUTHEAST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Toby', 'Flenderson', 'MIDWEST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Ryan', 'Howard', NULL, 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Creed', 'Bratton', 'SOUTHWEST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Kevin', 'Malone', 'NORTHEAST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Phyllis', 'Lapin', 'WEST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Darryl', 'Philbin', 'EAST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Oscar', 'Martinez', 'CENTRAL', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Kelly', 'Kapoor', 'SOUTHEAST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Meredith', 'Palmer', 'MIDWEST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Stanley', 'Hudson', 'NORTH', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Karen', 'Filippelli', 'NORTH', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Bob', 'Vance', NULL, 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('David', 'Wallace', 'NONE', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Josh', 'Porter', 'CENTER', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Holly', 'Flax', 'SOUTHWEST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Todd', 'Packer', 'NORTHEAST', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Carol', 'Stills', 'CENTRAL', 'ACTIVE')
INSERT OldCustomerDemographics (CustFirstName, CustLastName, CustSalesRegion, CustStatus) VALUES ('Jim', 'Halpert', 'SOUTHWEST', 'ACTIVE')
The following statement will create the audit table. As you will see, we are adding a column for tracking the updated sales region value.
-- Create the Audit table - note the additional column
CREATE TABLE OldCustomerDemographics_AUDIT
(
CustomerID INT
, CustFirstName VARCHAR(50) NOT NULL
, CustLastName VARCHAR(50) NOT NULL
, CustSalesRegion VARCHAR(20) NULL
, CustStatus VARCHAR(20) NOT NULL
, AUDIT_NewSalesRegion VARCHAR(20) NULL
)
Next, we'll build and populate the lookup table. In our example, we will standardize the customer's sales region in the CustSalesRegion column.
CREATE TABLE NewSalesRegions
(
RegionID INT IDENTITY(1,1)
, RegionCode VARCHAR(20) NOT NULL
)
-- (again, a column list is required so the IDENTITY column is skipped)
INSERT NewSalesRegions (RegionCode) VALUES ('NORTH')
INSERT NewSalesRegions (RegionCode) VALUES ('NORTHEAST')
INSERT NewSalesRegions (RegionCode) VALUES ('NORTHWEST')
INSERT NewSalesRegions (RegionCode) VALUES ('SOUTHEAST')
INSERT NewSalesRegions (RegionCode) VALUES ('SOUTHWEST')
INSERT NewSalesRegions (RegionCode) VALUES ('SOUTH')
INSERT NewSalesRegions (RegionCode) VALUES ('EAST')
INSERT NewSalesRegions (RegionCode) VALUES ('WEST')
INSERT NewSalesRegions (RegionCode) VALUES ('CENTRAL')
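Before we build the package, it's worth previewing which source rows will fail the lookup. The following query is not part of the SSIS package; it's just a set-based way to spot the rows whose CustSalesRegion (including NULLs) has no match in the lookup table:

```sql
-- Preview the rows that will fail the lookup
-- (NULL regions never match, so they appear here as well)
SELECT o.CustomerID, o.CustFirstName, o.CustLastName, o.CustSalesRegion
FROM OldCustomerDemographics o
LEFT JOIN NewSalesRegions r
    ON o.CustSalesRegion = r.RegionCode
WHERE r.RegionID IS NULL
```

With the sample data above, this returns the rows with NULL, MIDWEST, NONE, and CENTER regions.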
Now that the data is in place, we can create the SSIS package to perform the data manipulation. Here I've created an SSIS package with a single Data Flow Task in the Control Flow pane, and in that data flow I've added an OLE DB Source connected to the OldCustomerDemographics table we just created (Figure 1).
Figure 1
Now I bring over a Lookup Transformation and add it to the data flow. This transformation will attempt to match the CustSalesRegion column from our OldCustomerDemographics table with the RegionCode column in the NewSalesRegions lookup table (Figure 2). Rows that match successfully will pass through, and rows with no match in the lookup table will be sent to the Error Output. In our case, we will configure the Error Output to redirect rows (Figure 3).
Figure 2
Figure 3
We'll come back and revisit the rows that are matched, but for now let's explore the rows that do not match an entry in the lookup table. In our example, we will update the unmatched rows so that they belong to the NATIONAL sales region. To properly audit this activity, we will send the updated rows, with both the old and new values, to the audit table we created above. To do this, I've added a Derived Column transform to specify the new sales region value, along with a Multicast transform so we can send the data both to the audit table and back to the primary output (Figure 4).
Figure 4
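For reference, the work done on this audit path can be expressed in set-based T-SQL. This sketch is not generated by SSIS; it simply shows what the Derived Column and audit destination accomplish for the unmatched rows:

```sql
-- T-SQL equivalent of the audit path:
-- capture the unmatched rows with their old region value
-- plus the new NATIONAL value we are assigning
INSERT OldCustomerDemographics_AUDIT
    (CustomerID, CustFirstName, CustLastName, CustSalesRegion, CustStatus, AUDIT_NewSalesRegion)
SELECT o.CustomerID, o.CustFirstName, o.CustLastName,
       o.CustSalesRegion, o.CustStatus,
       'NATIONAL'                     -- the Derived Column value
FROM OldCustomerDemographics o
LEFT JOIN NewSalesRegions r
    ON o.CustSalesRegion = r.RegionCode
WHERE r.RegionID IS NULL             -- only the rows that failed the lookup
```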
I've wired an OLE DB Destination to the audit table (OldCustomerDemographics_AUDIT) we created above. You can see that I've mapped the new sales region value to the audit column we created (Figure 5).
Figure 5
That takes care of our auditing, but we still need to merge the updated rows with those that were matched in the Lookup Transformation. Using the Union All transformation, I have brought these two data streams back into one. In Figure 6, you will see that I am using the NewSalesRegion value from the Derived Column for those rows that we have opted to update.
Figure 6
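If you prefer to see the whole data flow in one place, the net effect of the lookup, derived column, and union can be sketched as a single set-based statement. Again, this is just a reference translation, not something the package executes:

```sql
-- T-SQL equivalent of the full data flow:
-- matched rows keep their region; unmatched rows get NATIONAL
INSERT NewCustomerDemographics
    (CustomerID, CustFirstName, CustLastName, CustSalesRegion, CustStatus)
SELECT o.CustomerID, o.CustFirstName, o.CustLastName,
       CASE WHEN r.RegionID IS NULL
            THEN 'NATIONAL'          -- the audit/update path
            ELSE o.CustSalesRegion   -- the matched path
       END,
       o.CustStatus
FROM OldCustomerDemographics o
LEFT JOIN NewSalesRegions r
    ON o.CustSalesRegion = r.RegionCode
```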
Lastly, we'll send the output from the Union All transform to an OLE DB Destination, which is wired to the NewCustomerDemographics table we created in an earlier step. You can see in Figure 7 that 7 rows were sent through the audit/update path of the data flow. If you open SSMS and query the OldCustomerDemographics_AUDIT table as shown in Figure 8, you will see the 7 updated rows along with both the old and new sales region values.
Figure 7
Figure 8
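The query behind Figure 8 is a straightforward select against the audit table, aliasing the original column so the old and new values read side by side:

```sql
-- Review the audited updates: old region alongside the new value
SELECT CustomerID, CustFirstName, CustLastName,
       CustSalesRegion AS OldSalesRegion,
       AUDIT_NewSalesRegion
FROM OldCustomerDemographics_AUDIT
```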
In this brief example, we've seen a simple, effective method for auditing data updates that occur during ETL processing in SQL Server Integration Services.