Defining the Problem
Imagine I have two different systems - say, a CRM and an ERP - on different SQL Servers. Both systems store some attributes about widgets, but neither can provide the complete picture by itself.
Lots of consumers want to interact with widgets holistically - that is to say, they need to know about attributes defined in both of those sources - which often results in nasty solutions like federated (linked server) queries in real time. The problem is extensive enough that we decide to consolidate both sources into a single database which is updated in near-real-time.
This database will be a kind of operational data store (ODS), which defines a canonical data model (CDM) composed of some transformed subset of the desired attributes from all of the upstream systems. Not only does this allow consolidation of the disparate source data, it also lets us make our own enterprise data model, eliminating design defects in the source systems, removing redundancies and ambiguities, and making it easier for our developers to interact with a model that matches our enterprise instead of matching some vendor's idea of a generic enterprise.
But how will I populate the ODS in near real time? That's something I am currently still working on, and where I've gotten to so far is going to be the subject of my post today. I expect that the architecture I describe might be a little different from the normal "near real time ETL" designs out there.
Creating dummy source systems
Let's create a bit of schema to represent our source systems. I'm going to create two schemas in the one database for simplicity:
-- create schemas to act as our two source systems
create schema ERP
go
create schema CRM
go

-- source data
-- for demo simplicity we will assume that the ERP and CRM share id's,
-- ie, there is only one master system for widget creation
create table ERP.Widgets(id int primary key, a char, b char)
go
create table CRM.Widgets(id int primary key, c char, d char)
go
Defining the common model
Now let's move over to the ODS and create the common model for widgets.
My common model isn't just the union of all attributes from the sources; I'm going to add a little transformation. For my demo I've decided I don't need the "b" column from the ERP, and the "a" column data should be uppercase. I'm also going to add a couple of metadata columns.
I'll create my ODS Widget table in a different database so that I can incorporate replication later (which requires that publisher and subscriber be on different databases), but I'll create an ODS schema too just to make it obvious where each object "belongs".
use ODS
go

-- a schema to make it obvious where the common-model objects "belong"
create schema ODS
go

-- the ODS common model for Widgets
create table ODS.Widgets(
    id int primary key,
    upper_a char,
    c char,
    d char,
    inserted_dt datetime default getdate(),
    updated_dt datetime
)
go
An interface for the T and the L of ETL (we're going to be agnostic about what does the E part)
OK, so how should we move data from the source Widgets tables into our common model? For now let's ignore the "E" part of ETL, and focus on the T and the L.
I want my transforms and loads to be efficient. This means I want to limit IO. In a previous post I talked about the inefficiency of message oriented architectures, so with that in mind I want to provide a loading interface that can process sets of data where available. Maybe I receive a single tuple, maybe I receive ten-thousand at once. I want the interface to be the same in both cases.
Since I'm working in sets, I'd also like to do those transforms using TSQL, because that's the kind of thing TSQL is good at. Some might ask "what about SSIS?" I use SSIS a great deal in my job, but even after nearly 10 years with SSIS I still find TSQL to be much more powerful, intuitive and maintainable. This is particularly the case when your transforms aren't simple scalar operations, and often even when they are (SSIS still can't even format a date without a ridiculous amount of expression code).
Aside: Transforming data for point-to-point integration
I'd like to take a moment here to talk a little bit about the general problem of transforming data from one system to fit another - not necessarily in the context of a data warehouse or ODS, but as a general point-to-point integration problem. The kind of problem that SSIS is generally designed to solve.
Transformation can certainly be done in SSIS, but I believe there is generally a better approach, and I want to explain why.
Let me describe two different categories of transform. The first category is the kind where you have a tuple coming out of a source system and you need to perform some kind of scalar transform that depends only upon the data in that tuple. For example, I might have an invoice line tuple containing line_total_amt and tax_amt, and I want to write to a system that has a tax_exclusive_amt column. This is a simple scalar transform that depends only on the data in the tuple. I can add a derived column that simply subtracts the tax_amt from the line_total_amt. SSIS can do this almost as easily as TSQL.
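In TSQL that first category is barely worth mentioning; here's a minimal sketch, with hypothetical invoice-line table and column names standing in for the real schema:

-- first-category transform: purely scalar, uses only columns from the current tuple
select
    invoice_id,
    line_no,
    tax_exclusive_amt = line_total_amt - tax_amt
from ERP.InvoiceLines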
But there's a second category of transform that I'll simply define as "anything not in the first category". We might, for example, have a requirement like: "Send accounts from the ERP system to the CRM system. If the account has any overdue invoices, write the value 'overdue' to the account_status column in the CRM accounts table". This, again, is the creation of a derived column, but it's one that is much easier to construct in TSQL than with lookup transforms in SSIS. And if you need several transforms of this sort then the SSIS package gets more and more unwieldy. In TSQL, it's just another join in your query.
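Sketched in TSQL - with hypothetical ERP.Accounts and ERP.Invoices tables standing in for the real schema - that second-category rule really is just another join in the query:

-- second-category transform: the derived column depends on data outside the tuple
select
    a.account_id,
    a.account_name,
    account_status = case when od.account_id is not null then 'overdue' else a.account_status end
from ERP.Accounts a
left join (
    select distinct account_id
    from ERP.Invoices
    where due_dt < getdate() and paid_flag = 0   -- the "overdue" rule lives here (hypothetical columns)
) od on od.account_id = a.account_id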
It's for this reason that I generally advocate a T.E.L. pattern for point-to-point integration. Write views, procedures or functions in your source systems that do as much transformation as possible. Have these queries expose the data in the shape expected by the destination. This includes things like collation settings. SSIS becomes a simple data pump, selecting from these source objects and writing to destination objects.
The fact that SSIS provides a horrible UI for editing data source queries (tiny window, no syntax highlighting, variable width font) only reinforces my suggestion to put the transforms into the source system and use SSIS as a simple pump.
In my enterprise many of the source databases are proprietary, with schemas we can't modify. So we need somewhere else to create those transform-query objects. My solution has been to create a database called "integration" on each instance. Inside this database is a schema for any destination system requiring data from that source. The transform queries live in that schema. You could of course create an integration database "sibling" for every source database. This makes it easier to move systems between servers... just pick up the system DB and the entire sibling database, rather than having to migrate objects from one integration DB to another. But it does mean having a lot more databases.
For example, say my CRM system on Server2 wants to get Accounts from the ERP system on Server1. In this scenario, the transform query would live in an object called Server1.integration.CRM.Accounts.
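As a sketch of that pattern - object names hypothetical, and assuming the ERP database on Server1 is called ERPdb - the transform query from the aside above gets wrapped up like this, and SSIS simply pumps rows out of it and into the CRM:

use integration
go
create schema CRM
go
-- exposes ERP accounts in exactly the shape the CRM expects to receive them
create or alter view CRM.Accounts as
select
    a.account_id,
    a.account_name,
    account_status = case when od.account_id is not null then 'overdue' else a.account_status end
from ERPdb.ERP.Accounts a
left join (
    select distinct account_id
    from ERPdb.ERP.Invoices
    where due_dt < getdate() and paid_flag = 0
) od on od.account_id = a.account_id
go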
I call this the "integration database pattern".
We can't use this pattern here, though, because we're not doing point-to-point integration. We're combining data from multiple sources into a single destination, and transforms may depend on cross-system information.
Now here's the 60 thousand dollar question: How can I create an interface that is agnostic of what is writing to it, can receive a set of data, transform it to a different shape, and deliver the result to the ODS.Widgets table in a way that minimizes IO?
We could use stored procedures that take table-valued parameters as arguments. But if we ever want to write data to the interface using SSIS that's going to be annoying (unless you really like creating custom data flow destinations), and it's just harder to develop against in general. Wouldn't it be nice if we could just issue regular insert, update and delete commands?
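For contrast, here's roughly what the TVP flavour of the interface would look like - a minimal sketch with hypothetical names and the transform/merge body omitted. It works fine from TSQL, but every caller has to know about the table type and the procedure, which is exactly the friction I want to avoid:

-- a table type matching the shape of the ERP source rows
create type dbo.ERP_WidgetRows as table (id int, a char, b char)
go
create or alter procedure dbo.Load_ERP_Widgets
    @rows dbo.ERP_WidgetRows readonly
as
begin
    set nocount on
    -- the transform and merge into ODS.Widgets would go here
end
go
-- calling it from TSQL is easy enough...
declare @w dbo.ERP_WidgetRows
insert @w (id, a, b) values (1, 'x', 'y')
exec dbo.Load_ERP_Widgets @rows = @w
-- ...but from SSIS or another external tool it's far more awkward than a plain insert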
How about I create some views with instead-of triggers on them? This gives me a standard relational interface - it looks like a table as far as anyone is concerned - but it means I don't actually have to write to staging tables, read back from those staging tables, do the transforms, and then write the data to the common model. A single operation will do both the transform and the load without the data having to be written to any intermediary.
I'll create views to act as the "landing pad" for the data from each of the source systems. To get started, I'll create schemas inside the ODS database to contain these views. Each source system will have its own "landing pad" schema:
-- these schemas will act as the surface area for writing to the ODS database
create schema ERP
go
create schema CRM
go
And for each source system table, I'll create a landing pad view that matches the schema of the source in terms of columns.
Wait a minute - is inserting into views with instead-of triggers actually faster than staging the data and then doing the transform?
Before committing to this path, it's worth asking the question: Which performs better?
- Write to a staging table. Select back out of that table (either in "after" triggers, or a procedure), do a transform, and write to the final common model table
- Write to a view and invoke an instead-of trigger to transform the data and redirect the incoming rows to the final common model table
The answer, as you might expect, seems to be "it depends". What it depends on exactly I haven't yet fully investigated. But note that the mechanics are also going to be slightly different. If you always write to a staging table first, then you have to worry about things like primary key violations. If you try to insert the same row twice, for example, the insert will fail before the merge trigger or procedure has a chance to switch the operation to an update. With an "instead-of" trigger this never comes up. Plus, of course, if you're actually writing to a staging table then you're consuming more space in your ODS database. But storage is cheap, as they say, so that might not be a concern.
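To make the primary-key point concrete, here's a tiny hypothetical illustration (the staging table isn't part of the design; it exists only to show the failure mode):

-- hypothetical staging-table approach
create table dbo.Widgets_staging (id int primary key, a char, b char)
go
insert dbo.Widgets_staging (id, a, b) values (1, 'x', 'y')
insert dbo.Widgets_staging (id, a, b) values (1, 'z', 'y')
-- the second insert fails with a PK violation before any after-trigger or procedure
-- gets the chance to treat it as an update; against an instead-of view the second
-- insert simply arrives in the trigger, which can merge it as an update instead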
Creating the landing-pad objects
But views have to select something, so what are we selecting?
I could select a single row of literal nulls cast to the correct datatype, but there's an arguably more useful approach. Select from the common model of course! So reading from these views to some extent "untransforms" the data in the common model back to the source model. We can't always do a full reverse transform. For example, I can't restore the original case of the source data once it's been transformed to uppercase. That original case information is lost. But I can provide meaningful output where it's sensible to do so.
I'm also going to add an extra "metadata" column to each interface view that doesn't exist as an attribute in the source entity: a "delete_flag" column that will make it easier to work with certain kinds of data - specifically, data coming from CDC-like sources. Don't worry about this column for now, it will all make sense in a moment:
-- The columns being output by the views match the columns that will come from the source,
-- that way performing writes from the source to the destination is very straightforward to develop, just line up the column names!
-- I'll add a "delete_flag" column so we can insert rows but process them as if they were deletes.
-- This will be super handy when working with SSIS or CDC as upstream components.
create or alter view ERP.Widgets as
select
    id          = id,
    a           = upper_a,
    delete_flag = cast(null as bit)
from ODS.Widgets
go

create or alter view CRM.Widgets as
select
    id          = id,
    c           = c,
    d           = d,
    delete_flag = cast(null as bit)
from ODS.Widgets
go
I could of course create staging tables with instead-of triggers on them rather than using views. Those tables would never actually contain any data; it would always be redirected by the instead-of trigger(s). But by using views these objects are a bit more useful. Also, using tables will only work with "CDC-like" input data, because updates and deletes won't have anything in the inserted or deleted tables inside the triggers. More about that in a second...
A "design principle" objection might be raised here, and it's one which niggles me a little. The interface view purports to be a view that returns data in the shape that it came from the source system. But it doesn't really do this. There are a lot of cases where transforms can't be reversed by the code in the view - like setting all uppercase, or merging two columns into one. So these views are doing a bit of false advertising. That's bad for maintainability.
If you feel like the problem of false advertising outweighs the benefit of providing "somewhat sensible" output from the view, I suggest an alternative: Have the view return only key attributes and metadata columns, and literal null values for all other columns it receives as input. Something like this:
create or alter view CRM.Widgets as
select
    id          = id,
    c           = cast(null as char),
    d           = cast(null as char),
    delete_flag = cast(null as bit),
    inserted_dt = inserted_dt,
    updated_dt  = updated_dt
from ODS.Widgets
go
Now the view doesn't try to untransform anything, but still provides some useful output. Of course if you use tables as the landing pads then this problem doesn't come up, but you are left with a weird object in your database - a table that will never have any data in it.
So, here's how this is going to work in principle: We're going to take changes from the source system and push those changes into the ODS using these views. For example, if a row is inserted into the Widgets table in the CRM system, we will issue an insert against the CRM.Widgets view in the ODS database. The triggers on the view will perform the transform to the common model and redirect the insert to the ODS.Widgets table.
So, why the delete_flag column? Well, there are two ways a source system might want to communicate with the ODS. One way would be to issue an instruction that says "These rows were deleted at my side, so perform the same delete at your side". That's the "obvious" way - and that's how replication will do it.
But if you're using CDC things work a bit differently. When a row is inserted, updated or deleted at the source system a row will be inserted into a change table, with a column to indicate which kind of operation happened.
We could read from the CDC data, split the output into three sets based on the __$operation column that CDC provides, and perform three different kinds of operations against the view. But there's a "simpler" way: We can just insert everything from the CDC function into the view, including the column indicating the operation type, and write code in the trigger(s) to deal with each case appropriately.
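Here's a hedged sketch of that approach, assuming CDC is enabled on the source ERP.Widgets table with the default capture instance name ERP_Widgets, and that this query runs somewhere it can see both the CDC function and the ODS landing-pad view. The operation column is mapped onto delete_flag, and in reality you'd persist the last LSN processed instead of re-reading from the minimum every time:

declare @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('ERP_Widgets'),
        @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn()

insert ERP.Widgets (id, a, delete_flag)
select
    id,
    a,
    -- __$operation: 1 = delete, 2 = insert, 4 = update (after image)
    delete_flag = case when __$operation = 1 then 1 else 0 end
from cdc.fn_cdc_get_all_changes_ERP_Widgets(@from_lsn, @to_lsn, 'all')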
As it turns out the only "type" of operation we need to explicitly nominate is a delete at the source. For inserts or updates at the source, the operation we need to perform in the ODS depends on whether the row already exists in the ODS - remember, multiple systems are writing data to the ODS! As such, inserts at the source could become updates in the ODS, or updates at the source could become inserts at the ODS. If we are clever with our merge code it doesn't matter!
Here's one way of writing the instead-of triggers. I'm using one trigger to handle all operations just for coolness in this article, but in reality I'd probably split off the instead-of-delete trigger to improve the efficiency of the conditionals.
-- I'm going to use one big fancy "do everything" trigger just for demo fun. It is simpler to have a separate trigger
-- for deletes - and more efficient too no doubt, but more code.
-- I will discard the "b" column, and am transforming the "a" column to uppercase.
-- Note that whether the original operation on the view was an insert or an update is actually not relevant.
-- If inserted.id is not null we are doing an insert OR an update. We will choose between these two options
-- based solely on whether we already have that row in the ODS!
create or alter trigger ERP.Widgets_transform on ERP.Widgets instead of insert, update, delete
as
begin
    set nocount on

    merge ODS.Widgets tgt
    using (
        select
            id          = coalesce(i.id, d.id),
            iid         = i.id,
            did         = d.id,
            a           = i.a,
            delete_flag = isnull(i.delete_flag, 0)
        from inserted i
        full join deleted d on i.id = d.id
    ) src on src.id = tgt.id
    when not matched by target and src.iid is not null and src.delete_flag = 0 then
        -- not in CDM, not a delete operation
        insert (id, upper_a) values (src.id, upper(src.a))
    when matched and src.iid is not null and src.delete_flag = 0 then
        -- in CDM, not a delete operation
        update set tgt.upper_a = upper(src.a), tgt.updated_dt = getdate()
    when matched and (src.iid is null or src.delete_flag = 1) then
        -- in CDM, delete operation
        delete;
end
go

-- no transforms here, just moving the CRM columns into the CDM Widgets
create or alter trigger CRM.Widgets_transform on CRM.Widgets instead of insert, update, delete
as
begin
    set nocount on

    merge ODS.Widgets tgt
    using (
        select
            id          = coalesce(i.id, d.id),
            iid         = i.id,
            did         = d.id,
            c           = i.c,
            d           = i.d,
            delete_flag = isnull(i.delete_flag, 0)
        from inserted i
        full join deleted d on i.id = d.id
    ) src on src.id = tgt.id
    when not matched by target and src.iid is not null and src.delete_flag = 0 then
        -- not in CDM, not a delete op
        insert (id, c, d) values (src.id, src.c, src.d)
    when matched and src.iid is not null and src.delete_flag = 0 then
        -- in CDM, not a delete op
        update set tgt.c = src.c, tgt.d = src.d, tgt.updated_dt = getdate()
    when matched and (src.iid is null or src.delete_flag = 1) then
        -- in CDM, delete op
        delete;
end
go
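A quick smoke test, run in the ODS database once the objects above exist, shows the behaviour described:

-- an insert from the ERP side becomes an insert into the common model, with "a" uppercased
insert ERP.Widgets (id, a) values (1, 'x')

-- the CRM side "inserts" the same widget; the trigger sees it already exists and updates instead
insert CRM.Widgets (id, c, d) values (1, 'y', 'z')

select * from ODS.Widgets   -- one row: id 1, upper_a 'X', c 'y', d 'z'

-- a CDC-style delete: an insert with delete_flag = 1 removes the row from the common model
insert ERP.Widgets (id, delete_flag) values (1, 1)

select * from ODS.Widgets   -- empty again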
What's been achieved here?
What I now have is a pattern which receives tuples from source systems, in the shape of those source systems. The interface can take single rows, or whole sets of data. It can be fed data from CDC-like or replication-like sources. You can even use the landing-pad views as data-flow destinations in SSIS, or send from an RBAR message-oriented source (which is basically, and unfortunately, what SQL replication does). The interface applies transforms to the incoming data, writing it to a common data model in an ODS. The source systems don't need to know anything about the data in the ODS; the transforms always handle the rows appropriately depending on the current state of the ODS. The pattern doesn't stage the incoming data anywhere before reading it back out and transforming it to the final model, which saves on IO. Transforms are done in a set-based manner using TSQL.
But there's a problem with this design: it can't actually act as a receiver for operations coming from SQL replication! If you try to take a source table and replicate it, setting the view as the target object, replication won't work. After working through several roadblocks I have gotten to the point where replication complains that "alter table cannot be used against a view" when trying to initialize the subscription. I've been trying to figure out if there's a way to set up replication options that would allow the replication target to be a view, but have so far been unsuccessful.
Maybe replacing the interface views with the (somewhat less fancy) empty-table solution will work with replication. But that's going to involve some trickery, because as I wrote earlier, empty tables won't have contents in the inserted or deleted tables for update or delete operations inside the triggers. That's what I'm going to be working on next. I will come back with another blog entry once I know more!