In the course of developing an enterprise data warehouse I found a pairing of methodologies and tools that went together like chocolate and peanut butter. The organization had given us a mandate to create a new version of our data warehouse on an aggressive timeline. Additionally, the data needed to be sourced from a scattering of data stores across the enterprise. We needed to be agile, and we planned on using sprints to parcel out the work.
Over the course of the project, traditional dimensional modeling became challenging. The data for a single business concept could come from any number of source systems in the enterprise. For example, there were several systems that described an employee and their attributes. This could have been overcome, but there was a nagging feeling that a static solution to this particular problem would quickly be outgrown. As business needs and systems changed, we would be forced to continually revise our ETL for person data to accommodate new structures.
About this time I happened to hear about the Data Vault methodology and how it could be applied to staging data. Admittedly, there is much more that can be done with the Data Vault approach beyond simple staging. For brevity I will focus on the ability of the vault to aggregate data from separate systems so it can later be leveraged by a dimensional model.
At a very high level, the Data Vault methodology uses three types of tables to model data. The first and easiest to conceptualize is the hub. Hubs are the entities within the model; think nouns. The second type is the link table. It describes the relationship between hubs, much like a mapping table. And finally there are satellites. These are tables hung off of hubs and links to further describe an entity or its relationships.
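As a rough sketch of what those three table types can look like in SQL Server (the table and column names here are illustrative, not our actual model):

```sql
-- Hub: one row per unique business key (the "noun"), plus load metadata.
CREATE TABLE dbo.HubPerson (
    HubPersonKey        INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PersonBusinessKey   VARCHAR(50)       NOT NULL UNIQUE,  -- e.g. employee number
    LoadDate            DATETIME2         NOT NULL,
    RecordSource        VARCHAR(100)      NOT NULL
);

CREATE TABLE dbo.HubDepartment (
    HubDepartmentKey        INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    DepartmentBusinessKey   VARCHAR(50)       NOT NULL UNIQUE,
    LoadDate                DATETIME2         NOT NULL,
    RecordSource            VARCHAR(100)      NOT NULL
);

-- Link: the relationship between hubs, much like a mapping table.
CREATE TABLE dbo.LinkPersonDepartment (
    LinkPersonDepartmentKey INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    HubPersonKey            INT NOT NULL REFERENCES dbo.HubPerson (HubPersonKey),
    HubDepartmentKey        INT NOT NULL REFERENCES dbo.HubDepartment (HubDepartmentKey),
    LoadDate                DATETIME2    NOT NULL,
    RecordSource            VARCHAR(100) NOT NULL
);

-- Satellite: descriptive attributes hung off a hub (or a link), one per source or context.
CREATE TABLE dbo.SatPersonHR (
    HubPersonKey    INT          NOT NULL REFERENCES dbo.HubPerson (HubPersonKey),
    LoadDate        DATETIME2    NOT NULL,
    RecordSource    VARCHAR(100) NOT NULL,
    FirstName       VARCHAR(100) NULL,
    LastName        VARCHAR(100) NULL,
    JobTitle        VARCHAR(100) NULL,
    CONSTRAINT PK_SatPersonHR PRIMARY KEY (HubPersonKey, LoadDate)
);
```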
There is a lot more to read about the Data Vault from Dan Linstedt. As the creator of the Data Vault methodology, he has several books, articles, and training courses available on how to leverage it.
In our person or employee example, we built a hub for the person from which we could hang satellites. This allowed us to decorate the hub with the different sources of person data, and it is extensible: in the future we can add satellites to capture additional attributes from new sources as well. By staging data in this manner we had future-proofed our design. Furthermore, the links that describe that person's relationships to other hubs could themselves be extended with additional satellites.
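To make that extensibility concrete, a minimal sketch of onboarding a new source of person data, here a hypothetical payroll system, is just one new satellite hung off the same hub; nothing that already exists has to change:

```sql
-- A new source system gets its own satellite against the existing hub.
-- The hub, the other satellites, and the downstream ETL are untouched.
CREATE TABLE dbo.SatPersonPayroll (
    HubPersonKey  INT          NOT NULL REFERENCES dbo.HubPerson (HubPersonKey),
    LoadDate      DATETIME2    NOT NULL,
    RecordSource  VARCHAR(100) NOT NULL,
    PayGrade      VARCHAR(20)  NULL,
    CostCenter    VARCHAR(20)  NULL,
    CONSTRAINT PK_SatPersonPayroll PRIMARY KEY (HubPersonKey, LoadDate)
);
```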
Now that we had a design approach that would allow for some extensibility in staging, we needed a way to populate the tables. In researching the Data Vault I found mention that a large portion of the ETL from the vault to a traditional dimensional model could be created programmatically. Since my hubs had foreign key relationships to their satellites, by exploring the structure of the model and applying some rules I could create dimensions from hubs and facts from links.
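The discovery piece can be as simple as interrogating SQL Server's catalog views. A sketch of that kind of query, assuming a Hub/Sat naming convention like the illustrative one above:

```sql
-- Find every satellite table and the hub it references, based on the
-- foreign keys in the vault and a Hub*/Sat* naming convention.
SELECT
    hub.name AS HubTable,
    sat.name AS SatelliteTable,
    col.name AS ForeignKeyColumn
FROM sys.foreign_keys AS fk
JOIN sys.tables AS sat ON fk.parent_object_id = sat.object_id
JOIN sys.tables AS hub ON fk.referenced_object_id = hub.object_id
JOIN sys.foreign_key_columns AS fkc ON fk.object_id = fkc.constraint_object_id
JOIN sys.columns AS col
    ON fkc.parent_object_id = col.object_id
   AND fkc.parent_column_id = col.column_id
WHERE hub.name LIKE 'Hub%'
  AND sat.name LIKE 'Sat%'
ORDER BY hub.name, sat.name;
```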
Next we wanted to maintain the extensibility of staging and pull that forward to the population of our dimensional model. We needed to rapidly develop and deploy ETL code.
I had just heard about BIML from a talk by Andy Leonard (b | t). In a discussion of his incremental load pattern, Mr. Leonard described how, with some programmatic discovery, he could create multiple SSIS packages with ease. By wiring BIML up to a source and a destination, it could create as many packages as there were staging and target tables. Through a demo he showed how traditional SSIS development effort could be largely scripted for simple packages.
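This is not Mr. Leonard's pattern verbatim, but the general shape of the incremental loads such generated packages perform (insert rows that are new, update rows that have changed) might look like this in T-SQL; the object names are again illustrative, and the staging view is sketched a little further below:

```sql
-- Insert rows whose business key is not yet in the dimension.
INSERT INTO dbo.DimPerson (PersonBusinessKey, FirstName, LastName, JobTitle, PayGrade)
SELECT s.PersonBusinessKey, s.FirstName, s.LastName, s.JobTitle, s.PayGrade
FROM dbo.vw_DimPerson AS s
LEFT JOIN dbo.DimPerson AS d
    ON d.PersonBusinessKey = s.PersonBusinessKey
WHERE d.PersonBusinessKey IS NULL;

-- Update rows whose attributes have changed since the last load.
UPDATE d
SET d.FirstName = s.FirstName,
    d.LastName  = s.LastName,
    d.JobTitle  = s.JobTitle,
    d.PayGrade  = s.PayGrade
FROM dbo.DimPerson AS d
JOIN dbo.vw_DimPerson AS s
    ON s.PersonBusinessKey = d.PersonBusinessKey
WHERE EXISTS (
    -- NULL-safe change detection: EXCEPT returns a row only when the values differ.
    SELECT s.FirstName, s.LastName, s.JobTitle, s.PayGrade
    EXCEPT
    SELECT d.FirstName, d.LastName, d.JobTitle, d.PayGrade
);
```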
So with my extensible Data Vault model for staging to draw from, and a dimensional model set as the ultimate destination, work began with BIML. BIML allowed us to use the structure of objects in SQL Server to largely create our ETL. By using views on top of the vault model to transform our person hub and its satellites into a dimension, BIML could create an incremental load package to feed the final model.
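A simplified sketch of such a view, reusing the illustrative names from above and assuming we only want the most recent row from each satellite:

```sql
-- Flatten the person hub and its satellites into the shape of DimPerson.
-- Only the latest row from each satellite is kept (illustrative logic).
CREATE VIEW dbo.vw_DimPerson
AS
SELECT
    h.PersonBusinessKey,
    hr.FirstName,
    hr.LastName,
    hr.JobTitle,
    pay.PayGrade
FROM dbo.HubPerson AS h
OUTER APPLY (
    SELECT TOP (1) s.FirstName, s.LastName, s.JobTitle
    FROM dbo.SatPersonHR AS s
    WHERE s.HubPersonKey = h.HubPersonKey
    ORDER BY s.LoadDate DESC
) AS hr
OUTER APPLY (
    SELECT TOP (1) s.PayGrade
    FROM dbo.SatPersonPayroll AS s
    WHERE s.HubPersonKey = h.HubPersonKey
    ORDER BY s.LoadDate DESC
) AS pay;
```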
Eventually there were scripts to create views over the hubs and links that aggregated the satellite data, and BIML templates that pushed data from those views into the final dimensional model. With this level of automation we were able to refactor models with impunity. Our ETL was refactorable at the push of a button, err, two buttons: one for staging and one for the dimensional model. The project evolved into one of conceptual modeling, where some of the more mundane ETL development tasks were handled for us. We were able to iterate over models several times in a sprint. Sprints became focused around business processes and the technologies or systems that supported them. With the extensibility of the vault and the ability to refactor code, we attained an admirable level of agility, and the methodology and tool paired very nicely. The end result was a system and approach that was as good as chocolate and peanut butter.