Incremental load and complex ETL

  • I am tasked with improving an ETL process I have inherited.  The target system is being called a data warehouse, but it was not designed with a traditional DW fact/dimension architecture.  The ultimate goal is to bring data from multiple sources into this "warehouse" in a unified format for reporting and analysis, but for the moment I'm only working with one of the sources.

    The process as it stands does a full load on each run, truncating the target tables and reloading them.  The current goal is to convert this to an incremental load.  

    The part I'm having trouble wrapping my head around is that each target table is a combination of multiple source tables, which makes using CDC for incremental loading less than straightforward.  Right now there are staging tables, each loaded from multiple source tables, and the final destination tables are then loaded from one or more source tables -- and because there are holes in the source data, loading the final tables sometimes has to be done in multiple steps.

    Yeah, it's kind of a mess; I didn't design any of it. 

    What are my options here? Is it possible to work with the current structure to achieve an incremental load?  Can it be done with relatively minimal changes, like extra staging tables or a two-part staging process -- extract change data to first stage table, process into something like the current stage tables, then load the target tables from there?  Or do I need to just scrap everything and start over?
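    To make the two-part staging idea concrete, here is a minimal sketch of one common way to do it (all table and column names here are hypothetical, not the actual schema): a first stage extracts only rows changed since a saved watermark, and the target is then upserted from that change set instead of being truncated and reloaded. It uses SQLite purely so the example is self-contained; on SQL Server the same shape would typically use CDC or `MERGE`.

```python
# Sketch of a watermark-driven two-part staging load.
# Assumptions (hypothetical, for illustration only): a source table with an
# updated_at column, a watermark table, one change-staging table, one target.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE src_orders   (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER);
CREATE TABLE etl_watermark(table_name TEXT PRIMARY KEY, last_ts INTEGER);
CREATE TABLE stg_changes  (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER);
CREATE TABLE tgt_orders   (id INTEGER PRIMARY KEY, amount REAL);
INSERT INTO etl_watermark VALUES ('src_orders', 0);
""")

def incremental_load():
    # Stage 1: extract only rows changed since the last successful run.
    (last_ts,) = cur.execute(
        "SELECT last_ts FROM etl_watermark WHERE table_name = 'src_orders'"
    ).fetchone()
    cur.execute("DELETE FROM stg_changes")
    cur.execute(
        "INSERT INTO stg_changes SELECT id, amount, updated_at "
        "FROM src_orders WHERE updated_at > ?", (last_ts,))

    # Stage 2: upsert the change set into the target (no truncate).
    cur.execute(
        "INSERT INTO tgt_orders (id, amount) "
        "SELECT id, amount FROM stg_changes WHERE true "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount")

    # Advance the watermark so the next run skips already-processed rows.
    cur.execute(
        "UPDATE etl_watermark SET last_ts = "
        "(SELECT COALESCE(MAX(updated_at), last_ts) FROM stg_changes) "
        "WHERE table_name = 'src_orders'")
    conn.commit()

# First run loads the initial rows; the second picks up only the changes.
cur.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                [(1, 10.0, 100), (2, 20.0, 100)])
incremental_load()
cur.execute("INSERT INTO src_orders VALUES (3, 30.0, 200)")
cur.execute("UPDATE src_orders SET amount = 15.0, updated_at = 200 WHERE id = 1")
incremental_load()
```

    The same pattern generalizes to staging tables fed by several sources: each source keeps its own watermark, and the second stage combines the change sets before the final upsert.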

  • If your target tables are populated from multiple staging/source tables, whether you can use CDC really depends on the relationships between those source tables. For instance, if you have a transaction-like source table (such as sales transactions) and a source table that contains attributes (such as transaction types), you will be able to use CDC to filter the transaction table but not the attribute table. You'll have to decide this on a case-by-case basis. 
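    As a minimal illustration of that case-by-case split (with made-up table shapes, not the poster's schema): the large transaction feed is filtered by a change watermark, while the small attribute table, which has no reliable change-tracking column, is simply pulled in full on every run.

```python
# Hypothetical sketch: incremental extract for the transaction-like table,
# full extract for the small attribute-like table.
def extract(transactions, txn_types, last_run_ts):
    # Transaction table: only rows changed since the last run.
    changed_txns = [t for t in transactions if t["updated_at"] > last_run_ts]
    # Attribute table: small enough to reload in full each run.
    all_types = list(txn_types)
    return changed_txns, all_types

transactions = [
    {"id": 1, "type_id": 10, "updated_at": 100},
    {"id": 2, "type_id": 11, "updated_at": 250},
]
txn_types = [{"id": 10, "name": "sale"}, {"id": 11, "name": "refund"}]

changed, types = extract(transactions, txn_types, last_run_ts=200)
# Only transaction 2 is newer than the watermark; every type row comes back.
```

    The downstream join that builds the combined target table then only needs to run over the changed transactions, with the attribute table available in full for lookups.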

    In theory you should be able to implement CDC (or some other filtering mechanism) for some tables and adjust the load process to update incrementally, but not having seen the entire process, it is difficult to say whether that is really feasible or whether starting over is more appropriate. Part of the question is also what your future work will consist of: are you planning to do the same with the other data sources, and would something like BIML let you create a more repeatable process that requires less effort later? It's better to look at it from a global perspective, as that might change how you address this one data source. 

    Hope this helps.
