July 18, 2008 at 6:03 am
I'll admit up front to being an opinionated old curmudgeon. But the more I learn about today's "ETL + Data Warehouse + Report Generator" fad (and I may well just not understand enough), the more skeptical I become.
I would have thought that by now enterprises would be developing and exploiting dynamically linked, interconnected, heterogeneous systems with clearly defined, well-documented authoritative data sources. Instead I find archaic batch-mode data transfers, redundant data, and limited tools for inquiry and analysis. To encourage discussion, and to convince me of the error of my ways, here are some of my opinionated concerns and biases, based on the Data Warehouse proposals I have been seeing from a parade of vendors, consultants and IT staff. Have I been disillusioned by poor design proposals, or is it the nature of the beast? Perhaps those of you who are proponents of these systems can dispel my concerns.
[h1]Data Warehouses[/h1]
•A Data Warehouse is not a "new single integrated repository of enterprise data"; it is simply yet another (and likely conflicting) repository.
•Data Warehouse implementations too often rely on source data based on flat file snapshot reports which do not reflect subsequent additions or retroactive corrections.
•Data Warehouses typically offer only "report based" access and screen-at-a-time inquiry. (Instead, the underlying data needs to be available for direct access and complex queries.)
•Data Warehouses offer limited or no support for ad hoc external data sources, e.g. to drive selections based on entries in an Excel workbook or external database.
•Data Warehouses and report generators are typically insular, and offer limited, if any, ability to work with "external" databases to do cross queries, joins, etc. (A sketch of the kind of ad hoc cross-query I have in mind follows this list.)
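To make the last two points concrete, here is a rough sketch (Python/pandas; the connection string, table, and column names are made up, not any vendor's API) of the kind of ad hoc cross-query I mean: drive a warehouse selection from an analyst's Excel workbook and join the results back to it.
[code]
# Sketch only: hypothetical DSN, table, and column names.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@warehouse-host/dw")  # placeholder connection

# Ad hoc driver list maintained by an analyst, outside the warehouse.
wanted = pd.read_excel("accounts_to_review.xlsx")   # one column: account_id

# Pull only the matching warehouse rows, then join them to the workbook data.
ids = ",".join(str(i) for i in wanted["account_id"])
dw_rows = pd.read_sql(
    f"SELECT account_id, balance, last_posting_date "
    f"FROM dw.account_summary WHERE account_id IN ({ids})",
    engine,
)
result = wanted.merge(dw_rows, on="account_id", how="left")
print(result)
[/code]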
[h1]ETL Flaws[/h1]
[h2]Extract:[/h2]
•Only a subset of the source data (what gets left out)?
•The selected source data must be reloaded periodically (over and over again)? (See the sketch after this list.)
o ...via slow FTP connections?
o ...in batch mode?
o ...as flat files?
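The usual mitigation offered for the "reload everything" complaint is an incremental extract keyed on a high-water mark. Here is a rough sketch (Python; the table, column, and connection names are hypothetical). Note that even this only picks up rows added since the last run; retroactive corrections posted below the watermark are still missed, which is exactly the snapshot problem I am complaining about.
[code]
# Sketch of an incremental (high-water-mark) extract, as opposed to
# re-shipping the full flat file every night. Names are hypothetical.
import json, pathlib
import pandas as pd
from sqlalchemy import create_engine

STATE = pathlib.Path("extract_state.json")
source = create_engine("oracle://user:pass@erp-host/prod")  # placeholder connection

last_seen = json.loads(STATE.read_text())["max_txn_id"] if STATE.exists() else 0

# Only pull transactions added since the previous run.
delta = pd.read_sql(
    "SELECT txn_id, account_id, amount, posted_at "
    f"FROM gl_transactions WHERE txn_id > {last_seen}",
    source,
)

if not delta.empty:
    delta.to_parquet(f"gl_delta_{last_seen}.parquet")  # hand off to the load step
    STATE.write_text(json.dumps({"max_txn_id": int(delta["txn_id"].max())}))
[/code]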
[h2]Transform:[/h2]
•Neglect or mask anomalies in the source data? (ETL should not "cleanse" the transaction system's (system of record's) actual data; instead it should work with the actual system-of-record data, with needed corrections and changes made at the source.)
•"Harmonizing" data from disparate systems (particularly decommissioned legacy systems) forces a "least common denominator" view of data available from each system and loses nonconforming, but important, data from each.
•Inventing and introducing "consistent element names" simply introduces yet another set of names, obscures the original underlying data source, and contributes little (or detracts from) the effort of establishing complete data documentation. (A focus on "data elements" exposes the "flat file" simplification, and ignores the significance of more complex data structures.)
•Although the easy "on the fly" transformations get done, the more difficult source data anomalies (which probably need to be resolved by hand, one at a time) are neglected.
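To make the "least common denominator" point concrete, here is a toy sketch (Python/pandas; the systems, fields, and values are invented): harmonizing two legacy feeds down to the columns they share silently discards the fields that only one of them carries.
[code]
# Toy illustration: keeping only the columns two (hypothetical) legacy systems
# have in common drops nonconforming but important fields from each.
import pandas as pd

legacy_a = pd.DataFrame({
    "cust_id":     [101, 102],
    "name":        ["Acme", "Globex"],
    "credit_hold": [True, False],        # only system A tracks this
})
legacy_b = pd.DataFrame({
    "cust_id":    [201, 202],
    "name":       ["Initech", "Umbrella"],
    "tax_region": ["EU", "US"],          # only system B tracks this
})

common = legacy_a.columns.intersection(legacy_b.columns)   # ['cust_id', 'name']
harmonized = pd.concat([legacy_a[common], legacy_b[common]], ignore_index=True)

print(harmonized)   # credit_hold and tax_region are gone from the "integrated" view
[/code]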
[h2]Load:[/h2]
•Enforcing constraints and data quality during the load means that records containing source system errors are rejected or cause the load to fail, creating inconsistent and conflicting representations of the data. (A toy sketch of this follows this list.)
•Because the load uses batch data, the warehouse data is immediately out of date (i.e. wrong!) versus the dynamically changing source data.
•Data needs to be based on actual (source) data, not cleaned-up, harmonized, edited, transformed data that serves the needs of the data warehouse at the expense of data accuracy and timeliness.
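Here is a toy sketch of the load-time rejection problem (plain Python; the "quality" rule and the figures are made up): rows that violate a warehouse constraint are quietly dropped, so the warehouse total no longer agrees with the system of record.
[code]
# Toy sketch: a naive load-time "data quality" rule silently rejects a
# legitimate source row, so warehouse and source totals diverge.
source_rows = [
    {"txn_id": 1, "amount": 125.00},
    {"txn_id": 2, "amount": -40.00},   # legitimate reversal in the source system
    {"txn_id": 3, "amount": 310.50},
]

def passes_warehouse_constraint(row):
    return row["amount"] >= 0          # made-up constraint

loaded   = [r for r in source_rows if passes_warehouse_constraint(r)]
rejected = [r for r in source_rows if not passes_warehouse_constraint(r)]

print("source total:    ", sum(r["amount"] for r in source_rows))  # 395.50
print("warehouse total: ", sum(r["amount"] for r in loaded))       # 435.50 -- conflicting answer
print("silently rejected:", rejected)
[/code]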
The alternative is to have direct access to the source data (dynamically linked, heterogeneous definitive data sources) for specific queries when needed. This approach (taking advantage of query processing at the linked source) would ensure up-to-date information, and substantially reduce communications, data transfer requirements, and total processing requirements. In other words, rather than attempting to periodically transfer all the answers, simply provide the answers to the questions that need to be answered when they are asked.
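A rough sketch of what I mean (Python/pandas; the connection strings, schemas, and column names are hypothetical): push the selective query to each live, authoritative source and combine only the small answer sets at the moment the question is asked.
[code]
# Sketch of querying the live, authoritative sources directly when the
# question is asked, rather than shipping all the data in advance.
import pandas as pd
from sqlalchemy import create_engine

billing = create_engine("postgresql://user:pass@billing-host/prod")  # placeholder
crm     = create_engine("mysql://user:pass@crm-host/prod")           # placeholder

# Each source does its own selection/aggregation; only small answers move.
overdue = pd.read_sql(
    "SELECT account_id, SUM(amount) AS overdue_amount "
    "FROM invoices WHERE due_date < CURRENT_DATE AND paid = 0 "
    "GROUP BY account_id", billing)

contacts = pd.read_sql(
    "SELECT account_id, account_manager, phone FROM accounts", crm)

# Combine the two answers locally for the question actually being asked.
answer = overdue.merge(contacts, on="account_id", how="inner")
print(answer.sort_values("overdue_amount", ascending=False).head(20))
[/code]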
[h1]Reports[/h1]
•A batch concept left over from the days of high-speed printers and multi-part green-bar paper.
•Manual "eyeball" review is inefficient and inaccurate, irrespective of the delivery mode (paper, web page, green screen, etc.).
•Reports are limited (typically) to parameterized selection of field contents and sort order. It is often difficult to make selections based on comparisons of multiple records within the same data set (self joins) or comparisons with casual external data sources (e.g. an Excel workbook or text data inputs). (A sketch of such a self-join appears after this list.)
•Often based on a "flat file" oversimplification, which does not allow complex joined queries based on dynamic links.
•Often not dynamic (limited filter, drill down, summary)
•Often limited to static attributes (column values) from loaded snapshot reports which do not reflect subsequent corrections, retroactive postings, etc. (Data needs to be based on actual (corrected) transactions, not historic snapshots of what the data was thought to be some time in the past.)
•A good insightful report, once acted upon, loses importance, and becomes useful only to monitor (inefficiently and inaccurately) the effectiveness of the corrections implemented. It is naïve to assume some "set of reports" can be defined to meet ongoing business requirements.
•Without extensive detailed documentation of how each report is created, there is a high risk of misinterpreting report data.
•Data analysis (not report) requirements are always new, different, and changing to meet new questions as they arise.
•Reports are only a capture of selected data; the raw data itself needs to be available for subsequent processing, matching, and querying. Exporting report data to Excel or CSV files simply adds unnecessary manual steps (which need to be repeated periodically to keep the data current) before the data is useful.
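To illustrate the self-join point above, here is a toy sketch (Python/pandas; the data and threshold are invented) of the kind of within-data-set comparison a parameterized report generator rarely supports: flag accounts whose balance fell sharply versus the prior period.
[code]
# Toy sketch of a within-data-set comparison (self join): compare each
# account's current balance to its own prior-period balance.
import pandas as pd

balances = pd.DataFrame({
    "account_id": [1, 1, 2, 2, 3, 3],
    "period":     ["2008-06", "2008-07"] * 3,
    "balance":    [1000, 950, 500, 120, 800, 810],
})

cur  = balances[balances["period"] == "2008-07"]
prev = balances[balances["period"] == "2008-06"]

joined  = cur.merge(prev, on="account_id", suffixes=("_cur", "_prev"))  # self join
flagged = joined[joined["balance_cur"] < 0.5 * joined["balance_prev"]]

print(flagged[["account_id", "balance_prev", "balance_cur"]])  # account 2 only
[/code]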
July 22, 2008 at 12:12 pm
I think you've tackled a lot here and I'll give you a few thoughts on the DW piece.
I think there are companies that build a well-designed data warehouse. The thing is that a DW is an evolving system: it continually loads from your disparate systems on a regular, likely daily, schedule, in a process that handles updates, new data, and deletes.
A DW should be the source for all data. Data should be cleansed and standardized before it's added/inserted, and then the DW should feed downstream systems for reporting and analysis, potentially updating source systems in a closed loop to fix errors.
I don't think there is a problem with people directly accessing a DW with complex queries, but that access shouldn't impact performance. If it does, perhaps a specific set of data should be moved to a mart or a cube for those requirements.
I'm not sure what vendors are doing in this space, but implementing a DW is a long-term, large buy-in from a company. Most that make that commitment find it worthwhile, but it is a significant investment in time and resources.