This post is part of a series on this blog that will help me, and hopefully you, pass exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012. So far, we’ve covered:
- Tables and Schemas in the data warehouse
- Dimensions and Slowly Changing Dimensions
- Fact tables and measures
- Intro to columnstore indexes
- Columnstore indexes and partitioning
- Introduction to SQL Server Data Tools (SSDT)
- Connection Managers
- Control Flow Tasks and Containers
- Data Flow Source Adapters
- Data Flow Destination Adapters
- Data Flow Transformations
- Variables and Parameters
- SSIS Expressions
- Transactions and Checkpoints
- Logging in SSIS 2012
- Deploying projects in SSIS 2012
- Executing Packages in SSIS 2012
Today we turn our attention away from SSIS and begin focusing on another big part of building a data warehouse: master data management (MDM).
What is Master Data?
According to Wikipedia, master data “represents the business objects which are agreed on and shared across the enterprise.” I’m not sure how helpful that is. But let’s say you have a relational database like AdventureWorks. AdventureWorks can be described in sentences. “Customer 101 ordered a bicycle on September 21″, “A red bicycle was delivered to customer 101 on September 25″, “Salesperson 5 sold $2m in helmets in 2014″, etc. The verbs in those sentences represent transactional data, e.g. sales, deliveries. The nouns, however, represent master data, e.g. customers, sales reps, and products. In general, master data represents people, things, concepts, and places.
Managing Master Data
With that said, not all data that can be considered master data necessarily needs to be managed as master data. If you only have 5 customers, for example, you would be unlikely to treat customers as master data, simply because there’s no real need. It’s pretty easy to manage 5 elements, and the overhead of managing a master-data management system would outweigh the benefits. Similarly, simple entities with few attributes are also less likely to be managed as master data. The higher the value of an entity, the more likely it is to be managed. And the more volatile data is, the more likely it will be managed as master data. Entities that don’t change over time, or don’t change often, are easier to maintain correctly without a master-data solution. But probably the most important factor in determining whether data needs to be managed as master data is reuse. Data that is used across multiple systems should almost always be managed as master data to ensure every system is getting an accurate and consistent picture of the data.
Creating Master Data
In a perfect, and simplistic, world, we would only ever have one source of our data and that data would be accurate and consistent. Unfortunately, you and I live in the real world. Therefore, there are two basic actions we need to take when building out our master data. First, we need to clean and standardize our data. This means normalizing data formats, replacing missing values with defaults or lookup values, standardizing values to use the same currency or system of measurement, and parsing/mapping values to their correct attributes. This process can be accomplished using a specific master-data management tool, or you can do it using SQL or SSIS.
Once the data has been cleaned, it needs to be merged from the individual sources into a single list. This is the tricky part, because it means matching data in order to consolidate duplicates. An incorrect match can lead to lost data. A missed match leaves duplicate records and reduces the value of maintaining the master data in the first place. Because matching from disparate sources is rarely straightforward, this is where a specialized tool comes in handy. These tools can match on multiple attributes, often using complex algorithms, to determine a degree of confidence in a match. The degree of confidence is a measure of how sure the tool is that these two records are the same entity. The more attribute matches, the higher the degree of confidence. You, as the user, can configure the tool so that a confidence level of 95% or higher is automatically merged, whereas a confidence level of 80% to 94% must be manually inspected first. How you configure these confidence levels will depend on the consequences of a false match. If those consequences are significant, you might raise the auto-merge threshold. The downside of that is more manual work on your part to inspect possible matches.
Maintaining Master Data
Once you’ve created your master data, where do you store it? How do you maintain it going forward? And how will you make it available to consumers? The first option is to have a single copy of the master data that all other systems use. All master data maintenance is performed directly on this copy. While this ensures everyone sees the same data, it also means that existing applications must be modified to access this data, which may not be possible or practical. Another option is to maintain the master data in a central location, but push this data out to the individual source systems. In this topology, attributes that are considered part of the master data are modified in the master data repository only, and those changes are replicated to the source systems. Any attributes specific to the source system, but not part of the master data can be modified locally. This may mean fewer changes to the source applications, but they’ll probably still need to be modified to prevent updates to master data attributes. A third option is to allow updates to master data to occur in the source systems, and those changes are replicated to the master data repository, where they are merged and then pushed out to the other source systems. This requires few, if any, changes to the source applications, but it’s more difficult to maintain quality, consistent data, which is the whole point of implementing a master data management solution.
Additional Resources
That’s a quick primer on master data as a concept. Next time we’ll look at Microsoft’s MDM tool offering, Master Data Services. For more information on master data, check out the following resource: