Blog Post

Data Quality Matching in SQL Server

,

Every IT person is more than introduced to the problems of integration – in the ideal perfect world we could join any 2 or more given data sources together, while in the real world it might the biggest pain in the neck.

To join different data we need to match 1 or more columns from the different data sets a lot of times one would need to convert one of the sources to some common datatype before trying to join it.

Some times there might be duplicates in our data, and it is happening much more often than one might think. Data Quality does not matter much, until someone starts looking at the reports and by then it is already quite late to take care of it. Having 2 clients with names that have just 1 letter of difference seems to be a problem of the 20th century, but in the reality it is one of the most common problems that exists since beginning of times.

There is a big need and necessity to deliver Data Quality solutions in each and every complex solution (and sometimes even in the smaller ones).

Matching a person named ‘James Bond’ with ‘Jammes Bond’ might be a tricky business if doing it with simple queries and on bigger datasets is simply impossible.

Everyone needs a good practical solution.

Of course we can match data by using = (equality) operator, or using LIKE operator but once facing misspellings those methods are not really up to the challenge.

In SQL Server 2008 R2 we had Master Data Services (MDS) introduced to help addressing some of the issues, but the biggest promise were in Data Quality Services (DQS) delivered in SQL Server 2012.

In Master Data Services we can obtain interesting results by comparing strings similarity, which adds Levenshtein distance plus a couple of others algorithms such as Jaro-Winkler distance and Longest subsequence.

Data Quality Services

DQS Matching

I will leave my full thoughts and feelings on the quality of the DQS until some later post. <img src=" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Yes, really I am working on a project that is using DQS. <img src=" class="wp-smiley" style="height: 1em; max-height: 1em;" />

In the current versions of DQS (SQL Server 2012 & 2014) you can build Knowledgebases and perform Data Cleansing & Data Matching, all while in the wonderful Interface of the DQS Client application.

Let us be honest – nobody wants to run every operation manually, we are IT people and we are looking to automate everything, and since the tool of our choice is SSIS (SQL Server Integration Services), and so Microsoft has provided us with a SSIS component for Cleansing operation.

But what about DQS Matching ???

Surely someone would incude this component into SSIS.

Well, it took a great work of a friend of mine (and my colleague at OH22) to develop SSIS DQS Matching Transformation plugin, which was done with some help & support from Microsoft.

There was an important blog post from the Microsoft DQS Team in 2013 announcing this DQS Matching plugin and Matt Mason gave a couple of presentations at some major events around the world, including PASS Summit giving this component a big promotion.

Still I feel like the vast majority of the people have not discovered it yet.

This plugin is far from being perfect, it is still in the very first release but give it a try and give some honest feedback, because I would love to see it improved and polished. I use it in a real world project and I can’t imagine how things would be solved without it.

Sources:

Data Quality White Paper

SSIS DQS Matching Transformation

Disclaimer: I am working at OH22 and so you might consider this as a plug for the company. You are right, I am plugging in a free source that a friend of mine has developed for everyone to use.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating