March 10, 2008 at 10:20 pm
Hello All,
I've been charged with cleaning up 2-3 tables which contain approx 1 millions records each.
The problem I'm experiencing is, some of the companies (an important column) aren't listed in the same way. For example, I might have 'AT&T', 'A.T.&T.', 'AT & T', 'A T & T', etc...
In the above case, I'd like for them all to be 'AT&T', I'm looking for the best way to proceed.
Thanks in advance,
Stephen
March 11, 2008 at 6:42 am
Stephen Lee (3/10/2008)
Hello All,I've been charged with cleaning up 2-3 tables which contain approx 1 millions records each.
The problem I'm experiencing is, some of the companies (an important column) aren't listed in the same way. For example, I might have 'AT&T', 'A.T.&T.', 'AT & T', 'A T & T', etc...
In the above case, I'd like for them all to be 'AT&T', I'm looking for the best way to proceed.
1. you have to write a function that normalizes such strings-removes double space, dots, spaces around &,...
2. update the column(s) with this function, where you end up with lots of duplicates.
3. update all linked columns identified by foreign keys where you choose one of duplicates
4. delete the duplicates
5. add a clientside and serverside (trigger) logic to prevent such entries.
Also, you should have a unique natural key, if possible.
Some entities have no natural key, sometimes you may not change the crappy database model.
March 11, 2008 at 6:51 am
Bloody hard and tedious work awaits.
You have to update all the data and find the different variances for the company names. Worst possible problem to have. Follow the steps outlined above because you need to do everything you can to prevent more bad info from being introduced to the system while you go through cleaning it up.
One possibility would be to add a table that has the correct names along with an FK to your existing table, nullable. Then rewrite the queries so that they go for the lookup against the second table first, and pull from the junk in your existing table second, if no value has been added for a company name. Your insert & update code would have to be modified too.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 11, 2008 at 7:34 am
Grant is correct... great pains await you especially if you try to take any shortcuts. Whatever you do, make sure you have a viable backup table somewhere. You will get many entries wrong unless you are very, very careful... and it's just not possible to be that careful where this task is concerned... you will make at least one error. You should try your code on a copy of the table first and get 'bout a thousand other eyes to look at the "before and after".
--Jeff Moden
Change is inevitable... Change for the better is not.
March 11, 2008 at 7:45 am
Not just that. Such data comes from poorly designed database model, client app and bad user habits.
A user is inputting an invoice. Does not use an available lookup to pick a company, but inserts new, types "AT&T", which is rejected "duplicate name", so corrects it to "A.T.&T.". If the only check is by name, you end up with trash data.
So, you must educate users, add natural key validation and some sanity check in client.
March 11, 2008 at 12:42 pm
For a onetime cleanup operation the Fuzzy Lookup or Grouping transformations in SSIS would help out; Developer and Enterprise Edition only. The attached screen-shots show how both the Lookup and Grouping worked against your given input sample, including an example of where it could go wrong.
Adjusting the thresholds and checking the scores that are provided can help refine the 'correctness' of the results.
At the very least it gives you a grouping of Company names to start working with.
References:
Finding Similar Data Using SQL Server Integration Services[/url]
Using Fuzzy Lookup Transformations in SQL Server Integration Services
Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply