Massive clean-up project

  • Hello All,

    I've been charged with cleaning up 2-3 tables which contain approx 1 millions records each.

    The problem I'm experiencing is, some of the companies (an important column) aren't listed in the same way. For example, I might have 'AT&T', 'A.T.&T.', 'AT & T', 'A T & T', etc...

    In the above case, I'd like for them all to be 'AT&T', I'm looking for the best way to proceed.

    Thanks in advance,

    Stephen

  • Stephen Lee (3/10/2008)


    Hello All,

    I've been charged with cleaning up 2-3 tables which contain approx 1 millions records each.

    The problem I'm experiencing is, some of the companies (an important column) aren't listed in the same way. For example, I might have 'AT&T', 'A.T.&T.', 'AT & T', 'A T & T', etc...

    In the above case, I'd like for them all to be 'AT&T', I'm looking for the best way to proceed.

    1. you have to write a function that normalizes such strings-removes double space, dots, spaces around &,...

    2. update the column(s) with this function, where you end up with lots of duplicates.

    3. update all linked columns identified by foreign keys where you choose one of duplicates

    4. delete the duplicates

    5. add a clientside and serverside (trigger) logic to prevent such entries.

    Also, you should have a unique natural key, if possible.

    Some entities have no natural key, sometimes you may not change the crappy database model.

  • Bloody hard and tedious work awaits.

    You have to update all the data and find the different variances for the company names. Worst possible problem to have. Follow the steps outlined above because you need to do everything you can to prevent more bad info from being introduced to the system while you go through cleaning it up.

    One possibility would be to add a table that has the correct names along with an FK to your existing table, nullable. Then rewrite the queries so that they go for the lookup against the second table first, and pull from the junk in your existing table second, if no value has been added for a company name. Your insert & update code would have to be modified too.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant is correct... great pains await you especially if you try to take any shortcuts. Whatever you do, make sure you have a viable backup table somewhere. You will get many entries wrong unless you are very, very careful... and it's just not possible to be that careful where this task is concerned... you will make at least one error. You should try your code on a copy of the table first and get 'bout a thousand other eyes to look at the "before and after".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Not just that. Such data comes from poorly designed database model, client app and bad user habits.

    A user is inputting an invoice. Does not use an available lookup to pick a company, but inserts new, types "AT&T", which is rejected "duplicate name", so corrects it to "A.T.&T.". If the only check is by name, you end up with trash data.

    So, you must educate users, add natural key validation and some sanity check in client.

  • For a onetime cleanup operation the Fuzzy Lookup or Grouping transformations in SSIS would help out; Developer and Enterprise Edition only. The attached screen-shots show how both the Lookup and Grouping worked against your given input sample, including an example of where it could go wrong.

    Adjusting the thresholds and checking the scores that are provided can help refine the 'correctness' of the results.

    At the very least it gives you a grouping of Company names to start working with.

    References:

    Finding Similar Data Using SQL Server Integration Services[/url]

    Using Fuzzy Lookup Transformations in SQL Server Integration Services

    Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply