Cleaning Up a Table

  • I have a list of contacts approx 50,000 records. 2 Problems:

    1)Duplicates: Company name, etc... Ph # Fax # etc...

    Some of the Dups may be 'Wiley Company' AND 'The Wiley Co.' How do i run a delete command to get rid dup Companies.

    2) The Column w/ ph/fx numbers is a smallint, but i noticed after importing that obviously non-int's were imported into the column, like (405)322-2013. how can i clean up this column and convert the data to small ints? Also, want to get rid of duplicate ph/fx numbers at that point as well.

    any help would be greastly appreciated!

    f

  • 1) Do you have some sort of company ID. There's no way for SQL Server to tell if Wiley Company and The Wiley Co. are actually the same. (you'd have to be using semantic search or fuzzy lookup in SSIS).

    2) You can use the replace function to replace all non-integer characters with empty strings.

    For example:

    SELECT REPLACE(myColumn,'(','') FROM myTable

    An alternative could be to use Data Quality Services (introduced in SQL Server 2012) to clean-up the data.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • fgrubercpa (8/13/2013)


    I have a list of contacts approx 50,000 records. 2 Problems:

    1)Duplicates: Company name, etc... Ph # Fax # etc...

    Some of the Dups may be 'Wiley Company' AND 'The Wiley Co.' How do i run a delete command to get rid dup Companies.

    2) The Column w/ ph/fx numbers is a smallint, but i noticed after importing that obviously non-int's were imported into the column, like (405)322-2013. how can i clean up this column and convert the data to small ints? Also, want to get rid of duplicate ph/fx numbers at that point as well.

    any help would be greastly appreciated!

    f

    You will not be able to store a phone number like 4053222013 in a smallint column. The range of smallint values is -32,768 to 32,767 - even a 7-digit phone number would be far outside the range of the smallint datatype. In practice, you couldn't store many 10-digit phone numbers even in an int column - you'd have to use bigint.

    You should consider whether you want to store phone numbers as a numeric data type at all. There's never a need to perform calculations on them - adding a phone number to another phone number is nonsensical, for example. Typically, applications want to display phone numbers in formats that can't be represented in a numeric datatype, such as "(901) 555-1212", so every time they're retrieved, they'll have to be converted to a string data type. It often makes more sense to store them as varchar() datatypes, even though a 10-digit phone number as a varchar(10) requires a couple of bytes (plus the row overhead associated with the varchar() datatype) more than a bigint.

    Jason Wolfkill

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply