Data Scrubbing

  • I regularly have to import text files into a table. One problem that I have run into is the Name column containing the full name. I need to spilt the column out into three columns (LastName, FirstName, MI). These are some of the formats that I run into: John J Doe | Doe,John,J | Doe John J | I am in need of a SP or T-Sql statement that can separate this one column out into three columns.

  • There are several scripts on this site that let you split a string into pieces. Because you have different delimiters I would look for a function where you specify which delimiter to search for.

    However, you still have to deal with correct assignment to your columns.

    Any chance to get a better input format?

    Frank

    http://www.insidesql.de

    http://www.familienzirkus.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • The files that I receive are usually in rough shape. The most common way the files I receive have a full name in the first column i.e. Doe,John J I need to parse this out to three columns (LastName, FirstName, MI).

  • Is this a one-time action or do you need to do this on a regular basis?

    I thought of utilizing something like this

    http://www.sqlservercentral.com/scripts/contributions/225.asp

    or that

    http://www.sqlservercentral.com/scripts/contributions/592.asp

    Frank

    http://www.insidesql.de

    http://www.familienzirkus.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • This is an on going project. I take comma delimited files and import the data into tables. This has been a big problem separating this field out. I was thinking that a DTS script would work while I am importing the file in. What do you think?

  • I think DTS s*cks...but it would be an good option, I guess.

    Maybe you should start this thread again in the DTS section.

    Frank

    http://www.insidesql.de

    http://www.familienzirkus.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • This would convert the data specified and I know the data may be more complex. Whether you do the conversion pre, post or during load depends on available tools and the complexity. maybe this will get you going.

    CREATE TABLE #temp (name varchar(50),Lastname varchar(10), FirstName varchar(10),MI varchar(10)) 
    
    INSERT INTO #temp (name) VALUES ('John J Doe')
    INSERT INTO #temp (name) VALUES ('Doe,John,J')
    INSERT INTO #temp (name) VALUES ('Doe John J')

    UPDATE #temp SET name=REPLACE(REPLACE(name,' ','.'),',','.')

    UPDATE #temp SET Lastname=PARSENAME(name,3),
    FirstName=PARSENAME(name,1),
    MI=PARSENAME(name,2)

    UPDATE #temp SET FirstName=LastName,LastName=FirstName WHERE LEN(MI)=1
    UPDATE #temp SET FirstName=MI,MI=FirstName WHERE LEN(FirstName)=1

    SELECT * FROM #temp

    DROP TABLE #temp

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Thanks Frank, David for the suggestions!

  • You could use a regular expression for doing this. Theres an extended SP that allows you to use regexps in SQL Server 2000

    http://www.codeproject.com/managedcpp/xpregex.asp

  • I have been dealing with this common EDI issue for many years now.

    Although there are many different formats that the name can be formatted in, there are also many unique circumstances you will have to program for.

    Here are the unique issues to be concerned with

    - No Middle Initial; You will have to nail down a format standard that accounts for names of people that don't have a middle name for some reason.

    - What I like to call the "Legacy Name" indicator(Jr., Sr., III, etc.); Again, you will have to nail down a format standard to account for those people who have this in their name.

    - Split names; Some people have first and/or last names that are split. A particularly brutal example would be Full Name: "Anand Sid Dur Aramad"(name changed slightly to protect the innocent) - First name was Anand Sid, Last name was Dur Aramad AND this person had no Middle Initial. We resolved this situation by asking that the format have a hyphen included in places where the name is split.

    - Data entry variation;This is a go/no-go problem. If the data source was populated over time by many data entry users that used data entry rules for the names that varied from person to person and in some cases changed from time to time, then you may never be able to parse out the names because no solid standard was used or is available.

    If the name is Bill James, and the name was entered James Bill by one person and Bill James by another, no computer I know of has the ability to determine if James is definitively a first name or a last name.

    Computers are great at logic, but not very good at guesswork or social knowledge. Trust me, I have seen this very problem result from poorly designed systems that didn't impose a standard on the entry of name data.

    Quite often, we worked with the data provider to identify the exact standard they use for the full name in the data they provide us. Of course, many people that provide this information have their hands tied in regard to format due to the technology of the datasource or the actual format of the data in the datasource. It is sometimes very difficult for them to make changes or customize their export because their skill level in programming or knowledge about their datasource is also limited. It is critical to nail down a standard that the data provider absolutely must stick to. Any variation in data format will nullify any work you put into the import and parsing logic. They should not be allowed to use fixed width text one week and then decide to use excel the next week for the same data extract. Any new formats in the data will mean more of your man hours developing the parsing code needed to import the data.

    Generally, I approached each problem in a logical manner.

    1) Get the entire trimmed full name into a field in a table of some kind first. It will be easier and generally faster to have SQL Server do the parsing work for you. Especially with larger files. DTS would work as a parsing tool, but it always seemed much slower in that capacity. Rule> Do Not modify the data in this field other than to trim of end spaces - keep it original so you have a way to quickly review the source of any errors during your development or troublshooting activities. You can delete or drop the data later in your processing after you have parsed everything out, but I am a fan of having a import troubleshooting data trail.

    2) Develop the SQL code to parse out the first name and last name with as close to 100% accuracy as possible. These are usually the easiest to extract because they are usually more than one alpha character in length. Work on each name part at a time to keep it simple if necessary.

    3) Work on the SQL logic to extract the middle initial. If a period was put next to the middile initial, you have a good delimiter to work with. Otherwise, you may have to assume that a single alpha character is always going to be your middle initial.

    4)If you need to extract the "legacy name", try to work out logic that would work with the format provided to get as high a percentage of accuracy as possible.

    5)For those situations where there may be split names, it may take a command decision by the data owners as to how they want to deal with this situation. There are many ways to approach this problem that are reasonable. Fortunately, this kind of problem is rare. Just be aware of it and try to account for it if you can.

    In some cases, I have had to create a temporary table in the parsing script to help "chop up" the full name.

    I would parse out the first name and then remove it from my "scratch pad" name. I would then trim the scratch pad name to eliminate any spaces on the ends.

    Then I would do the same for the last name.

    Then I would analyze what is left and define logic code to handle the remains(middle initial, legacy name, other)

    Of course, sometimes it was necessary to extract the middle initial first, then look for legacy name data.

    There is no magical SQL code that would handle all name format circumstances. Get very familiar with the following functions:

    ltrim()

    rtrim()

    replace()

    len()

    substring()

    left()

    right()

    charindex()

    I hope this advice helps

    - Al

    Funny thing about people and their computers...

    The computer is completing millions of processes every second, but the user still thinks it is too slow.


    "I will not be taken alive!" - S. Hussein

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply