String Manipulations

  • I have a huge problem at hand....

    I have a lot of addresses in which I have to add a comma just before the city name

    Some example strings could be

    2001 Main St New York, TX 7777777

    175 Old Shopping Ctr Houston, TX 7777777

    84 King George Rd Brantford

    2 - 428 Gage Ave. New Orleans, ON

    So as you see there is no fixed pattern in the strings. Moreover there can be city names with 2 words for example New York so the comma has to fall just before "New".

    Have been thinkin a lot bout this. Any suugestions will be greatly appreciated.

  • Rather than search for the city name, what about searching for the state and then adding the comma before it?

  • This is pretty bad news! Is there any way that you can get a complete list of all city names into a separate SQL Server table?

    You could then traverse your address list and search each entry for one of the city names in the separate table. Then you do some string manipulation & you're there.

    Not elegant (but then again, neither is having a single 'address' field in a database)!

    Good luck.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • The problem is I cannot get a list of city names. I have to work with only the kind of strings that I mentioned above.

    I know its gonna be difficult, but lets see I am working on an algorithm for this. I dont know how much this will work but I am going to reverse the string and try isolating every word because the only thing consistent I have is that all the words are separated by spaces. And then maybe write a MILIION if statements to figure out if its a city or not.

    Coz if I reverse the string the first word can be a zip or a two char long state or the city i can identify the zip coz its a numeric value the state if present is always two chars but the only problem in this case would be city names with two words. eg.New York

    So if I can find a way to deal with such situations my problem is solved....

    I know this approach sucks but do I have a choice

    Any suggestions will be greatly appreciated.

  • Hello Dinesh,

    there is only one regular thing I could observe in your sample. The city is either last entry, or the rest is separated by a comma... while there are no other commas anywhere else in the string. If that's always so, you could strip away everything beginning with a comma from the string, and that would leave City as the last entry in all records. But how to search for cities like "New East Something", "Baile Atha Cliath" or "Koln am Rhein"... well, that's really hard to code without city register. Besides, there can be both Haven and New Haven; if you strip away just Haven, "New" will probably not be found as a city.

    I had a similar problem not long ago, when I had to delete from book titles various remarks done by users (ranging from "58%" to "VrK", "st" or "prod"). These were divided from the true title of the book by various number of spaces - from 0 to 10 -, several of them could appear in one title, and some of the strings could really form part of the title. Don't ask me why they did remarks directly into the name of book, it's beyond my comprehension. I simply had to clear it, as these frightful names would appear on web now. Luckily, the number of titles was not that high, some 15000 in all, and only a few hundred had such remarks. I solved the problem by finding the most often appearing and distinctive strings /about 30 of them/, then wrote several different update queries that would strip such texts away from a string only if it is at the end - and ran them again and again, to work through the titles that had several remarks, until I cleared almost 90% of the mess... well, and then I had to work through the rest step by step... changing entries manually, not using a query.

    Why am I writing this? Because I think that if you can make a table of the most often appearing cities, update the strings adding a comma in front of city names, and mark these as OK, you will be left with some considerably smaller portion of records to update. If it still is thousands or hundreds of thousands (I don't know what "many" in your post means"), then you have to think about another query(ies) that would help you. But anyway, it will be easier to find them if all New Yorks, Washingtons and Denvers are out of the way. Repeat the process again and again, until you go mad 🙂

    IMHO, no matter what you do, you will never be able to solve this problem 100%. The very last portion of addresses will have to be done manually, and there will be some addresses processed incorrectly. Make sure you have a copy of original version, so that you can find out what happened... and save it for a long time. Some of the errors may appear after several months.

    I wish you best of luck... you'll need it!

    Vladan

  • Hey,

    Well, one other thing your street address has the following suffixes shown above:

    St

    Ctr

    Rd

    Ave

    There is a list of these suffixes (I believe) on the US postal services web site.  If you search for everything between this suffix and the comma, you may be in luck.  However, that's going to be one big query.

    Brian

  • This is clearly a semi-manual process.  Consequently, I would not try to solve it in one-fell-swoop.  

    1. Use both VB and SQL to do portions of the job. 
    2. Work right to left.  
    3. Using VB to split the field into zipcode and remaining text - inserting these new fields into the table. 
    4. Use SQL to check the zipcode field using Select Distinct zipcode.
    5. Manually fix any errors.
    6. Use VB to split out state and remaining text - inserting state into a new field and update the remaining field.
    7. Use SQL to check the state field using Select Distinct state.
    8. Manually fix any errors using SQL to find errors.
    9. etc.

    -Karl

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply