Free Text Address Field Formatting

  • I am trying to write some t-sql query to cleanse a free text address field.

    Some of sample addresses that i get are:

    Address

    --------

    1/110 Smith Street, Melbourne VIC,Australia, 3333

    Flat1 111 Main Street, Ipswich, QLD, Australia, 4444

    Flat 5, 10 Central Street, Paramatta, NSW Australia, 9999

    8 Ocean Road, Geelong, VIC, 3334

    I want to parse the free text field into proper address fields as:

    Unit, Street Number, Street Name, City, State, Country, PostCode

    Any help will be appreciated. Note that the address is not a comma separated values. It is a free text field and can have any combinations such as spaces, /, tab, etc...

  • What you're talking about is pretty much impossible to do accurately.

    For example, to get the country out of that string, you'd need to already have a list of every country and every possible permutation that someone could write their country down as: -

    DECLARE @table AS TABLE (addressField VARCHAR(200))

    INSERT INTO @table

    SELECT '1/110 Smith Street, Melbourne VIC,Australia, 3333'

    UNION ALL SELECT 'Flat1 111 Main Street, Ipswich, QLD, Australia, 4444'

    UNION ALL SELECT 'Flat 5, 10 Central Street, Paramatta, NSW Australia, 9999'

    UNION ALL SELECT '8 Ocean Road, Geelong, VIC, 3334'

    SELECT addressField,

    CASE WHEN MAX(PATINDEX('%'+pat.[countries]+'%',addressField)) < 1

    THEN NULL

    ELSE SUBSTRING(addressField,MAX(PATINDEX('%'+pat.[countries]+'%',addressField)),LEN(addressField))

    END AS countryFirstScrub

    FROM @table a

    --You would need every single valid country listed here

    CROSS JOIN (VALUES ('Afghanistan'),('Albania'),('Algeria'),('Andorra'),

    ('Angola'),('Antigua & Deps'),('Antigua and Deps'),

    ('Antigua'),('Argentina'),('Armenia'),('Australia'),

    ('Austria'),('Azerbaijan'),('Bahamas'),('Bahrain'),

    ('Bangladesh'),('Barbados'),('Belarus'),('Belgium')) AS pat([countries])

    GROUP BY addressField

    You'd then need to somehow manage to account for people that have street names that are the same as country names, or street names that contain country names. All in all, it's a nightmare of epic proportions.

    e.g.

    17 Peru Road, Sydney, New South Wales 8927

    With the eye, it's reasonably simple to know that the above address is in Australia. But the code I supplied above would find "Wales" as the last PATINDEX and so display the country as Wales.

    --EDIT--

    This link[/url] might help.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • If you're using this for commercial purposes, I'd recommend using a professional address cleansing service as the increased accuracy should pay for itself.

    Many have API's that you can send data to in bulk, but others have software packages you can install locally.

    They use combinations of fuzzy logic and address registers to transform data and send back clean, verified postal addresses in a standard format. It's a complicated process to build yourself and it's one of those areas where a specialist service is almost always more effective, unless you have clear control of the source of the data and can do a postal address lookup at the point of entry.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply