September 14, 2011 at 6:06 am
I am trying to write some t-sql query to cleanse a free text address field.
Some of sample addresses that i get are:
Address
--------
1/110 Smith Street, Melbourne VIC,Australia, 3333
Flat1 111 Main Street, Ipswich, QLD, Australia, 4444
Flat 5, 10 Central Street, Paramatta, NSW Australia, 9999
8 Ocean Road, Geelong, VIC, 3334
I want to parse the free text field into proper address fields as:
Unit, Street Number, Street Name, City, State, Country, PostCode
Any help will be appreciated. Note that the address is not a comma separated values. It is a free text field and can have any combinations such as spaces, /, tab, etc...
September 14, 2011 at 6:41 am
What you're talking about is pretty much impossible to do accurately.
For example, to get the country out of that string, you'd need to already have a list of every country and every possible permutation that someone could write their country down as: -
DECLARE @table AS TABLE (addressField VARCHAR(200))
INSERT INTO @table
SELECT '1/110 Smith Street, Melbourne VIC,Australia, 3333'
UNION ALL SELECT 'Flat1 111 Main Street, Ipswich, QLD, Australia, 4444'
UNION ALL SELECT 'Flat 5, 10 Central Street, Paramatta, NSW Australia, 9999'
UNION ALL SELECT '8 Ocean Road, Geelong, VIC, 3334'
SELECT addressField,
CASE WHEN MAX(PATINDEX('%'+pat.[countries]+'%',addressField)) < 1
THEN NULL
ELSE SUBSTRING(addressField,MAX(PATINDEX('%'+pat.[countries]+'%',addressField)),LEN(addressField))
END AS countryFirstScrub
FROM @table a
--You would need every single valid country listed here
CROSS JOIN (VALUES ('Afghanistan'),('Albania'),('Algeria'),('Andorra'),
('Angola'),('Antigua & Deps'),('Antigua and Deps'),
('Antigua'),('Argentina'),('Armenia'),('Australia'),
('Austria'),('Azerbaijan'),('Bahamas'),('Bahrain'),
('Bangladesh'),('Barbados'),('Belarus'),('Belgium')) AS pat([countries])
GROUP BY addressField
You'd then need to somehow manage to account for people that have street names that are the same as country names, or street names that contain country names. All in all, it's a nightmare of epic proportions.
e.g.
17 Peru Road, Sydney, New South Wales 8927
With the eye, it's reasonably simple to know that the above address is in Australia. But the code I supplied above would find "Wales" as the last PATINDEX and so display the country as Wales.
--EDIT--
September 16, 2011 at 1:17 am
If you're using this for commercial purposes, I'd recommend using a professional address cleansing service as the increased accuracy should pay for itself.
Many have API's that you can send data to in bulk, but others have software packages you can install locally.
They use combinations of fuzzy logic and address registers to transform data and send back clean, verified postal addresses in a standard format. It's a complicated process to build yourself and it's one of those areas where a specialist service is almost always more effective, unless you have clear control of the source of the data and can do a postal address lookup at the point of entry.
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply