Splitting a non-uniform concatenated address column into 2 different columns

Question

Splitting a non-uniform concatenated address column into 2 different columns

Passionate SQL Server Developer

SSC Veteran

Points: 287
More actions
January 13, 2015 at 1:46 pm

#313630

Hello,
I have a very interesting problem in T-SQL coding for which I can't figure out the solution. Actually there is a Line_1_Address column in our data warehouse address table which is being populated from various sources. Some sources have already concatenated house number + street address fields in the Line_1_Address column whereas one source has separated columns for both data fields.
Now I'm trying to extract data from this data warehouse table and I need to split the house number from street address and load it into separate columns in my destination table. In case there is no data for house number then I should load it as NULL.
The issue is that data in this Line_1_Address column is very inconsistent so I don't know which functions to use. If anyone can suggest me a possible solution then that would be great. here is some sample data for your consideration:
Line_1_Address
101 E Commerce ST
120 E Commerce ST
2 Po Box
301 W. Bel Air Ave
West Main Street, PO Box 1388
116 N Louisville ST
903 Po Box
2950 W Market ST
300 N. Cleveland-Massillon Road
820 Main Street
521 1st ST N
2301 Dawson RD
611 Pointe North BLVD
173 North 2nd Street
410 Hillabee ST
115 North St. Asaph Street
115 North St. Asaph Street
115 North St. Asaph Street
115 North St. Asaph Street, PO Box 26248
1800 N. Beauregard Street
1800 N. Beauregard Street, #100
1800 N. Beauregard Street, Suite 1800
211 N. Union Street, Suite 154
228 South Washington Street, Suite 200
228 South Washington Street, Suite 200
5500 Cherokee Avenue, Suite 300
South Peyton Street
1605 N. Cedar Crest Blvd. Suite 410
2015 Hamilton ST STE 205
2222 South 12th Street
2895 Hamilton Blvd, #201
2895 Hamilton Boulevard
3570 Hamilton Blvd Ste 302
5000 Tilghman Street, Suite 325
5910 Hamilton Blvd
Hamilton Boulevard, Suite 105, The Atriu
3100 Royal BLVD S
3655 North Point PKWY STE 425
3375 Lynnwood Drive
108 Main St S
1831 E Three Notch ST
1807 N. Boulevard
1924 McConell Springs Road, PO Box 710
2315 N Main Street Suite 100
2315 N. Main Street, Suite 100
2315 N. Main Street, Suite 100
2315 N. Main Street, Suite 100, PO Box 2
N Main Street Suite 100, PO Box 2307
4200 Evergreen Lane
1610 West Street, Suite 105

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Sigerson Hall of Fame Points: 3697 More actions · Answer 1

First of all, good luck. You have some really filthy data to deal with.

Read up on PATINDEX. It may give you some ideas about parsing these strings. It searches for one string within another and returns an integer of the first occurrence. It can take wild card arguments which makes it useful.

Here's an example that will let you find out if your address begins with digits or not:

Declare @a varchar(30)

Set @a = '2950 W Market ST'

PRINT PATINDEX('%[^0-9]%',@a)

Result = 5, which is the first character that is NOT a digit.

[0-9] searches for a match of any digit. [^0-9] searches fpr a match on the first NON-digit.

More examples for PATINDEX in BOL

Good Luck,

Sigerson

"No pressure, no diamonds." - Thomas Carlyle

ManicStar SSCoach Points: 17992 More actions · Answer 2

You're going to have to do some sophisticated regular expression parsing to get about 90% of the way there. You probably won't be 100% successful at splitting all the records up correctly.

That being said, I would focus on the bulk, and plan on reporting on the residue. SQL supports some regex (patindex, like etc) processing, but its not going to support the more sophisticated expressions one could use for this purpose.

Passionate SQL Server Developer SSC Veteran Points: 287 More actions · Answer 3

Thanks for your replies guy. That's what I thought too. These are just 50 records of millions of records as the addresses are for different policy holders. The problem is in come cases house number fields are 123a, 123a-112, 123, a123 etc

So I don't have any way to differentiate between whats write and whats wrong in the code 🙁

Sigerson Hall of Fame Points: 3697 More actions · Answer 4

You're going to have to do it in stages, build some temp tables, and start plowing through it. As you get individual rows sorted, place them into a holding table. Also, I'd put an integer primary key on the temp tables to join As so you can easily separate what's been sorted and what hasn't.

For the '123a' or '123-A' scenarios, once you know the position of the 1st non-digit, then look to see if the first character BEFORE or AFTER it is a space.

Sigerson

"No pressure, no diamonds." - Thomas Carlyle

Jeff Moden SSC Guru Points: 1004701 More actions · Answer 5

You could spend an awful lot of money trying to develop this and still make some terrible mistakes that you might not ever find.

My recommendation is to simply buy some CASS certification software. It might be a couple or even several hundred dollars but that's nothing if you divide that by how much the company is paying you by the hour and how much time you're going to spend on writing something homegrown.

The CASS certification program will also do something that your homegrown solution cannot. It well play your list of addresses against actual address ranges and tell you if an address is invalid even if it's in perfect format.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Passionate SQL Server Developer SSC Veteran Points: 287 More actions · Answer 6

Thanks for your suggestions guys. I think it doesn't make any sense for me to spend that much time on this issue since it will take too much time. I will ask my modeler to look for a work around.