This is a continuation of string manipulation techniques. If you are interested, you can read the other articles,
though you do not need to read it before this one. These are mostly beginning
programming articles, but advanced T-SQL programmers may still find something useful here.
- Part 1 deals with SUBSTRING and how it can be used to extract some information
from a field of data
- Part 2 deals with CHARINDEX and how it can be used to extract some information
from a field of data when the data is delimited rather than stored in a particular format.
Introduction
Continuing on with taming strings.
In Part 1, I worked with phone numbers to find the area codes in phone
number data. This data was in a variety of formats and frankly,
quite a mess. Since we allowed users to enter data with very little
validation, we got all sorts of invalid data as well as typos.
Well, the time finally came for us to clean and validate as much phone
data as possible, and so I decided to write about my little adventure.
The Problem
Dealing with phone number data is more than a trivial exercise,
both from the technical as well a business viewpoints. If you
limit yourself to US phone numbers, then the problem is simplified,
but still requires some work. In today's Internet, 24x7 uptime requirements,
and international applications, I have seen this become less and less of
an option, so I plan on designing the db to hold all kinds of
information. Another article will tackle the database design side of storing
phone data; this article looks at getting all the data in a standard format.
We had initially designed a single phone number field on forms as a way
to handle all types of data in an ergonomic fashion. There was no
validation of the data being entered so as to make the data entry as
smooth as possible. The result of this was phone number data that looked
like this:
1-(520)-555-5821 (904)555-1877 2711 5555458 610-555-3723 000 6195557693 11111111111111111 (111) 555-0986 181-5554761 31 471 55505 31 22555599 1-800-555-9495 415 555 3851 +11455561022 1603555775 ++492055599971 407.555.4770 555-6609 202/555-2228 248-555-8300*8263 812-555-8444x312 9999999 1800555SOFT ext. 132 512/555-2000x2563 (800) 555-2588 XT 7 none 800-555-0963 X90454
As you can see, there are a variety of formats in which data was
entered (BTW, these are not real phone numbers, I scrambled quite
a few of the numbers and made the exchanges 555. If any of them actually
work, I apologize in advance). Users have entered parenthesis, dashes, periods, +, etc.
as delimiters for various sections of the phone numbers. The spacing,
abbreviations, and formats vary from company to company (much to the
chagrin of developers) and cry out for some standardization.
However, forcing standardization during the data entry annoys and angers (
and loses) users.
My company
finally came up with a standard data entry solution with some validation
of data, but I needed to then standardize the existing data.
The technical solution to storing the data was to standardize the
formatting and then include parsing routines in the application
logic to handle the display and data entry. We decided a single
field will contain the user entered data as numerics only with all
other characters stripped out with one exception. For users that
have a phone extension, we would include an "x" between the phone
number and the extension. By standardizing on this format, all developers
can expect a certain format and apply simpler parsing rules to
format the data for display.
What String Manipulation?
Here is a problem that needs to be solved, what does this have to do
with string manipulation? After all, shouldn't the validation be occuring
in the presentation or business layers? Couldn't I use a rule or user defined
datatype to standardize the system?
Patience, patience, all will be clear soon.
In order to clean this data, I will be using string manipulation to
reformat all the existing phone data in the standard format. Regardless of future validation
and schema changes, I would still have to get the existing data into a standard
format to ensure the application works.
Now it would be nice to write one SQL statement that would rip
through the data and reformat all the data as expected. A couple
problems with this; one, it would be difficult to explain in an
article. Two, it would probably take more time to decipher than
most of you would want to spend. Three, I can't come up with one.
Since reason three will tend to override reasons one and two in anything
I write, here is what I came up with.
Since I try to avoid row processing which is slow and the kind of thing I
usually chastise junior DBAs for implementing, I decided to
look for patterns which would enable me to develop an algorithm to
clean up this data. I made notes as I conducted the analysis so that I could
present readers with some insight into how I go about solving a problem or
developing a solution.
Analysis
The first thing I do when solving most problems is look for a pattern.
After all, computers are very good at performing repetitive actions
and SQL is especially well suited to working with a batch of data.
In my examination of the data, I found that most data was basically in one
of a few formats:
999 999-9999 (999) 999-9999 999-999-9999
and the extensions tended to look like:
x999 x.999 ext 999 ext.999
The rest of the data is pretty close to these formats with some
variations (more or less spaces, +xx for international numbers, some
people entered slashes ("/") or periods (".") or some other
character, but since I may end up handling these individually or
as "one-offs" (things that do not fit a pattern or are an exception),
I will ignore these for now.
The format that I am trying to standardize all data into is the following:
9999999999x9999
where all non-numeric characters and spaces are removed. Actually
spaces are not important for the parsing routines we employ, so
I will ignore embedded spaces, but I think you will be able
to adapt my solution easily to remove them.
Back to the analysis...
There are multiple places in the field where each non-numeric
character can occur. I could write a series of substring
updates (like in Part 1), but this
would be quite a bit of work, lots and lots of statements, and probably not of
interest to any of you. As I examine the data, what I really want
to do is remove all the dashes ( or other characters) from the field.
So how can we remove the parenthesis, dashes, and periods from the
phone numbers? One way that I think should work is a simple CONVERT or
CAST of the field to an integer value. If I run the following,
select cast( '(555) 555-5555' as int)
I would expect to get:
5555555555
but instead, I get:
Server: Msg 245, Level 16, State 1, Line 1
Syntax error converting the varchar value '(555) 555-55555' to a column
of data type int.
Why this occurs, I am not sure (if you are on the SQL Server team,
let me know), but it does not. SUBSTRING is not a great option, so I decided
to search BOL a little and see what my options were.
The Solution
The first thing I saw in BOL which looked promising was PATINDEX
which lets you find the location of a particular string within another
string. This along with CHARINDEX (see Part 2)
is useful for finding the location
of a pattern within a string. I suppose I could have used this and then written a
bunch of substrings that would concatenate the before and after portions of the string,
but this did not seem very efficient either.
It was around this time that one of our senior developers was doing some data cleaning
and sent me a script to execute on the production database. This script contained a series
of SQL statements in a batch that ran REPLACE to change the wording in a number of rows in
a particular table. The idea then came to me that I could use this function to quickly remove all
the non-numeric characters in my data!
The Code
The REPLACE function is very powerful and is used to do exactly what you would expect it to do.
It searches a string for some other string that you supply (much like CHARINDEX and PATINDEX), but
replaces the second string with a third string that you supply. So if you ran the following code:
declare @a char( 30) select @a = 'Bob and Jim played catch.' select @a, replace( @a, 'Jim', 'Jeff')
You get
--------------------------- --------------------------- Bob and Jim played catch. Bob and Jeff played catch.
As you can see, the first string, @a, is searched for any occurence of
the second string, "Jim". If an occurence is found, then it is replaced with
the third string, "Jeff".
To apply this to my phone number problem, I decided to write a series of statements
that would replace the characters I wanted removed. The three obvious characters I
wanted to remove were the dashes, paranthesis, and periods. Actually these are four characters
and I ended up with the following script:
update UserInfo set phone = replace( phone, '-', '') update UserInfo set phone = replace( phone, '(', '') update UserInfo set phone = replace( phone, ')', '') update UserInfo set phone = replace( phone, '.', '')
One thing that many people do not know about many functions (at least I am surprised
how many people do not know) is that you can often use an empty string wherever a character
parameter is called for. In this case, substituting an empty string as the replacement character
allows me to remove offending characters.
Conclusions
There are more possible solutions for solving this problem, but some are better
left to another article. I hope this has helped some of you understand how to manipulate
and clean data a little better as well as sparked some ideas for even better
solutions. My solution worked well and was set based so it was somewhat efficient as well.
For those of you interested in further information about REPLACE, here is the URL
online reference for REPLACE
As always, I welcome feedback and please rate this article below (and any you read on Swynk).
It helps to motivate and assist us authors in writing better columns.
Steve Jones
December 2000