Tame Those Strings Part 7 - Be Proper With Your Data
This is a continuation of string manipulation techniques. If you are interested, you can read the other articles,
though you do not need to read it before this one. These are mostly beginning
programming articles, but advanced T-SQL programmers may still find something useful here.
- Part 1 deals with SUBSTRING and how it can be used to extract some information
from a field of data
- Part 2 deals with CHARINDEX and how it can be used to extract some information
from a field of data when the data is delimited rather than stored in a particular format.
- Part 3 deals with REPLACE and how it can be used to remove unwanted information
from a field of data when the data is not in a known format.
- Part 4 deals with numeric conversions.
- Part 5 deals with STUFF.
- Part 6 works with concatenating results.
Introduction
I recently saw someone ask a question about how to proper case a field of data. I thought
this would be a simple update statement, but when I actually tried it, I realized there was
a bit more work involved. In the instance that I saw, the person wanted to handle a field
that contained names. The solution was a parsing solution that used cursors to find a space
before each name and then use substrings to reformat the string. This works fine in a procedural
process, but doesn't take advantage of the power of SQL Server. I decided to develop a more set
oriented approach that should handle large result sets.
The Problem
The basic problem is to turn one string into another, with the second string being formatted
in proper case. An example would be:
Source Result ------------ --------------- steve jones Steve Jones tia jones Tia Jones delaney steven jones Delaney Steven Jones kyle Kyle
Nothing special here, but notice that there are a few different cases with differing numbers
of words in each field. This is important because a simple solution that depended on all names
being two words would fail. This was what I ran into when building my solution.
Setup
I am including my script file here:ProperCase.sql. This is
the file I used to develop this solution. If you want to practice yourself, this little script
will build you a test file:
if object_id('ProperTest') Is not null drop table ProperTest go create table ProperTest ( OldName varchar( 80), NewName varchar( 80) ) go insert ProperTest select lower(contactname), lower( contactname) from Customers insert ProperTest select 'my old friend bob', 'my old friend bob' insert ProperTest select ' my old friend bill', 'my old friend bill' -- reset values -- update propertest set newname = oldname select * from ProperTest
This script will build a test table and populate it using data from the Northwind database.
Work with the Newname field for your manipulations. The last line in the script "fixes" the
Newname data to its original form.
The Solution
My first attempt at reformatting data resulted in a set-oriented solution that worked great
for two word names, but failed after that. Here is the first thing I quickly came up with:
update ProperTest set NewName = upper( substring( NewName, 1, 1)) + substring( NewName, 2, charindex( ' ', NewName)-1) + upper( substring( NewName, charindex( ' ', Newname) + 1, 1)) + substring( NewName, charindex( ' ', Newname) + 2, 80)
This works great if there are only two names in the field, but most likely your data
will be like this data and there are some names with 3 words such as "José Pedro freyre".
Note that this is the result using my first solution.
My next attempt looked at somehow looping through the result set, but how many times would I
know how to loop? This was a tough one, but I finally came up with a solution. I would
add a "flag" to my data. This flag would be used to find the beginning of each word. Since
the basic pattern that I am looking for is
When building a set-oriented solution, you have to search for common patterns among all items in
the set. When you build a WHERE clause, you often specify some common pattern such as LIKE 'SQL%'.
I decide to use the same approach. To do this, you have to choose a special character in whatever
code page or language you are using that will not occur in your data. For my purposes, I choose the
'@' symbol. I could have easily chosen &, ~ or !.
So my first step was to set my flag, which I did by replacing the spaces with '@'.
update ProperTest set NewName = replace( NewName, ' ', '@')
This sets all my data to look like this:
Newname ------------- maria@anders ana@trujillo antonio@moreno josé@pedro@freyre
Now that I have flags, I can use these to mark the position in each string that I am trying
to process.
The basic process for each string is now:
- to find the '@' symbol
- set the next character to upper case
- remove the '@' symbol since this word has been reformatted
Now in order to keep processing until all words are reformatted, I merely need to check
for the existence of rows that contain a '@'. All my processing is then limited to these rows.
The script to perform this processing is below:
-- Replace spaces with the "special" character update ProperTest set NewName = replace( NewName, ' ', '@') -- Handle case 1 - First item update ProperTest set NewName = upper( substring( ltrim( NewName), 1, 1)) + substring( ltrim( NewName), 2, 80) -- loop while there are rows with the flag while exists( select * from ProperTest where NewName like '%@%' ) begin -- Proper case the word after the flag. update ProperTest set NewName = substring( NewName, 1, charindex( '@', NewName)) + upper( substring( NewName, charindex( '@', NewName)+1, 1 )) + substring(NewName, charindex( '@', Newname)+2, 80) where Newname like '%@%' -- Remove the first flag encountered in each row update ProperTest set NewName = substring( NewName, 1, charindex( '@', Newname)-1) + ' ' + substring( NewName, charindex( '@', Newname) + 1, 80) where Newname like '%@%' end
Conclusions
This was an interesting exercise for me and forced my to spend some time integrating a
simple algorithm into a set-oriented approach. With a little adaptation, this could be easily
adapted to proper case sentences (look for period-space) or any other patterning structure.
I hope this has proven useful for some of you and as always, I welcome feedback. Please click the
"Your Opinion" link below and add your comments.
Steve Jones
©dkRanch.net June 2001