This is a continuation of string manipulation techniques. If you are interested, you can read the other articles, though you do not need to read it before this one. These are mostly beginning programming articles, but advanced T-SQL programmers may still find something useful here.
- Part 1 deals with SUBSTRING and how it can be used to extract some information from a field of data.
Introduction
It's time to talk about string manipulation. What is it? Why do it? When should we do it? All good questions and some of the answers are in Part 1. This article will work on some advanced string manipulation techniques that I used to solve another problem. If you are new, I write about specific problems, not general ones, in general :), so I hope you enjoy this and learn something.
My Problem
I hope that few of you have run into this, but given the experiences I have had, it's quite likely more than a few of you have had this problem. In one table, the developers had decided that they needed to store answers to questions asked on the website. Some of the questions that were asked contained multiple possible answers. An example of this type of questions is shown below:
1. What types of computers do you have in your company? A. Servers B. Workstations C. Laptops D. Mainframes
From this you can see that the user may choose to select 1, 2, 3, or all 4 of the answers. Not a problem, but the developers decided that the easiest way to store this information was in a table like this:
UserID Answer ------ ----------- 1 A,B,C 2 B,C 3 A,C,D 4 D
So why is this a problem? Well, how would you find all the people that answered "A"? This one is easy, you would select all users like this:
  select UserID, answer from Answer where Answer like '%A%'
No problem. Well, what about all people that answered "B" and "C"? This one is tougher; you could try
   select UserID, answer from Answer where Answer like '%B%' and '%C%'
It does not work. I want to get this result set:
UserID Answer ------ ------- 1 A,B,C 2 B,C 3 A,C,D
What you end up with is a result that looks like this:
UserID Answer ------ ------- 1 A,B,C
In order to get the result set I want, what I would have to do is union multiple queries. This really does not scale and as the number of answers grows ( our biggest had 12 answers) the queries become increasingly complex.
The Solution
The solution in this case was to "bend" that data so that it is in a more normal form. What I mean by this is to take the table above and change it to look like this:
UserID Answer ------ ------ 1 A 1 B 1 C 2 B 2 C 3 A 3 C 3 D 4 D
With the data in this form, we can more easily build a scalable and self-maintaining solution that will grow and handle new questions and answers with minimal work from the development staff and (more importantly) the DBA.
The Code
The objects involved are given below. Since this was corporate code, I have altered names and schemas slightly to protect my company. Other than name changes and some missing fields for other data, this is the same code in our database.
- Answer - Table that holds the original data
- NewAnswer - Table that holds the new data
- dbspBendAnswers - stored procedure that "bends" the data in the Answer table.
In order to extract each answer out and place it in a separate row, I decided to row process this data one row at a time rather than develop a set oriented solution. I normally frown upon this type of code, but this needed to be done quickly and was a one-time maintenance operation with a piece of code that I would have thrown away (if I was not writing this article).
To process this data, I built a cursor of the primary key from this table which was the userID field and the questionID field. I then looped through this cursor and processed each row's answer data. The processing of each row was where the string manipulation came in.
Each row can be formatted as a pattern that looks like this:
"x,y,z,...,n"
Where x is a single answer and the number of answers in this field is unknown. However, each field can also be written in the following manner.
"x" + ",y" + ",z" + ",..." + ",n"
In this format, an algorithm starts to become apparent. If I can delimit each answer by looking at the commas, then what I need to do is parse this string by finding the location of the commas. Once I do this, then the SUBSTRING function can easily be used to extract the answers
that are between the commas. If there is no comma in the string, then the assumption is that this is a single answer, so we insert the entire string as a new answer and loop to the next row of data.
In my haste to build this, I decided that I would use the charindex function to find commas and then process the string based on the commas. charindex allows you to find the numeric location in a string of another string. In other words, it returns the position in any string of
some character that you are looking for (in my case, commas). So if I had the following string:
'A,B,C'
I could run the following code:
select charindex( ',', 'A,B,C')
and get these results:
------ 2
Where 2 is the position (2nd character) in the string of the comma. One thing to be aware of is that only the first occurance of the string is found. So now I have the location of the first comma. If I include this in a SUBSTRING statement that starts
at the beginning of the string, I can extract the first answer.
answer ------ A,B,C select @c = charindex(',', answer) select @a = substring( answer, 1, @comma - 1) select @a ------- A
Now I have a single answer, how does this help me get the rest of the answers? In my algorithm I have stord the location of the first comma off. What I now do is remove both the first answer and the first comma from the string. This would result in this code and result:
select @str = substring( @str, @comma + 1, len( @str)) select @str ------ B,C
Now I repeat the process and find the next comma. Again, if I get a zero (0) back for the location of the comma, I know I have a single answer left, insert it and end the loop. I check with an IF statement if there is a comma since passing a -1 to the SUBSTRING function returns an error.
Conclusion
This is probably not the most elegant or efficient solution, but it does work. And it was very quick to develop. In fact, it took three or four times as long to write this article as it did to develop the algorithm, test it, and execute it on my data. I do think, however, that this technique will be useful in some situations and is a worthwhile addition to the DBA toolbox. I find many times when I need string manipulation, especially in adhoc queries I write. Like any other tool in SQL Server, it has a time and a place where it is helpful and effective in its application. Spend some time experimenting and learning about the CHARINDEX() function. It will probably come in handy at some point.
Here is the online reference for CHARINDEX for those of you interested.
As always, I welcome feedback and comments and hope that you learned something from this article.
Steve Jones
©2000 dkranch