Prologue
This article was originally published on November 16th, 2009. As with anything else, improvements can be made to code and the subject of this article is no exception. Although this article is still an interesting read (I left everything after the "Introduction" heading as it originally was), I no longer recommend the method in this article because it has been soundly beaten (more than 6 times faster) for performance by another T-SQL method as demonstrated by Michael Meierruth. If you prefer to "cut to the chase", that method can be found in a post in the discussion for this article at the following URL:
http://www.sqlservercentral.com/Forums/FindPost821209.aspx
The discussion that followed this article is also fascinating and a large number of people took part in some rather wonderful testing. Some folks even took the time to create and post some CLRs to solve the same problem. All in all, the discussion makes for an incredible learning experience which is typical of the amazing community of professional people we've all grown to know and love here at SSC. I learn something new here everyday.
One of the things that came out of the discussion is the fact that the default collation can make a huge impact on performance. The the following link to the post in the discussion where Paul White demonstrates that not-so-little nuance:
http://www.sqlservercentral.com/Forums/FindPost821565.aspx
Be you Neophyte or "Ninja", thanks for being a part of this community, folks.
--Jeff Moden
Introduction
Replacing multiple spaces with a single space is an old problem. If you Google the problem, you find that most folks still resort to While Loops in functions or maybe even a Tally table or (ugh!) XML in a function to solve this seemingly complex problem. The truth is that you don't need the RBAR of a User Defined Function at all.
This article explains the "puzzle solving" behind a common set based method to replace unknown numbers of adjacent spaces with a single space by establishing and replacing simple patterns in a set based fashion.
The Problem
You have a column of data that looks something like the following...
--===== Create and populate a test table. -- This is NOT a part of the solution. DECLARE @Demo TABLE(OriginalString VARCHAR(8000)) INSERT INTO @Demo (OriginalString) SELECT ' This has multiple unknown spaces in it. ' UNION ALL SELECT 'So does this!' UNION ALL SELECT 'As does this' UNION ALL SELECT 'This, that, and the other thing.' UNION ALL SELECT 'This needs no repair.'
The goal is to convert the sections of multiple spaces of unknown length to a single space each as quickly as possible. There are some additional requirements. You can't make the data any bigger during the process because it might get too big for VARCHAR(8000) (or whatever size the column is) or you can't use VARCHAR(MAX) because you're using SQL Server 2000.
The Method Explained
I'll use the letter "O" to act as a "visible" space so you can see what I'm talking about... consider the following sets of spaces, please....
O
OO
OOO
OOOO
OOOOO
OOOOOO
OOOOOOO
OOOOOOOO
Remember the goal is to convert all of those sets of spaces to just a single space without the use of RBAR even if the RBAR is through the simple use of a UDF. I'll also state that the goal is to remove the extra spaces without making the original string any larger in the process because it may already be as large as it can be for the given datatype.
So... STEP 1 is to identify pairs of spaces. This is done by modifying the second space in each pair of spaces to be an "unlikely" character. In this case, I'll use the visible character of "X" (which isn't unlikely but serves this visual example) to represent a space that has been changed in a pair of spaces. When we replace all pairs of space "OO" with "OX", we get the following
O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX
STEP 2 is to replace all occurrences of "XO" with NOTHING...
O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX
... and that leaves us with ...
O
OX
O
OX
O
OX
O
OX
STEP 3 is to replace "X" with NOTHING...
O
OX
O
OX
O
OX
O
OX
... and that leaves us with just singles spaces everywhere...
O
O
O
O
O
O
O
O
Again... the "O"s represent unmodified spaces and the "X"s represent spaces changed to some "unlikely character" like a special non printable, almost non type-able character like ASCII 7 (the "Bell" character). It's just a matter of 3 nested REPLACE functions to handle ANY number of spaces to accomplish the puzzle we solved above. It can all be done in a single set-based query without loops or even UDF's.
Be careful which "unlikely character" you pick, though. We'll talk more about that when we get to the "Unlikely Characters and Collation" section of this article further below.
The Code
Ok... now that you know how it works, here's the code that accomplishes the 3 steps as 3 nested REPLACE's. I've included the test table I previously covered in "The Problem" section of this article just to make things easy to run. Notice that I've also added an LTRIM/RTRIM to take out any leading or trailing spaces, as well...
--===== Create and populate a test table. -- This is NOT a part of the solution. DECLARE @Demo TABLE(OriginalString VARCHAR(8000)) INSERT INTO @Demo (OriginalString) SELECT ' This has multiple unknown spaces in it. ' UNION ALL SELECT 'So does this!' UNION ALL SELECT 'As does this' UNION ALL SELECT 'This, that, and the other thing.' UNION ALL SELECT 'This needs no repair.' --===== Reduce each group of multiple spaces to a single space -- for a whole table without functions, loops, or other -- forms of slow RBAR. In the following example, CHAR(7) -- is the "unlikely" character that "X" was used for in -- the explanation. SELECT REPLACE( REPLACE( REPLACE( LTRIM(RTRIM(OriginalString)) ,' ',' '+CHAR(7)) --Changes 2 spaces to the OX model ,CHAR(7)+' ','') --Changes the XO model to nothing ,CHAR(7),'') AS CleanString --Changes the remaining X's to nothing FROM @Demo WHERE CHARINDEX(' ',OriginalString) > 0
CleanString |
---|
This has multiple unknown spaces in it. |
So does this! |
As does this |
This, that, and the other thing. |
Unlikely Characters and Collation
Just a quick note on "unlikely characters". You do have to be really careful about what you select as an "unlikely character" for the "X" of the "OX" model we previously discussed. For example, if you have the not-so-uncommon collation of Latin1_General_CI_AI on a column and you've chosen the "unlikely character" of Thorn (þ), you could end up deleting a whole lot more than you bargained for. According to Wikipedia, "þ" (the Thorn character) still survives as 30th character of the Icelandic alphabet and appears in other alphabets, as well. Further, in many languages, it has been replaced by the "th" digraph and, in certain collations like the Latin1_General_CI_AI collation, the "þ" character and "th" are treated as equals. For example...
--===== Create and populate a test table.
-- *** NOTICE THE COLLATION SETTING ON THE STRING COLUMN. ***
DECLARE @Demo TABLE(OriginalString VARCHAR(8000) COLLATE Latin1_General_CI_AI)
INSERT INTO @Demo (OriginalString)
SELECT ' This has multiple unknown spaces in it. ' UNION ALL
SELECT ' So does this! ' UNION ALL
SELECT ' As does this' UNION ALL
SELECT 'This, that, and the other thing.'
--===== This uses a "thorn" character as the "X" of the "OX" model
SELECT LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(OriginalString,' ',' þ'),'þ ',''),'þ','')
))
FROM @Demo
Because of the equivalent treatment of the Thorn (þ) character and the "th" digraph, you end up with a most undesirable result...
CleanString |
---|
is has multiple unknown spaces in it. |
So does is! |
As does is |
is, at, and e oer ing. |
... which is nothing like what we wanted.
There's also a danger in selecting the wrong "control character" (ASCII characters 0 through 31) as the "unlikely character". I chose CHAR(7) which is a very benign character in today's world of electronics. It was designed to literally ring the bell on old Tele-Type machines and is just passed through by today's electronics.
If you chose CHAR(0), you've just chosen the "NULL" character and anything that follows it's appearance will simply disappear. Using the current test table in this article, you'd get an output that looks like the following...
CleanString |
---|
This |
So |
As |
This,that,andtheotherthing. |
If you chose CHAR(1), you've just chosen the ASCII "Start of Header" character which is still in use today. When certain electronics see this character, it can cause some very strange behavior usually resulting in failure of your code.
Another good "unlikely character" is CHAR(8) which is the ASCII "Backspace" character. It's normally never included in any type of assembled text now adays. CHAR(127) also works well because it's a left over for "Delete" from the paper tape world. It actually punches all the holes in a paper tape (7 of them) to quite literally delete a character.
For more information on the ASCII "Control Characters", please see the following URL:
http://www.lammertbies.nl/comm/info/ascii-characters.html#cont
Conclusion
To borrow a phrase from R. Barry Young's series of articles on the subject, "There must be 15 ways to lose your cursors". Even the seemingly complex task of condensing multiple adjacent spaces to a single space can be done without a loop.
When you run into a problem where you're absolutely sure there's no way to do it without a loop, have another look. There's usually a high performance way to avoid the loop.
Thanks for listening, folks.
--Jeff Moden
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".