remove spaces more than double

Question

Post reply

remove spaces more than double

raym

SSCrazy

Points: 2213
More actions
July 14, 2008 at 9:03 am

#381655

IF OBJECT_ID('tempdb..#test') IS NOT NULL
DROP TABLE #test
CREATE TABLE #test (Descr Varchar(100))
GO
INSERT #TEST VALUES ('GAME WEST STR BAN CA')
INSERT #TEST VALUES ('S*AME WEST STREET 160BAN CA')
INSERT #TEST VALUES ('L.J. ARMSTRONG SEL HUNGWE')
INSERT #TEST VALUES ('S*L.J. ARMSTRONG HUNGWE CA')
INSERT #TEST VALUES ('LEYS BUTCHERY HUNGWE CA')
INSERT #TEST VALUES ('LEYS BUTCHERY HUNGWE CA')
INSERT #TEST VALUES ('MAK STON HUNGWE CA')
INSERT #TEST VALUES ('MAKRO STON HUNGWE CA')
INSERT #TEST VALUES ('CA_GP_SR_DRA SB45052 CA')
INSERT #TEST VALUES ('S*PNP MID MID CA')
INSERT #TEST VALUES ('CA_RZ_SR_TANIA UI8456 CA')
select * from #test
I HAVE TRIED TO REPLACE THE DOUBLE SPACES USING THE REPLACE FUNCTION BUT THE SPACES ARE NOT CONSISTENT
HOW CAN I REPLACE ALL DOUBLE AND MORE THAN DOUBLE AND LEAVE ONLY ONE SPACE ?

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

Do you want to update the table, or do you want to remove the spaces in a query?

If you want to update the table, here's one way:

set nocount on

update #test

set descr = replace(descr, ' ', ' ') -- 2 spaces to 1 space

where descr like '% %'

while @@rowcount > 0

update #test

set descr = replace(descr, ' ', ' ') -- 2 spaces to 1 space

where descr like '% %'

That will loop through till all are replaced. 10 iterations of the While loop will get rid of 1024 spaces in a row, since it works on powers of two. If you need to run more than that, copy the code, make the initial string in the replace commands be 5 spaces, and run that. Then run for 2 spaces. Or make the first run be 10 spaces. But 2 should do it.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Amol Patankar Grasshopper Points: 21 More actions · Answer 2

Here is a funtion that would help you clean out this list...

SET QUOTED_IDENTIFIER OFF

GO

SET ANSI_NULLS OFF

GO

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[fn_CleanWhiteSpace]') and xtype in (N'FN', N'IF', N'TF'))

drop function [dbo].[fn_CleanWhiteSpace]

GO

CREATE FUNCTION dbo.fn_CleanWhiteSpace

(

@string varchar(max)

)

RETURNS

varchar(max)

AS

BEGIN

DECLARE @StringComponents TABLE

(

String varchar(max)

)

declare @i int

declare @temp varchar(max)

if rtrim(isnull(@string, '')) = ''

set @string = '*'

if @string <> '*'

set @i = 1

set @temp = ''

while (@i <= len(@string))

begin

if substring(@string, @i, 1) = ' '

begin

if ltrim(rtrim(@temp)) <> ''

insert into @StringComponents(String)

values (ltrim(rtrim(@temp)))

set @temp = ''

end

else

set @temp = @temp + substring(@string, @i, 1)

if @i = len(@string)

if ltrim(rtrim(@temp)) <> ''

insert into @StringComponents(String)

values (ltrim(rtrim(@temp)))

set @i = @i + 1

end

DECLARE @cleanString varchar(max)

select @cleanString = isnull(@cleanString + ' ','') + q.String from

(

select String FROM @StringComponents

) q

RETURN @cleanString

END

GO

SET QUOTED_IDENTIFIER OFF

GO

SET ANSI_NULLS ON

GO

You can now just use this function to get your cleaned set:

IF OBJECT_ID('tempdb..#test') IS NOT NULL

DROP TABLE #test

CREATE TABLE #test (Descr Varchar(100))

GO

INSERT #TEST VALUES ('GAME WEST STR BAN CA')

INSERT #TEST VALUES ('S*AME WEST STREET 160BAN CA')

INSERT #TEST VALUES ('L.J. ARMSTRONG SEL HUNGWE')

INSERT #TEST VALUES ('S*L.J. ARMSTRONG HUNGWE CA')

INSERT #TEST VALUES ('LEYS BUTCHERY HUNGWE CA')

INSERT #TEST VALUES ('MAK STON HUNGWE CA')

INSERT #TEST VALUES ('MAKRO STON HUNGWE CA')

INSERT #TEST VALUES ('CA_GP_SR_DRA SB45052 CA')

INSERT #TEST VALUES ('S*PNP MID MID CA')

INSERT #TEST VALUES ('CA_RZ_SR_TANIA UI8456 CA')

select dbo.fn_CleanWhiteSpace(Descr) from #test

RyanRandall SSChampion Points: 13623 More actions · Answer 3

Another technique which can be useful in some circumstances...

update #TEST set Descr = replace(replace(replace(Descr, ' ', '¬ '), ' ¬', ''), '¬', '') where Descr like '% %'

(Where ¬ is a character not in your table. You can use another character if your table contains this one. If you can't be sure of a particular character not being in there, you can't use this method. You also need to be wary of 'having room' for that first replace since it will make the string longer.)

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.

raym SSCrazy Points: 2213 More actions · Answer 4

raym

SSCrazy

Points: 2213

July 15, 2008 at 3:39 am

#842900

Thank you...It works

sgmunson SSC Guru Points: 110640 More actions · Answer 5

There is a potential pitfall here, folks: Any time there are exactly X spaces between words, where X is greater than 2, you have a problem with any technique OTHER than replacing X spaces with one, because if there are exactly X spaces between a given pair of words, a replace with nothing will merge the words. Also, with X > 2, you may not catch all the extra spaces. Keep that in mind for any replacement technique.

I much prefer the shortest amount of code to do this, and the idea behind a repeated loop to replace all occurrences of two spaces with one until there aren't any more such occurrences. It's efficient, fairly fast, and it won't merge words.

Steve

:):):)

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

Jeff Moden SSC Guru Points: 1004436 More actions · Answer 6

RyanRandall (7/14/2008)
Another technique which can be useful in some circumstances...
update #TEST set Descr = replace(replace(replace(Descr, ' ', '¬ '), ' ¬', ''), '¬', '') where Descr like '% %'
(Where ¬ is a character not in your table. You can use another character if your table contains this one. If you can't be sure of a particular character not being in there, you can't use this method. You also need to be wary of 'having room' for that first replace since it will make the string longer.)

I've seen you do this before and I gotta say it again... VERY clever!

You could use an "invisible" character like CHAR(1) for this... most character columns never have such a thing.

I see your warning about "room" but have you tried it with some text that approaches 8000 characters to see if the expansion of single spaces to two spaces causes a problem? Just curious...

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

RyanRandall SSChampion Points: 13623 More actions · Answer 7

Jeff Moden (7/16/2008)
RyanRandall (7/14/2008)
Another technique which can be useful in some circumstances...
update #TEST set Descr = replace(replace(replace(Descr, ' ', '¬ '), ' ¬', ''), '¬', '') where Descr like '% %'
(Where ¬ is a character not in your table. You can use another character if your table contains this one. If you can't be sure of a particular character not being in there, you can't use this method. You also need to be wary of 'having room' for that first replace since it will make the string longer.)
I've seen you do this before and I gotta say it again... VERY clever!
You could use an "invisible" character like CHAR(1) for this... most character columns never have such a thing.

Thanks Jeff 🙂

I stole the idea from here: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=56195

All the best ideas are stolen 😀

Of course, you could always use our old friend the tally table to find a suitable character...

declare @x char(1)

select @x = char(min(N)) from (

select N from dbo.Tally cross join #TEST where N < 255

group by N having max(charindex(char(N), Descr)) = 0) a

update #TEST set Descr = replace(replace(replace(Descr, ' ', @x + ' '), ' ' + @x, ''), @x, '') where Descr like '% %'

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.

sgmunson SSC Guru Points: 110640 More actions · Answer 8

Holy Alphabet, Batman !!!

Ryan, that's one of the slickest routines I've ever seen... Neato, peachy keen, cool, etc., etc. ...

WELL DONE !!!

Steve

(aka smunson)

:):):)

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

RyanRandall SSChampion Points: 13623 More actions · Answer 9

smunson (7/16/2008)
Holy Alphabet, Batman !!!
Ryan, that's one of the slickest routines I've ever seen... Neato, peachy keen, cool, etc., etc. ...
WELL DONE !!!
Steve
(aka smunson)
:):):)

Hehe - you've made my day 🙂

Thanks Steve!

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.

GSquared SSC Guru Points: 260824 More actions · Answer 10

Here's why I suggested what I did, with taking larger chunks first, then smaller.

create table #T (

String1 varchar(1000),

String2 varchar(1000),

String3 varchar(1000))

insert into #t(string1)

select 'A' + replicate(' ', (abs(checksum(newid()))%100)+1)

+ 'B' + replicate(' ', (abs(checksum(newid()))%100)+1)

+ 'C' + replicate(' ', (abs(checksum(newid()))%100)+1)

+ 'D' + replicate(' ', (abs(checksum(newid()))%100)+1)

+ 'E' + replicate(' ', (abs(checksum(newid()))%100)+1)

+ 'F'

from dbo.numbers n1

cross join dbo.numbers n2

where n1.number between 1 and 1000

and n2.number between 1 and 1000

update #t

set string2 = string1, string3 = string1

set nocount on

print 'Test1'

set statistics time on

update #t

set string1 = replace(string1, ' ', ' ') -- 2 spaces to 1 space

where string1 like '% %'

while @@rowcount > 0

update #t

set string1 = replace(string1, ' ', ' ') -- 2 spaces to 1 space

where string1 like '% %'

print replicate('=', 25) + 'Test2'

update #T set string2 = replace(replace(replace(string2, ' ', '¬ '), ' ¬', ''), '¬', '') where string2 like '% %'

print replicate('=', 25) + 'Test3'

update #t

set string1 = replace(string3, replicate(' ', 50), ' ') -- 50 spaces to 1 space

where string1 like '% %'

update #t

set string1 = replace(string3, replicate(' ', 25), ' ') -- 25 spaces to 1 space

where string1 like '% %'

update #t

set string1 = replace(string3, replicate(' ', 12), ' ') -- 12 spaces to 1 space

where string1 like '% %'

update #t

set string1 = replace(string3, replicate(' ', 6), ' ') -- 6 spaces to 1 space

where string1 like '% %'

update #t

set string1 = replace(string3, ' ', ' ') -- 2 spaces to 1 space

where string1 like '% %'

while @@rowcount > 0

update #t

set string1 = replace(string3, ' ', ' ') -- 2 spaces to 1 space

where string1 like '% %'

Test1, which uses a simple loop to consolidate 2 spaces at a time into 1 space, took 304,352 milliseconds total time to execute.

Test2, using the nested chain replace, took 98,880 total (much better)

Test3, using staged replaces and a final loop, took 2,491 total.

After that, I realized there could be another test.

(Same table, same data generation, new data though, so not entirely scientific, but probably close enough.)

update #T

set string1 =

replace(

replace(string1, replicate(' ', 50), '¬ '),

replicate(' ', 25), '¬ '),

replicate(' ', 12), '¬ '),

replicate(' ', 6), '¬ '),

replicate(' ', 2), '¬ '),

' ¬', ''),

'¬', '')

Total time 43,694 milliseconds. Faster than the simpler nested replace, but it doesn't actually work, because it left a certain number of double-spaces. (Someone might be able to rewrite it to get the final job done, but at the speed difference between that and the staged updates, I'm not going to bother.)

The staged updates is almost 40 times faster in this test than the nested replaces. The nested replaces looks more clever, but that speed difference, I'll go with the less elegant solution.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

GSquared SSC Guru Points: 260824 More actions · Answer 11

smunson (7/16/2008)
There is a potential pitfall here, folks: Any time there are exactly X spaces between words, where X is greater than 2, you have a problem with any technique OTHER than replacing X spaces with one, because if there are exactly X spaces between a given pair of words, a replace with nothing will merge the words. Also, with X > 2, you may not catch all the extra spaces. Keep that in mind for any replacement technique.
I much prefer the shortest amount of code to do this, and the idea behind a repeated loop to replace all occurrences of two spaces with one until there aren't any more such occurrences. It's efficient, fairly fast, and it won't merge words.
Steve
:):):)

Since I just tested the other solution on 1-million rows of 5 random lengths of multi-space strings each, I can say that the clever solution proposed in this thread did not produce any wrong results. No double-spaces, and all the expected single-spaces were intact. So, I wouldn't worry about that in this case.

The clever solution just isn't fast enough for me. It is just fine on accuracy.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

sgmunson SSC Guru Points: 110640 More actions · Answer 12

GSquared,

The reason I wasn't all that worried about performance is because this kind of fixing is generally a one-time event, and even if it took an hour, the overall net result of accuracy FAR outweighs performance for one-time fixes of this nature. Also, I have yet to come across millions of rows of that kind of data problem. Does anyone know of any truly large DB's with this kind of poorly entered info that has persisted for any length of time? Failing same, I'll go for accuracy first, performance 2nd - at least in this case anyway... It just seems to me to be a rather small price to pay, and without accuracy, one could be left with the same problem and just have to adopt the more accurate solution later anyway. Just my two cents...

Steve

(aka smunson)

:):):)

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

RyanRandall SSChampion Points: 13623 More actions · Answer 13

As usual, different methods will be better in different situations. They're all tools for the toolbox 🙂

GSquared, your last idea also cropped up in 2002...

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=21511

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(colname,

REPLICATE(' ',21),' '),

REPLICATE(' ', 6),' '),

REPLICATE(' ', 3),' '),

REPLICATE(' ', 2),' '),

REPLICATE(' ', 2),' ')

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.

GSquared SSC Guru Points: 260824 More actions · Answer 14

smunson (7/16/2008)
GSquared,
The reason I wasn't all that worried about performance is because this kind of fixing is generally a one-time event, and even if it took an hour, the overall net result of accuracy FAR outweighs performance for one-time fixes of this nature. Also, I have yet to come across millions of rows of that kind of data problem. Does anyone know of any truly large DB's with this kind of poorly entered info that has persisted for any length of time? Failing same, I'll go for accuracy first, performance 2nd - at least in this case anyway... It just seems to me to be a rather small price to pay, and without accuracy, one could be left with the same problem and just have to adopt the more accurate solution later anyway. Just my two cents...
Steve
(aka smunson)
:):):)

Since all listed solutions are completely accurate, and one is much faster than the others, what's wrong with picking the faster one?

Assuming it's only a few rows, with no more than 10 spaces to be replaced per row, the easiest solution is the simple while loop, the most elegant is the nested replaces, and you say you're not worried about the fastest.

Being willing to allow a 1-hour table-lock in a production database, I would shy away from that idea. Sure, it could be done overnight, if you don't have significant traffic at that time, but if that's also when you're running backups, warehousing data to analysis cubes, updating table statistics, etc., I'd still much prefer a 2-second table lock to a 1-hour table lock.

It's not a question of one solution being more accurate than another. They all produce perfect results (except my last attempt, which I already mentioned I abandoned because of that, and which was 2nd place in performance even so).

On the question of large databases with dirty data, it doesn't matter if it's persisted for any length of time. What if you have to load data like that every night from a bulk load? You'd still need a solution to the problem.

As I've mentioned many times before, I'll take the highest speed code that creates the desired results, every time, over lower speed code, even with the same results. After all, why not?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon