remove spaces more than double

  • IF OBJECT_ID('tempdb..#test') IS NOT NULL

    DROP TABLE #test

    CREATE TABLE #test (Descr Varchar(100))

    GO

    INSERT #TEST VALUES ('GAME WEST STR BAN CA')

    INSERT #TEST VALUES ('S*AME WEST STREET 160BAN CA')

    INSERT #TEST VALUES ('L.J. ARMSTRONG SEL HUNGWE')

    INSERT #TEST VALUES ('S*L.J. ARMSTRONG HUNGWE CA')

    INSERT #TEST VALUES ('LEYS BUTCHERY HUNGWE CA')

    INSERT #TEST VALUES ('LEYS BUTCHERY HUNGWE CA')

    INSERT #TEST VALUES ('MAK STON HUNGWE CA')

    INSERT #TEST VALUES ('MAKRO STON HUNGWE CA')

    INSERT #TEST VALUES ('CA_GP_SR_DRA SB45052 CA')

    INSERT #TEST VALUES ('S*PNP MID MID CA')

    INSERT #TEST VALUES ('CA_RZ_SR_TANIA UI8456 CA')

    select * from #test

    I HAVE TRIED TO REPLACE THE DOUBLE SPACES USING THE REPLACE FUNCTION BUT THE SPACES ARE NOT CONSISTENT

    HOW CAN I REPLACE ALL DOUBLE AND MORE THAN DOUBLE AND LEAVE ONLY ONE SPACE ?

  • Do you want to update the table, or do you want to remove the spaces in a query?

    If you want to update the table, here's one way:

    set nocount on

    update #test

    set descr = replace(descr, ' ', ' ') -- 2 spaces to 1 space

    where descr like '% %'

    while @@rowcount > 0

    update #test

    set descr = replace(descr, ' ', ' ') -- 2 spaces to 1 space

    where descr like '% %'

    That will loop through till all are replaced. 10 iterations of the While loop will get rid of 1024 spaces in a row, since it works on powers of two. If you need to run more than that, copy the code, make the initial string in the replace commands be 5 spaces, and run that. Then run for 2 spaces. Or make the first run be 10 spaces. But 2 should do it.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Here is a funtion that would help you clean out this list...

    SET QUOTED_IDENTIFIER OFF

    GO

    SET ANSI_NULLS OFF

    GO

    if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[fn_CleanWhiteSpace]') and xtype in (N'FN', N'IF', N'TF'))

    drop function [dbo].[fn_CleanWhiteSpace]

    GO

    CREATE FUNCTION dbo.fn_CleanWhiteSpace

    (

    @string varchar(max)

    )

    RETURNS

    varchar(max)

    AS

    BEGIN

    DECLARE @StringComponents TABLE

    (

    String varchar(max)

    )

    declare @i int

    declare @temp varchar(max)

    if rtrim(isnull(@string, '')) = ''

    set @string = '*'

    if @string <> '*'

    set @i = 1

    set @temp = ''

    while (@i <= len(@string))

    begin

    if substring(@string, @i, 1) = ' '

    begin

    if ltrim(rtrim(@temp)) <> ''

    insert into @StringComponents(String)

    values (ltrim(rtrim(@temp)))

    set @temp = ''

    end

    else

    set @temp = @temp + substring(@string, @i, 1)

    if @i = len(@string)

    if ltrim(rtrim(@temp)) <> ''

    insert into @StringComponents(String)

    values (ltrim(rtrim(@temp)))

    set @i = @i + 1

    end

    DECLARE @cleanString varchar(max)

    select @cleanString = isnull(@cleanString + ' ','') + q.String from

    (

    select String FROM @StringComponents

    ) q

    RETURN @cleanString

    END

    GO

    SET QUOTED_IDENTIFIER OFF

    GO

    SET ANSI_NULLS ON

    GO

    You can now just use this function to get your cleaned set:

    IF OBJECT_ID('tempdb..#test') IS NOT NULL

    DROP TABLE #test

    CREATE TABLE #test (Descr Varchar(100))

    GO

    INSERT #TEST VALUES ('GAME WEST STR BAN CA')

    INSERT #TEST VALUES ('S*AME WEST STREET 160BAN CA')

    INSERT #TEST VALUES ('L.J. ARMSTRONG SEL HUNGWE')

    INSERT #TEST VALUES ('S*L.J. ARMSTRONG HUNGWE CA')

    INSERT #TEST VALUES ('LEYS BUTCHERY HUNGWE CA')

    INSERT #TEST VALUES ('LEYS BUTCHERY HUNGWE CA')

    INSERT #TEST VALUES ('MAK STON HUNGWE CA')

    INSERT #TEST VALUES ('MAKRO STON HUNGWE CA')

    INSERT #TEST VALUES ('CA_GP_SR_DRA SB45052 CA')

    INSERT #TEST VALUES ('S*PNP MID MID CA')

    INSERT #TEST VALUES ('CA_RZ_SR_TANIA UI8456 CA')

    select dbo.fn_CleanWhiteSpace(Descr) from #test

  • Another technique which can be useful in some circumstances...

    update #TEST set Descr = replace(replace(replace(Descr, ' ', '¬ '), ' ¬', ''), '¬', '') where Descr like '% %'

    (Where ¬ is a character not in your table. You can use another character if your table contains this one. If you can't be sure of a particular character not being in there, you can't use this method. You also need to be wary of 'having room' for that first replace since it will make the string longer.)

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Thank you...It works

  • There is a potential pitfall here, folks: Any time there are exactly X spaces between words, where X is greater than 2, you have a problem with any technique OTHER than replacing X spaces with one, because if there are exactly X spaces between a given pair of words, a replace with nothing will merge the words. Also, with X > 2, you may not catch all the extra spaces. Keep that in mind for any replacement technique.

    I much prefer the shortest amount of code to do this, and the idea behind a repeated loop to replace all occurrences of two spaces with one until there aren't any more such occurrences. It's efficient, fairly fast, and it won't merge words.

    Steve

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • RyanRandall (7/14/2008)


    Another technique which can be useful in some circumstances...

    update #TEST set Descr = replace(replace(replace(Descr, ' ', '¬ '), ' ¬', ''), '¬', '') where Descr like '% %'

    (Where ¬ is a character not in your table. You can use another character if your table contains this one. If you can't be sure of a particular character not being in there, you can't use this method. You also need to be wary of 'having room' for that first replace since it will make the string longer.)

    I've seen you do this before and I gotta say it again... VERY clever!

    You could use an "invisible" character like CHAR(1) for this... most character columns never have such a thing.

    I see your warning about "room" but have you tried it with some text that approaches 8000 characters to see if the expansion of single spaces to two spaces causes a problem? Just curious...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/16/2008)


    RyanRandall (7/14/2008)


    Another technique which can be useful in some circumstances...

    update #TEST set Descr = replace(replace(replace(Descr, ' ', '¬ '), ' ¬', ''), '¬', '') where Descr like '% %'

    (Where ¬ is a character not in your table. You can use another character if your table contains this one. If you can't be sure of a particular character not being in there, you can't use this method. You also need to be wary of 'having room' for that first replace since it will make the string longer.)

    I've seen you do this before and I gotta say it again... VERY clever!

    You could use an "invisible" character like CHAR(1) for this... most character columns never have such a thing.

    Thanks Jeff 🙂

    I stole the idea from here: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=56195

    All the best ideas are stolen 😀

    Of course, you could always use our old friend the tally table to find a suitable character...

    declare @x char(1)

    select @x = char(min(N)) from (

    select N from dbo.Tally cross join #TEST where N < 255

    group by N having max(charindex(char(N), Descr)) = 0) a

    update #TEST set Descr = replace(replace(replace(Descr, ' ', @x + ' '), ' ' + @x, ''), @x, '') where Descr like '% %'

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Holy Alphabet, Batman !!!

    Ryan, that's one of the slickest routines I've ever seen... Neato, peachy keen, cool, etc., etc. ...

    WELL DONE !!!

    Steve

    (aka smunson)

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • smunson (7/16/2008)


    Holy Alphabet, Batman !!!

    Ryan, that's one of the slickest routines I've ever seen... Neato, peachy keen, cool, etc., etc. ...

    WELL DONE !!!

    Steve

    (aka smunson)

    :):):)

    Hehe - you've made my day 🙂

    Thanks Steve!

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Here's why I suggested what I did, with taking larger chunks first, then smaller.

    create table #T (

    String1 varchar(1000),

    String2 varchar(1000),

    String3 varchar(1000))

    insert into #t(string1)

    select 'A' + replicate(' ', (abs(checksum(newid()))%100)+1)

    + 'B' + replicate(' ', (abs(checksum(newid()))%100)+1)

    + 'C' + replicate(' ', (abs(checksum(newid()))%100)+1)

    + 'D' + replicate(' ', (abs(checksum(newid()))%100)+1)

    + 'E' + replicate(' ', (abs(checksum(newid()))%100)+1)

    + 'F'

    from dbo.numbers n1

    cross join dbo.numbers n2

    where n1.number between 1 and 1000

    and n2.number between 1 and 1000

    update #t

    set string2 = string1, string3 = string1

    set nocount on

    print 'Test1'

    set statistics time on

    update #t

    set string1 = replace(string1, ' ', ' ') -- 2 spaces to 1 space

    where string1 like '% %'

    while @@rowcount > 0

    update #t

    set string1 = replace(string1, ' ', ' ') -- 2 spaces to 1 space

    where string1 like '% %'

    print replicate('=', 25) + 'Test2'

    update #T set string2 = replace(replace(replace(string2, ' ', '¬ '), ' ¬', ''), '¬', '') where string2 like '% %'

    print replicate('=', 25) + 'Test3'

    update #t

    set string1 = replace(string3, replicate(' ', 50), ' ') -- 50 spaces to 1 space

    where string1 like '% %'

    update #t

    set string1 = replace(string3, replicate(' ', 25), ' ') -- 25 spaces to 1 space

    where string1 like '% %'

    update #t

    set string1 = replace(string3, replicate(' ', 12), ' ') -- 12 spaces to 1 space

    where string1 like '% %'

    update #t

    set string1 = replace(string3, replicate(' ', 6), ' ') -- 6 spaces to 1 space

    where string1 like '% %'

    update #t

    set string1 = replace(string3, ' ', ' ') -- 2 spaces to 1 space

    where string1 like '% %'

    while @@rowcount > 0

    update #t

    set string1 = replace(string3, ' ', ' ') -- 2 spaces to 1 space

    where string1 like '% %'

    Test1, which uses a simple loop to consolidate 2 spaces at a time into 1 space, took 304,352 milliseconds total time to execute.

    Test2, using the nested chain replace, took 98,880 total (much better)

    Test3, using staged replaces and a final loop, took 2,491 total.

    After that, I realized there could be another test.

    (Same table, same data generation, new data though, so not entirely scientific, but probably close enough.)

    update #T

    set string1 =

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(string1, replicate(' ', 50), '¬ '),

    replicate(' ', 25), '¬ '),

    replicate(' ', 12), '¬ '),

    replicate(' ', 6), '¬ '),

    replicate(' ', 2), '¬ '),

    ' ¬', ''),

    '¬', '')

    Total time 43,694 milliseconds. Faster than the simpler nested replace, but it doesn't actually work, because it left a certain number of double-spaces. (Someone might be able to rewrite it to get the final job done, but at the speed difference between that and the staged updates, I'm not going to bother.)

    The staged updates is almost 40 times faster in this test than the nested replaces. The nested replaces looks more clever, but that speed difference, I'll go with the less elegant solution.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • smunson (7/16/2008)


    There is a potential pitfall here, folks: Any time there are exactly X spaces between words, where X is greater than 2, you have a problem with any technique OTHER than replacing X spaces with one, because if there are exactly X spaces between a given pair of words, a replace with nothing will merge the words. Also, with X > 2, you may not catch all the extra spaces. Keep that in mind for any replacement technique.

    I much prefer the shortest amount of code to do this, and the idea behind a repeated loop to replace all occurrences of two spaces with one until there aren't any more such occurrences. It's efficient, fairly fast, and it won't merge words.

    Steve

    :):):)

    Since I just tested the other solution on 1-million rows of 5 random lengths of multi-space strings each, I can say that the clever solution proposed in this thread did not produce any wrong results. No double-spaces, and all the expected single-spaces were intact. So, I wouldn't worry about that in this case.

    The clever solution just isn't fast enough for me. It is just fine on accuracy.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared,

    The reason I wasn't all that worried about performance is because this kind of fixing is generally a one-time event, and even if it took an hour, the overall net result of accuracy FAR outweighs performance for one-time fixes of this nature. Also, I have yet to come across millions of rows of that kind of data problem. Does anyone know of any truly large DB's with this kind of poorly entered info that has persisted for any length of time? Failing same, I'll go for accuracy first, performance 2nd - at least in this case anyway... It just seems to me to be a rather small price to pay, and without accuracy, one could be left with the same problem and just have to adopt the more accurate solution later anyway. Just my two cents...

    Steve

    (aka smunson)

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • As usual, different methods will be better in different situations. They're all tools for the toolbox 🙂

    GSquared, your last idea also cropped up in 2002...

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=21511

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(colname,

    REPLICATE(' ',21),' '),

    REPLICATE(' ', 6),' '),

    REPLICATE(' ', 3),' '),

    REPLICATE(' ', 2),' '),

    REPLICATE(' ', 2),' ')

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • smunson (7/16/2008)


    GSquared,

    The reason I wasn't all that worried about performance is because this kind of fixing is generally a one-time event, and even if it took an hour, the overall net result of accuracy FAR outweighs performance for one-time fixes of this nature. Also, I have yet to come across millions of rows of that kind of data problem. Does anyone know of any truly large DB's with this kind of poorly entered info that has persisted for any length of time? Failing same, I'll go for accuracy first, performance 2nd - at least in this case anyway... It just seems to me to be a rather small price to pay, and without accuracy, one could be left with the same problem and just have to adopt the more accurate solution later anyway. Just my two cents...

    Steve

    (aka smunson)

    :):):)

    Since all listed solutions are completely accurate, and one is much faster than the others, what's wrong with picking the faster one?

    Assuming it's only a few rows, with no more than 10 spaces to be replaced per row, the easiest solution is the simple while loop, the most elegant is the nested replaces, and you say you're not worried about the fastest.

    Being willing to allow a 1-hour table-lock in a production database, I would shy away from that idea. Sure, it could be done overnight, if you don't have significant traffic at that time, but if that's also when you're running backups, warehousing data to analysis cubes, updating table statistics, etc., I'd still much prefer a 2-second table lock to a 1-hour table lock.

    It's not a question of one solution being more accurate than another. They all produce perfect results (except my last attempt, which I already mentioned I abandoned because of that, and which was 2nd place in performance even so).

    On the question of large databases with dirty data, it doesn't matter if it's persisted for any length of time. What if you have to load data like that every night from a bulk load? You'd still need a solution to the problem.

    As I've mentioned many times before, I'll take the highest speed code that creates the desired results, every time, over lower speed code, even with the same results. After all, why not?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply