remove spaces more than double

  • GSquared,

    My concern was with any processes that replaced any more than 2 spaces with none, which has a potential problem, as I pointed out previously.

    With regard to your example of re-occurring bulk data loads, I'd absolutely insist on cleaning up the data long before it was ready for bulk insert. The risks associated with accepting data from sources that provide it in such a "dirty" fashion often go far beyond a mere few extra spaces, and that kind of data integrity problem is no way to run ANY business. Merely cleaning up the spaces has far too much potential for either avoiding or entirely ignoring data integrity problems - and at a level no one should be willing to accept. Ever... and no matter what. And, generally, until management actually feels the pinch from an outage caused by poor data, they may continue to just go happily along believing that they can get away with running a business that way. Again, just my two cents.

    Steve

    (aka smunson)

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • RyanRandall (7/16/2008)


    As usual, different methods will be better in different situations. They're all tools for the toolbox 🙂

    GSquared, your last idea also cropped up in 2002...

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=21511

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(colname,

    REPLICATE(' ',21),' '),

    REPLICATE(' ', 6),' '),

    REPLICATE(' ', 3),' '),

    REPLICATE(' ', 2),' '),

    REPLICATE(' ', 2),' ')

    That would need to be nested inside the other replace (with the placeholder character), in case the number of spaces was a multiple that wouldn't be fixed by those exact replaces.

    Here's a test:

    create table #T (

    Number int,

    String varchar(1000))

    insert into #t (number, string)

    select number, '|' + replicate(' ', number) + '|'

    from dbo.numbers

    where number between 1 and 998

    update #t

    set string = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(string,

    REPLICATE(' ',21),' '),

    REPLICATE(' ', 6),' '),

    REPLICATE(' ', 3),' '),

    REPLICATE(' ', 2),' '),

    REPLICATE(' ', 2),' ')

    select * -- Test for wrong results

    from #t

    where string like '% %' -- Still has double-space

    or string = '||' -- All spaces eliminated

    Out of 998 rows, 120 ended up with double-spacing still in them. They were all places where it was a large number of spaces. No rows had all spaces eliminated.

    Probably not applicable in any normal situation, but worth keeping in mind. I don't like cases where it's mathematically possible to end up with bad data.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (7/16/2008)


    Out of 998 rows, 120 ended up with double-spacing still in them.

    Slow down!

    The link from 2002 says:

    I'm pretty convinced now that it requires 6 REPLACEs. I think that, using 5 REPLACEs it is possible to collapse up to 461 spaces (using the sequence* 21, 6, 3, 2, 2). With 6 REPLACEs, this increases to 53591 (prepending 231 to the sequence).

    If you take this into account it works just fine 🙂

    Thanks for doing the tests by the way. There are plenty around from over the years, but it's always interesting to see more.

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Makes sense. (I didn't actually click through the link, because I kind of overlooked it in my overenthusiam for testing.)

    On the original test table, with 1-million rows, that nested replace method took 52,671. Faster than the one with the placeholder character, but still much slower than the staged replace.

    Results were completely accurate.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks GSquared.

    By the way, I've just noticed that Test3 references string1 several times. Is that supposed to be the case?

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Nope. That comes from me editing (I changed some stuff for clarity) after I tested. It's a typo. (Par for the course today. I try to make something more clear, and it ends up with lead-like transparency!)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Try creating a function like the following:

    create function [dbo].[removeMultiSpace] (@cString varchar(300) )

    returns varchar(300)

    as

    BEGIN

    declare @k smallint, @left varchar(300), @right varchar(300)

    select @cString = ltrim(rtrim(@cString))

    select @k = len(ltrim(rtrim(@cString)))

    IF @k > 1

    BEGIN

    SELECT @k = CHARINDEX(' ',@cString)

    WHILE @k > 0

    BEGIN

    SELECT @left = SUBSTRING(@cString,1,(@k - 1))

    SELECT @right = SUBSTRING(@cString,@k,LEN(@cString))

    SET @cString = @left + ' ' + ltrim(rtrim(@right))

    SELECT @k = CHARINDEX(' ',@cString)

    END

    END

    RETURN @cString

    END

    ;

    .. and test that with the following:

    IF OBJECT_ID (N'DBO.TEMP1', N'U') IS NOT NULL DROP TABLE DBO.TEMP1;

    create table DBO.TEMP1 (COL1 varchar(30));

    insert into DBO.TEMP1 VALUES ('JOHN SMITH');

    insert into DBO.TEMP1 VALUES ('JOHN SMITH');

    insert into DBO.TEMP1 VALUES ('JOHN SMITH');

    insert into DBO.TEMP1 VALUES ('JOHN SMITH');

    SELECT DBO.removeMultiSpace(COL1) FROM DBO.TEMP1;

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply