July 16, 2008 at 10:47 am
GSquared,
My concern was with any processes that replaced any more than 2 spaces with none, which has a potential problem, as I pointed out previously.
With regard to your example of re-occurring bulk data loads, I'd absolutely insist on cleaning up the data long before it was ready for bulk insert. The risks associated with accepting data from sources that provide it in such a "dirty" fashion often go far beyond a mere few extra spaces, and that kind of data integrity problem is no way to run ANY business. Merely cleaning up the spaces has far too much potential for either avoiding or entirely ignoring data integrity problems - and at a level no one should be willing to accept. Ever... and no matter what. And, generally, until management actually feels the pinch from an outage caused by poor data, they may continue to just go happily along believing that they can get away with running a business that way. Again, just my two cents.
Steve
(aka smunson)
:):):)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 16, 2008 at 10:52 am
RyanRandall (7/16/2008)
As usual, different methods will be better in different situations. They're all tools for the toolbox 🙂GSquared, your last idea also cropped up in 2002...
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=21511
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(colname,
REPLICATE(' ',21),' '),
REPLICATE(' ', 6),' '),
REPLICATE(' ', 3),' '),
REPLICATE(' ', 2),' '),
REPLICATE(' ', 2),' ')
That would need to be nested inside the other replace (with the placeholder character), in case the number of spaces was a multiple that wouldn't be fixed by those exact replaces.
Here's a test:
create table #T (
Number int,
String varchar(1000))
insert into #t (number, string)
select number, '|' + replicate(' ', number) + '|'
from dbo.numbers
where number between 1 and 998
update #t
set string = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(string,
REPLICATE(' ',21),' '),
REPLICATE(' ', 6),' '),
REPLICATE(' ', 3),' '),
REPLICATE(' ', 2),' '),
REPLICATE(' ', 2),' ')
select * -- Test for wrong results
from #t
where string like '% %' -- Still has double-space
or string = '||' -- All spaces eliminated
Out of 998 rows, 120 ended up with double-spacing still in them. They were all places where it was a large number of spaces. No rows had all spaces eliminated.
Probably not applicable in any normal situation, but worth keeping in mind. I don't like cases where it's mathematically possible to end up with bad data.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 16, 2008 at 11:14 am
GSquared (7/16/2008)
Out of 998 rows, 120 ended up with double-spacing still in them.
Slow down!
The link from 2002 says:
I'm pretty convinced now that it requires 6 REPLACEs. I think that, using 5 REPLACEs it is possible to collapse up to 461 spaces (using the sequence* 21, 6, 3, 2, 2). With 6 REPLACEs, this increases to 53591 (prepending 231 to the sequence).
If you take this into account it works just fine 🙂
Thanks for doing the tests by the way. There are plenty around from over the years, but it's always interesting to see more.
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
July 16, 2008 at 11:22 am
Makes sense. (I didn't actually click through the link, because I kind of overlooked it in my overenthusiam for testing.)
On the original test table, with 1-million rows, that nested replace method took 52,671. Faster than the one with the placeholder character, but still much slower than the staged replace.
Results were completely accurate.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 16, 2008 at 12:23 pm
Thanks GSquared.
By the way, I've just noticed that Test3 references string1 several times. Is that supposed to be the case?
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
July 16, 2008 at 12:33 pm
Nope. That comes from me editing (I changed some stuff for clarity) after I tested. It's a typo. (Par for the course today. I try to make something more clear, and it ends up with lead-like transparency!)
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
July 18, 2008 at 6:07 am
Try creating a function like the following:
create function [dbo].[removeMultiSpace] (@cString varchar(300) )
returns varchar(300)
as
BEGIN
declare @k smallint, @left varchar(300), @right varchar(300)
select @cString = ltrim(rtrim(@cString))
select @k = len(ltrim(rtrim(@cString)))
IF @k > 1
BEGIN
SELECT @k = CHARINDEX(' ',@cString)
WHILE @k > 0
BEGIN
SELECT @left = SUBSTRING(@cString,1,(@k - 1))
SELECT @right = SUBSTRING(@cString,@k,LEN(@cString))
SET @cString = @left + ' ' + ltrim(rtrim(@right))
SELECT @k = CHARINDEX(' ',@cString)
END
END
RETURN @cString
END
;
.. and test that with the following:
IF OBJECT_ID (N'DBO.TEMP1', N'U') IS NOT NULL DROP TABLE DBO.TEMP1;
create table DBO.TEMP1 (COL1 varchar(30));
insert into DBO.TEMP1 VALUES ('JOHN SMITH');
insert into DBO.TEMP1 VALUES ('JOHN SMITH');
insert into DBO.TEMP1 VALUES ('JOHN SMITH');
insert into DBO.TEMP1 VALUES ('JOHN SMITH');
SELECT DBO.removeMultiSpace(COL1) FROM DBO.TEMP1;
Viewing 7 posts - 16 through 21 (of 21 total)
You must be logged in to reply to this topic. Login to reply