November 16, 2009 at 9:01 am
" ||*9*9|| " solution is funtastic. Great job and thanks a lot .
November 16, 2009 at 9:09 am
😎 Thanks, Jeff!
November 16, 2009 at 9:33 am
Good solution Jeff, and well documented. However, I was a tad disappointed that you included the LTRIM(RTRIM()). That is unnecessary overhead that is unrelated to solving the stated problem (i.e., removal of multiple contiguous spaces). 😀
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 16, 2009 at 10:01 am
I started writing this reply saying that this isn't very efficient at all...
However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s), which worked on the premise that the overhead of a function was worthwhile since the loop passed through the string only once. This premise was wrong, very wrong. The REPLACE solution performed four times better than the slimmest WHILE loop implementation I had.
I will submit this finding: it performed even better when compiled as an inline function.
The only limitation is that there doesn't seem to be a way to clean other whitespace characters without using lots of REPLACEs. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text characters: it reduces LF, CR, and tab to a single space, then simply ignores any non-space character that doesn't fall between 33 and 126 in the ASCII table while converting multiple spaces to a single space. I realize it would only take a couple of REPLACEs to fix the whitespace characters, but filtering the non-standard text characters is not so easy.
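For readers following along, the cleaning rules described above (fold LF/CR/tab into spaces, drop characters outside ASCII 33-126, collapse runs of spaces) can be sketched language-neutrally. This is a rough Python illustration of those rules, not the poster's actual function:

```python
def clean_text(s):
    # Fold LF, CR, and tab into spaces.
    for ws in ("\n", "\r", "\t"):
        s = s.replace(ws, " ")
    # Keep spaces and printable ASCII (33-126); drop everything else.
    s = "".join(ch for ch in s if ch == " " or 33 <= ord(ch) <= 126)
    # Collapse runs of spaces to a single space.
    while "  " in s:
        s = s.replace("  ", " ")
    return s

print(clean_text("a\t\tb\x01c"))  # "a bc"
```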
November 16, 2009 at 10:18 am
Here are some results comparing SQL Query/SQL function/SQL C# CLR using a modified version of Thiago's script (100k records on my laptop):
SQL Query : 4 minutes (2009-11-16 17:05:20.670 -> 2009-11-16 17:09:33.190)
SQL Function : 1 min 33 seconds (2009-11-16 17:09:33.190 -> 2009-11-16 17:11:02.277)
SQL CLR C# : 14 seconds (2009-11-16 17:11:02.277 -> 2009-11-16 17:11:16.077)
We might have expected C# to carry the day, but the Query/Function result is surprising.
Can anyone explain why? (Is this down to SQL Server's optimization abilities?)
Is this test reliable?
--PREPARE
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 100K ROWS WITH RANDOM SPACES INTO EACH TABLE; SEPARATE TABLES ARE USED TO AVOID CACHING. THIS MAY TAKE QUITE A WHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 100000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
SELECT GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
GO
SELECT GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
GO
SELECT GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
GO
SELECT GETDATE()
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
C# Gnu
____________________________________________________
November 16, 2009 at 10:39 am
newjcb (11/16/2009)
I started writing this reply saying that this isn't very efficient at all... However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s), which worked on the premise that the overhead of a function was worthwhile since the loop passed through the string only once. This premise was wrong, very wrong. The REPLACE solution performed four times better than the slimmest WHILE loop implementation I had.
I will submit this finding: it performed even better when compiled as an inline function.
The only limitation is that there doesn't seem to be a way to clean other whitespace characters without using lots of REPLACEs. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text characters: it reduces LF, CR, and tab to a single space, then simply ignores any non-space character that doesn't fall between 33 and 126 in the ASCII table while converting multiple spaces to a single space. I realize it would only take a couple of REPLACEs to fix the whitespace characters, but filtering the non-standard text characters is not so easy.
I wonder if CLR wouldn't be better to help you solve the flexible solution you require...
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 16, 2009 at 10:47 am
I wonder if CLR wouldn't be better to help you solve the flexible solution you require...
Yep, (if) my test results in my post above are correct we are looking at CLR at least 10x faster, or execution will take just one tenth of time.
I have another really good use of CLR as solution to sp_OACreate security issues - I will have to post as an article!
C# Gnu
____________________________________________________
November 16, 2009 at 10:51 am
I didn't see it mentioned later on, but my solution in the past has been very simple: nest REPLACE() of two spaces with one space as deep as required for the maximum number of contiguous spaces in the field; i.e., for VARCHAR(8000), I nest it 13 deep, as 2^13 = 8192.
Note that the OX-style replacing, when X is multiple characters, can theoretically result in severe internal fragmentation as the strings become longer (and thus fewer rows fit per page... if the pages were already full, they have to split).
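The 2^13 arithmetic works because each pass of replacing two spaces with one halves the longest run of spaces (rounding up), so ceil(log2(n)) passes collapse a run of n spaces. A quick sketch of that doubling argument, in Python rather than nested T-SQL REPLACEs:

```python
def collapse(s, passes):
    # Each pass halves the longest run of spaces (rounding up),
    # so ceil(log2(n)) passes collapse a run of n spaces to one.
    for _ in range(passes):
        s = s.replace("  ", " ")
    return s

# 13 passes suffice for VARCHAR(8000), since 2**13 = 8192 >= 8000.
print(collapse("a" + " " * 8000 + "b", 13))  # "a b"
```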
November 16, 2009 at 10:58 am
brigzy (11/16/2009)
Here are some results comparing SQL Query/SQL function/SQL C# CLR using a modified version of Thiago's script (100k records on my laptop):SQL Query : 4 minutes (2009-11-16 17:05:20.670 -> 2009-11-16 17:09:33.190)
SQL Function : 1 min 33 seconds (2009-11-16 17:09:33.190 -> 2009-11-16 17:11:02.277)
SQL CLR C# : 14 seconds (2009-11-16 17:11:02.277 -> 2009-11-16 17:11:16.077)
We might have expected C# to carry the day, but the Query/Function result is surprising.
Can anyone explain why? (Is this down to SQL Server's optimization abilities?)
Is this test reliable?
--PREPARE
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 100000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
SELECT GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
GO
SELECT GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
GO
SELECT GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
GO
SELECT GETDATE()
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
I got very different results from yours: I used only 10000 rows and could already see the difference.
SELECT GETDATE() == 2009-11-16 12:44:40.250
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
GO
SELECT GETDATE() == 2009-11-16 12:44:40.967
So the original version (which I don't like because it can drastically increase the size of the string) took 0.717 seconds
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
GO
SELECT GETDATE() == 2009-11-16 12:44:43.297
Your while loop version took: 43.297 - 40.967 = 2.33 seconds!
UPDATE #TEMP2 SET COL5= dbo.tool_CleanTextSpaces(COL2)
FROM #TEMP2
GO
SELECT GETDATE() == 2009-11-16 12:44:43.940
This is my version which uses char(7) and Ltrim & rtrim before the replaces: 43.940 - 43.297 = 0.643 seconds!
I didn't test your CLR, so it may have been faster, but without a doubt, the while loop replace will always be slower.
EDIT: I tested with 100k and got
2009-11-16 12:57:15.420
2009-11-16 12:57:26.033 -- Means ||*9.. took 10.613 seconds
2009-11-16 12:58:00.903 -- Means function While CharIndex replace version took 34.87 seconds
2009-11-16 12:58:09.117 -- Means tool_CleanTextSpaces took 8.214 Seconds
Confirms above even more.
November 16, 2009 at 11:10 am
TheSQLGuru (11/16/2009)
newjcb (11/16/2009)
I started writing this reply saying that this isn't very efficient at all... However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s), which worked on the premise that the overhead of a function was worthwhile since the loop passed through the string only once. This premise was wrong, very wrong. The REPLACE solution performed four times better than the slimmest WHILE loop implementation I had.
I will submit this finding: it performed even better when compiled as an inline function.
The only limitation is that there doesn't seem to be a way to clean other whitespace characters without using lots of REPLACEs. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text characters: it reduces LF, CR, and tab to a single space, then simply ignores any non-space character that doesn't fall between 33 and 126 in the ASCII table while converting multiple spaces to a single space. I realize it would only take a couple of REPLACEs to fix the whitespace characters, but filtering the non-standard text characters is not so easy.
I wonder if CLR wouldn't be better to help you solve the flexible solution you require...
It probably would. I currently use only one CLR function, the one that enables REGEX comparison. I have been debating doing more recently...
November 16, 2009 at 11:23 am
I have run the test again and I get:
(milliseconds)
Batch execution completed 10000 times.
SQL:
6026
SQL:Looping
2700
SQL:CLR
433
It is weird, as I too would have expected looping to be slower.
Here is the script with time output messages:
(please can you post your SQL function or add it to this script):
--PREPARE
SET NOCOUNT ON
go
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 10K ROWS WITH RANDOM SPACES INTO EACH TABLE; SEPARATE TABLES ARE USED TO AVOID CACHING. THIS MAY TAKE QUITE A WHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 10000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
PRINT 'SQL:'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
GO
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
PRINT 'SQL:Looping'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
go
DECLARE @TheTime DATETIME
SET @TheTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
PRINT 'SQL:CLR'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
C# Gnu
____________________________________________________
November 16, 2009 at 11:37 am
johan.vandamme (11/16/2009)
One other point to consider is sentence endings. Normally there are two spaces at the end of sentences (for readability, etc.), and yes, not everyone uses them. So a second character is necessary to "keep double spaces following a period".
To get around this and extended characters problems, a regex solution through CLR seems to be the best generalized solution for all replace issues, not just spaces.
Added the CLR code for regex, in case anyone's interested:
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
namespace sql.clr
{
public partial class sql2K8Functions
{
[Microsoft.SqlServer.Server.SqlFunction()]
public static bool fRegExMatch(SqlString Input, SqlString Pattern)
{
// Check for NULLs before constructing the Regex
if (Input.IsNull || Pattern.IsNull)
{
return false;
}
Regex RegexInstance = new Regex(Pattern.ToString());
return RegexInstance.IsMatch(Input.ToString());
}
[Microsoft.SqlServer.Server.SqlFunction()]
public static string fRegExReplace(SqlString Input, SqlString Pattern, SqlString Replacement)
{
// Check for NULLs before constructing the Regex
if (Input.IsNull || Pattern.IsNull || Replacement.IsNull)
{
return null;
}
Regex RegexInstance = new Regex(Pattern.ToString());
return RegexInstance.Replace(Input.ToString(), Replacement.ToString());
}
}
}
In terms of speed, the CLR version performs faster than the built-in REPLACE. Queries that used to take minutes now finish in seconds.
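As a usage sketch, a regex engine collapses the spaces in a single pass; in Python terms (the CLR fRegExReplace above would be called with the same pattern from T-SQL):

```python
import re

def collapse_spaces(s):
    # One pass: any run of two or more spaces becomes a single space.
    # r"\s{2,}" would additionally fold tabs and newlines into one space.
    return re.sub(r" {2,}", " ", s)

print(collapse_spaces("a   b  c"))  # "a b c"
```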
November 16, 2009 at 11:39 am
@brigzy
Before I go ahead and try again, can you first try switching the order around? Try running the loop version first, then the select, then CLR. Your RAID controller may be caching the temp table, causing all queries after the first to run under different IO circumstances.
November 16, 2009 at 11:42 am
newjcb (11/16/2009)
@brigzy Before I go ahead and try again, can you first try switching the order around? Try running the loop version first, then the select, then CLR. Your RAID controller may be caching the temp table, causing all queries after the first to run under different IO circumstances.
Beginning execution loop
Batch execution completed 10000 times.
SQL:Looping
2706
SQL:
6206
SQL:CLR
440
--PREPARE
SET NOCOUNT ON
go
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 10K ROWS WITH RANDOM SPACES INTO EACH TABLE; SEPARATE TABLES ARE USED TO AVOID CACHING. THIS MAY TAKE QUITE A WHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 10000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
PRINT 'SQL:Looping'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
go
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
PRINT 'SQL:'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
GO
DECLARE @TheTime DATETIME
SET @TheTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
PRINT 'SQL:CLR'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
C# Gnu
____________________________________________________
November 16, 2009 at 11:42 am
@Jeff
Nice article! Thanks for sharing. 🙂
@brigzy
You could try working without the "GO N" feature (which is a feature of SSMS, not of T-SQL).
I just tried a simple CLR function and Jeff's way to clean data with one table and 100,000 rows to be cleaned.
CLR (for sure, compiled as Release version):
[Microsoft.SqlServer.Server.SqlFunction(
DataAccess=DataAccessKind.None,
IsDeterministic=true,
IsPrecise=false,
SystemDataAccess=SystemDataAccessKind.None)
]
public static SqlString RemoveDoubleSpacesClr(string inData) {
while (inData.IndexOf("  ") != -1) {
inData = inData.Replace("  ", " ");
}
return inData;
}
My test:
SET NOCOUNT ON;
---==================================================================
-- Table with data to be cleaned
IF (OBJECT_ID('tempdb..#TestCleanSpaces') IS NULL)
CREATE TABLE #TestCleanSpaces
(
Id INT NOT NULL IDENTITY(1,1)
PRIMARY KEY CLUSTERED
,Data VARCHAR(8000)
);
ELSE
TRUNCATE TABLE #TestCleanSpaces;
---==================================================================
-- create some test data
WITH random (r1, r2, r3) AS
(
SELECT
ABS(BINARY_CHECKSUM(NEWID()))
,ABS(BINARY_CHECKSUM(NEWID()))
,ABS(BINARY_CHECKSUM(NEWID()))
),
Numbers (Num) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM dbo.Numbers n1
CROSS JOIN dbo.Numbers n2
)
INSERT INTO #TestCleanSpaces
-- configure count of test data
SELECT TOP(100000)
REPLICATE(LEFT(CONVERT(NCHAR(36), NEWID()), r1 % 36) + SPACE(r2 % 20), r3 % 20)
FROM Numbers n1
CROSS JOIN random
---==================================================================
-- test clr
PRINT ('---==========================================================');
PRINT ('-- CLR');
SET STATISTICS TIME ON;
SELECT
dbo.RemoveDoubleSpacesClr(Data)
FROM #TestCleanSpaces;
SET STATISTICS TIME OFF;
---==================================================================
-- test Jeff's function
PRINT ('---==========================================================');
PRINT ('-- Jeff''s function');
SET STATISTICS TIME ON;
SELECT REPLACE(
REPLACE(
REPLACE(
LTRIM(RTRIM(Data))
,'  ',' '+CHAR(7)) --Changes 2 spaces to the OX model
,CHAR(7)+' ','') --Changes the XO model to nothing
,CHAR(7),'')
FROM #TestCleanSpaces;
SET STATISTICS TIME OFF;
I get the following result:
---==========================================================
-- CLR
SQL Server Execution Times:
CPU time = 1015 ms, elapsed time = 2667 ms.
---==========================================================
-- Jeff's function
SQL Server Execution Times:
CPU time = 1125 ms, elapsed time = 2505 ms.
Both functions are almost equal independent how often I execute the script. Sometimes CLR is slightly faster, sometimes Jeff's function is faster.
Greets
Flo
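For anyone following the thread, the three-REPLACE "OX" trick being timed throughout can be sketched outside T-SQL. Here is a rough Python rendering of the same steps, using CHAR(7) as the marker and assuming, as the T-SQL version does, that the marker never occurs in the data:

```python
MARK = "\x07"  # CHAR(7), assumed absent from the input

def collapse_ox(s):
    s = s.strip()                    # the LTRIM(RTRIM(...)) step
    s = s.replace("  ", " " + MARK)  # two spaces -> space + mark ("OX")
    s = s.replace(MARK + " ", "")    # mark + space ("XO") -> nothing
    return s.replace(MARK, "")       # any leftover marks -> nothing

print(collapse_ox("  a     b  c "))  # "a b c"
```

Three fixed passes handle runs of any length, which is why this beats the WHILE CHARINDEX loop in the timings above.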