November 16, 2009 at 9:01 am
" ||*9*9|| " solution is funtastic. Great job and thanks a lot .
November 16, 2009 at 9:09 am
😎 Thanks, Jeff!
November 16, 2009 at 9:33 am
Good solution Jeff, and well documented. However, I was a tad disappointed that you included the LTRIM(RTRIM()). That is unnecessary overhead that is unrelated to solving the stated problem (i.e., removal of multiple contiguous spaces). 😀
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 16, 2009 at 10:01 am
I started writing this reply saying that this isn't very efficient at all...
However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s), which worked on the premise that the overhead of a function was worthwhile since the loop passed through the string only once. This premise was wrong, very wrong. The REPLACE solution performed four times better than the slimmest WHILE loop implementation I had.
I will submit this finding: it performed even better when compiled as an inline function.
The only limitation is that there doesn't seem to be a way to clean other whitespace characters without using lots of REPLACEs. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text characters: it reduces LF, CR, and tab to a single space, then simply ignores any non-space character that doesn't fall between 33 and 126 in the ASCII table while converting multiple spaces to a single space. I realize it would only take a couple of REPLACEs to fix the whitespace characters, but filtering the non-standard text characters is not so easy.
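For readers following along, the cleaning rules described above (fold LF/CR/tab into spaces, drop characters outside ASCII 33-126, collapse runs of spaces) can be sketched language-neutrally. This is a rough Python illustration of those rules, not the poster's actual function:

```python
def clean_text(s):
    # Fold LF, CR, and tab into spaces.
    for ws in ("\n", "\r", "\t"):
        s = s.replace(ws, " ")
    # Keep spaces and printable ASCII (33-126); drop everything else.
    s = "".join(ch for ch in s if ch == " " or 33 <= ord(ch) <= 126)
    # Collapse runs of spaces to a single space.
    while "  " in s:
        s = s.replace("  ", " ")
    return s

print(clean_text("a\t\tb\x01c"))  # "a bc"
```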
November 16, 2009 at 10:18 am
Here are some results comparing SQL Query/SQL function/SQL C# CLR using a modified version of Thiago's script (100k records on my laptop):
SQL Query : 4 minutes (2009-11-16 17:05:20.670 -> 2009-11-16 17:09:33.190)
SQL Function : 1 min 33 seconds (2009-11-16 17:09:33.190 -> 2009-11-16 17:11:02.277)
SQL CLR C# : 14 seconds (2009-11-16 17:11:02.277 -> 2009-11-16 17:11:16.077)
We might have expected C# to carry the day, but the Query/Function result is surprising.
Can anyone explain why? (Is this down to SQL Server's optimization abilities?)
Is this test reliable?
--PREPARE
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 100K ROWS WITH RANDOM SPACES INTO EACH TABLE; SEPARATE TABLES ARE USED TO AVOID CACHING. THIS MAY TAKE QUITE A WHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 100000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
SELECT GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
GO
SELECT GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
GO
SELECT GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
GO
SELECT GETDATE()
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
C# Gnu
____________________________________________________
November 16, 2009 at 10:39 am
newjcb (11/16/2009)
I started writing this reply saying that this isn't very efficient at all... However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s), which worked on the premise that the overhead of a function was worthwhile since the loop passed through the string only once. This premise was wrong, very wrong. The REPLACE solution performed four times better than the slimmest WHILE loop implementation I had.
I will submit this finding: it performed even better when compiled as an inline function.
The only limitation is that there doesn't seem to be a way to clean other whitespace characters without using lots of REPLACEs. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text characters: it reduces LF, CR, and tab to a single space, then simply ignores any non-space character that doesn't fall between 33 and 126 in the ASCII table while converting multiple spaces to a single space. I realize it would only take a couple of REPLACEs to fix the whitespace characters, but filtering the non-standard text characters is not so easy.
I wonder if CLR wouldn't be better to help you solve the flexible solution you require...
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 16, 2009 at 10:47 am
I wonder if CLR wouldn't be better to help you solve the flexible solution you require...
Yep, (if) my test results in my post above are correct we are looking at CLR at least 10x faster, or execution will take just one tenth of time.
I have another really good use of CLR as solution to sp_OACreate security issues - I will have to post as an article!
C# Gnu
____________________________________________________
November 16, 2009 at 10:51 am
I didn't see it mentioned later on, but my solution in the past has been very simple: nest REPLACE() of two spaces with one space as deep as required for the maximum number of contiguous spaces in the field; i.e., for VARCHAR(8000), I nest it 13 deep, as 2^13 = 8192.
Note that the OX-style replacing, when X is multiple characters, can theoretically result in severe internal fragmentation as the strings become longer (and thus fewer rows fit per page... if the pages were already full, they have to split).
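The 2^13 arithmetic works because each pass of replacing two spaces with one halves the longest run of spaces (rounding up), so ceil(log2(n)) passes collapse a run of n spaces. A quick sketch of that doubling argument, in Python rather than nested T-SQL REPLACEs:

```python
def collapse(s, passes):
    # Each pass halves the longest run of spaces (rounding up),
    # so ceil(log2(n)) passes collapse a run of n spaces to one.
    for _ in range(passes):
        s = s.replace("  ", " ")
    return s

# 13 passes suffice for VARCHAR(8000), since 2**13 = 8192 >= 8000.
print(collapse("a" + " " * 8000 + "b", 13))  # "a b"
```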
November 16, 2009 at 10:58 am
brigzy (11/16/2009)
Here are some results comparing SQL Query/SQL function/SQL C# CLR using a modified version of Thiago's script (100k records on my laptop):SQL Query : 4 minutes (2009-11-16 17:05:20.670 -> 2009-11-16 17:09:33.190)
SQL Function : 1 min 33 seconds (2009-11-16 17:09:33.190 -> 2009-11-16 17:11:02.277)
SQL CLR C# : 14 seconds (2009-11-16 17:11:02.277 -> 2009-11-16 17:11:16.077)
We might have expected C# to carry the day, but the Query/Function result is surprising.
Can anyone explain why? (Is this down to SQL Server's optimization abilities?)
Is this test reliable?
--PREPARE
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 100000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
SELECT GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
GO
SELECT GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
GO
SELECT GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
GO
SELECT GETDATE()
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
I got very different results from yours: I used only 10000 rows and could already see the difference.
SELECT GETDATE() == 2009-11-16 12:44:40.250
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
GO
SELECT GETDATE() == 2009-11-16 12:44:40.967
So the original version (which I don't like because it can drastically increase the size of the string) took 0.717 seconds
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
GO
SELECT GETDATE() == 2009-11-16 12:44:43.297
Your while loop version took: 43.297 - 40.967 = 2.33 seconds!
UPDATE #TEMP2 SET COL5= dbo.tool_CleanTextSpaces(COL2)
FROM #TEMP2
GO
SELECT GETDATE() == 2009-11-16 12:44:43.940
This is my version which uses char(7) and Ltrim & rtrim before the replaces: 43.940 - 43.297 = 0.643 seconds!
I didn't test your CLR, so it may have been faster, but without a doubt, the while loop replace will always be slower.
EDIT: I tested with 100k and got
2009-11-16 12:57:15.420
2009-11-16 12:57:26.033 -- Means ||*9.. took 10.613 seconds
2009-11-16 12:58:00.903 -- Means function While CharIndex replace version took 34.87 seconds
2009-11-16 12:58:09.117 -- Means tool_CleanTextSpaces took 8.214 Seconds
Confirms above even more.
November 16, 2009 at 11:10 am
TheSQLGuru (11/16/2009)
newjcb (11/16/2009)
I started writing this reply saying that this isn't very efficient at all... However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s), which worked on the premise that the overhead of a function was worthwhile since the loop passed through the string only once. This premise was wrong, very wrong. The REPLACE solution performed four times better than the slimmest WHILE loop implementation I had.
I will submit this finding: it performed even better when compiled as an inline function.
The only limitation is that there doesn't seem to be a way to clean other whitespace characters without using lots of REPLACEs. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text characters: it reduces LF, CR, and tab to a single space, then simply ignores any non-space character that doesn't fall between 33 and 126 in the ASCII table while converting multiple spaces to a single space. I realize it would only take a couple of REPLACEs to fix the whitespace characters, but filtering the non-standard text characters is not so easy.
I wonder if CLR wouldn't be better to help you solve the flexible solution you require...
It probably would. I currently use only one CLR function, the one that enables REGEX comparison. I have been debating doing more recently...
November 16, 2009 at 11:23 am
I have run the test again and I get:
(milliseconds)
Batch execution completed 10000 times.
SQL:
6026
SQL:Looping
2700
SQL:CLR
433
It is weird, as I too would have expected looping to be slower.
Here is the script with time output messages:
(please can you post your SQL function or add it to this script):
--PREPARE
SET NOCOUNT ON
go
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 10K ROWS WITH RANDOM SPACES INTO EACH TABLE; SEPARATE TABLES ARE USED TO AVOID CACHING. THIS MAY TAKE QUITE A WHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 10000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
PRINT 'SQL:'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
GO
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
PRINT 'SQL:Looping'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
go
DECLARE @TheTime DATETIME
SET @TheTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
PRINT 'SQL:CLR'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
C# Gnu
____________________________________________________
November 16, 2009 at 11:37 am
johan.vandamme (11/16/2009)
One other point to consider is sentence endings. Normally there are two spaces at the end of sentences (for readability, etc.), and yes, not everyone uses them. So a second character is necessary to "keep double spaces following a period".
To get around this and extended characters problems, a regex solution through CLR seems to be the best generalized solution for all replace issues, not just spaces.
Added the CLR code for regex, in case anyone's interested:
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
namespace sql.clr
{
public partial class sql2K8Functions
{
[Microsoft.SqlServer.Server.SqlFunction()]
public static bool fRegExMatch(SqlString Input, SqlString Pattern)
{
// Check for NULLs before constructing the Regex
if (Input.IsNull || Pattern.IsNull)
{
return false;
}
Regex RegexInstance = new Regex(Pattern.ToString());
return RegexInstance.IsMatch(Input.ToString());
}
[Microsoft.SqlServer.Server.SqlFunction()]
public static string fRegExReplace(SqlString Input, SqlString Pattern, SqlString Replacement)
{
// Check for NULLs before constructing the Regex
if (Input.IsNull || Pattern.IsNull || Replacement.IsNull)
{
return null;
}
Regex RegexInstance = new Regex(Pattern.ToString());
return RegexInstance.Replace(Input.ToString(), Replacement.ToString());
}
}
}
In terms of speed, the CLR version performs faster than the built-in REPLACE. Queries that used to take minutes now finish in seconds.
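As a usage sketch, a regex engine collapses the spaces in a single pass; in Python terms (the CLR fRegExReplace above would be called with the same pattern from T-SQL):

```python
import re

def collapse_spaces(s):
    # One pass: any run of two or more spaces becomes a single space.
    # r"\s{2,}" would additionally fold tabs and newlines into one space.
    return re.sub(r" {2,}", " ", s)

print(collapse_spaces("a   b  c"))  # "a b c"
```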
November 16, 2009 at 11:39 am
@brigzy
Before I go ahead and try again, can you first try switching the order around? Try running the loop version first, then the select, then CLR. Your RAID controller may be caching the temp table, causing all queries after the first to run under different IO circumstances.
November 16, 2009 at 11:42 am
newjcb (11/16/2009)
@brigzy Before I go ahead and try again, can you first try switching the order around? Try running the loop version first, then the select, then CLR. Your RAID controller may be caching the temp table, causing all queries after the first to run under different IO circumstances.
Beginning execution loop
Batch execution completed 10000 times.
SQL:Looping
2706
SQL:
6206
SQL:CLR
440
--PREPARE
SET NOCOUNT ON
go
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 10K ROWS WITH RANDOM SPACES INTO EACH TABLE; SEPARATE TABLES ARE USED TO AVOID CACHING. THIS MAY TAKE QUITE A WHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 10000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
PRINT 'SQL:Looping'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
go
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,'  ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
PRINT 'SQL:'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
GO
DECLARE @TheTime DATETIME
SET @TheTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
PRINT 'SQL:CLR'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
C# Gnu
____________________________________________________
November 16, 2009 at 11:42 am
@Jeff
Nice article! Thanks for sharing. 🙂
@brigzy
You could try working without the "GO N" feature (which is a feature of SSMS, not of T-SQL).
I just tried a simple CLR function and Jeff's way to clean data with one table and 100,000 rows to be cleaned.
CLR (for sure, compiled as Release version):
[Microsoft.SqlServer.Server.SqlFunction(
DataAccess=DataAccessKind.None,
IsDeterministic=true,
IsPrecise=false,
SystemDataAccess=SystemDataAccessKind.None)
]
public static SqlString RemoveDoubleSpacesClr(string inData) {
while (inData.IndexOf("  ") != -1) {
inData = inData.Replace("  ", " ");
}
return inData;
}
My test:
SET NOCOUNT ON;
---==================================================================
-- Table with data to be cleaned
IF (OBJECT_ID('tempdb..#TestCleanSpaces') IS NULL)
CREATE TABLE #TestCleanSpaces
(
Id INT NOT NULL IDENTITY(1,1)
PRIMARY KEY CLUSTERED
,Data VARCHAR(8000)
);
ELSE
TRUNCATE TABLE #TestCleanSpaces;
---==================================================================
-- create some test data
WITH random (r1, r2, r3) AS
(
SELECT
ABS(BINARY_CHECKSUM(NEWID()))
,ABS(BINARY_CHECKSUM(NEWID()))
,ABS(BINARY_CHECKSUM(NEWID()))
),
Numbers (Num) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM dbo.Numbers n1
CROSS JOIN dbo.Numbers n2
)
INSERT INTO #TestCleanSpaces
-- configure count of test data
SELECT TOP(100000)
REPLICATE(LEFT(CONVERT(NCHAR(36), NEWID()), r1 % 36) + SPACE(r2 % 20), r3 % 20)
FROM Numbers n1
CROSS JOIN random
---==================================================================
-- test clr
PRINT ('---==========================================================');
PRINT ('-- CLR');
SET STATISTICS TIME ON;
SELECT
dbo.RemoveDoubleSpacesClr(Data)
FROM #TestCleanSpaces;
SET STATISTICS TIME OFF;
---==================================================================
-- test Jeff's function
PRINT ('---==========================================================');
PRINT ('-- Jeff''s function');
SET STATISTICS TIME ON;
SELECT REPLACE(
REPLACE(
REPLACE(
LTRIM(RTRIM(Data))
,'  ',' '+CHAR(7)) --Changes 2 spaces to the OX model
,CHAR(7)+' ','') --Changes the XO model to nothing
,CHAR(7),'')
FROM #TestCleanSpaces;
SET STATISTICS TIME OFF;
I get the following result:
---==========================================================
-- CLR
SQL Server Execution Times:
CPU time = 1015 ms, elapsed time = 2667 ms.
---==========================================================
-- Jeff's function
SQL Server Execution Times:
CPU time = 1125 ms, elapsed time = 2505 ms.
Both functions are almost equal independent how often I execute the script. Sometimes CLR is slightly faster, sometimes Jeff's function is faster.
Greets
Flo
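For anyone following the thread, the three-REPLACE "OX" trick being timed throughout can be sketched outside T-SQL. Here is a rough Python rendering of the same steps, using CHAR(7) as the marker and assuming, as the T-SQL version does, that the marker never occurs in the data:

```python
MARK = "\x07"  # CHAR(7), assumed absent from the input

def collapse_ox(s):
    s = s.strip()                    # the LTRIM(RTRIM(...)) step
    s = s.replace("  ", " " + MARK)  # two spaces -> space + mark ("OX")
    s = s.replace(MARK + " ", "")    # mark + space ("XO") -> nothing
    return s.replace(MARK, "")       # any leftover marks -> nothing

print(collapse_ox("  a     b  c "))  # "a b c"
```

Three fixed passes handle runs of any length, which is why this beats the WHILE CHARINDEX loop in the timings above.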