November 16, 2009 at 12:59 pm
I copied and pasted your exact code and deleted the CLR one. (I didn't run my function version.)
I also tested on our new 2008 server; my previous results were from a 2005 server.
Beginning execution loop
Batch execution completed 10000 times.
SQL:Looping
1433
SQL:
353
What are your system specs? Are you Express/Enterprise/Standard?
brigzy (11/16/2009)
newjcb (11/16/2009)
@brigzy Before I go ahead and try again, can you first try switching the order around? Try running the loop version first, then the select, then the CLR. Your RAID controller may be caching the temp table, causing all queries after the first to run under different IO circumstances.
Beginning execution loop
Batch execution completed 10000 times.
SQL:Looping
2706
SQL:
6206
SQL:CLR
440
--PREPARE
SET NOCOUNT ON
go
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 10000 ROWS WITH RANDOM SPACES INTO EACH TABLE; SEPARATE TABLES ARE USED TO AVOID CACHING; THIS MAY TAKE QUITE A WHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 10000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
PRINT 'SQL:Looping'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
go
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
PRINT 'SQL:'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
GO
DECLARE @TheTime DATETIME
SET @TheTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)
FROM #TEMP2
PRINT 'SQL:CLR'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
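For anyone who wants to sanity-check the collapse logic outside SQL Server, the WHILE-loop UDF above amounts to this (a Python sketch of the same algorithm; the function name is mine, and this is not the code that was timed):

```python
def clean_loop(s):
    # Same idea as dbo.fn_CleanUp: keep replacing two spaces with one
    # until no pair of adjacent spaces remains.
    while "  " in s:
        s = s.replace("  ", " ")
    return s

print(clean_loop("TEST1" + " " * 17 + "TEST2"))  # -> TEST1 TEST2
```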
November 16, 2009 at 1:52 pm
It's nice, but in this test, it loses by a large margin (on my desktop machine):
if object_id(N'tempdb..#T') is null
create table #T (
ID int identity primary key,
Col1 varchar(100));
if object_id(N'tempdb..#T2') is not null
drop table #T2;
if object_id(N'tempdb..#T3') is not null
drop table #T3;
set nocount on;
if not exists
(select *
from #T)
insert into #T (Col1)
select 'a' + replicate(' ', abs(checksum(newid()))%98) + 'b'
from dbo.Numbers N1
cross join dbo.Numbers N2
where N1.Number between 1 and 1000
and N2.Number between 1 and 1000;
set statistics time on;
select replace(replace(replace(Col1, ' ', ' c'), 'c ', ''), 'c', '') as Col2
into #T2
from #T;
select
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(Col1, replicate(' ', 10), ' '),
replicate(' ', 9), ' '),
replicate(' ', 8), ' '),
replicate(' ', 7), ' '),
replicate(' ', 6), ' '),
replicate(' ', 5), ' '),
replicate(' ', 4), ' '),
replicate(' ', 3), ' '),
replicate(' ', 2), ' ') as Col3
into #T3
from #T;
set statistics time off;
select *
from #T2
where Col2 like '%  %';
select *
from #T3
where col3 like '%  %';
SQL Server Execution Times:
CPU time = 7859 ms, elapsed time = 7942 ms.
SQL Server Execution Times:
CPU time = 3797 ms, elapsed time = 3837 ms.
10 runs, each time, the second one came in at about half the time of the other.
Changed Col1 in #T to varchar(1000), modified the checksum modulus accordingly, and it came out at 67.304 seconds total (66.375 CPU) for the article method, and 15.815 seconds total (15.360 CPU) for the nested-replaces method.
If you are dealing with truly phenomenal numbers of duplicate characters, you can nest a replicate 100 inside the replicate 10, or a replicate 1000, and so on. I've found that, with up to hundreds or low thousands of characters, 10, 9, 8, ... 3, 2, is as far as you need to go. Up to 5 is almost always enough, but I'm into overkill.
The reason for this speed advantage is simple enough. If you have 30 spaces (or whatever duplicate character you're getting rid of) in a row, then the article's method has to replace 15 of them, then replace 14 of those, then get rid of one more character, while the nested replicates method just has to replace 3 tens and 1 three, and it's done. Far fewer CPU cycles. For 9 duplicates, the nested replicates method does one replace and is done, while the alternating characters method still has to go through all three cycles. More CPU cycles, more RAM addressing, more RAM rewrites.
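To make the comparison concrete, here is a Python sketch of both techniques (the marker character and function names are illustrative choices of mine; this is not the T-SQL under test):

```python
def alternating(s, marker="c"):
    # The article's three-pass method: mark each space, collapse
    # marker+space pairs, then strip leftover markers.
    # Works only if the marker never occurs in the input.
    return s.replace(" ", " " + marker).replace(marker + " ", "").replace(marker, "")

def nested(s):
    # Nested-replicate method: longest runs first, then 10 down to 2.
    for n in (10, 9, 8, 7, 6, 5, 4, 3, 2):
        s = s.replace(" " * n, " ")
    return s

src = "a" + " " * 30 + "b"
assert alternating(src) == "a b"
assert nested(src) == "a b"
```

Both collapse a 30-space run to one space; the nested version gets there in far fewer character substitutions, which is the CPU argument above.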
If someone else can get different results, where the alternating characters method is actually faster, I'd love to see the test.
Edit: Just realized the mod 98 should have had a +1 to actually fill up the column, but that's just to be more tidy. Doesn't actually affect the test.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
November 16, 2009 at 1:58 pm
I just got motivated and wrote two CLRs for this and tested on the same 2008 server from my above post with 100000 rows as per brigzy's example.
Beginning execution loop
Batch execution completed 100000 times.
SQL:Looping
11786
SQL:
3573
SQL:CLRLOOP
2650
SQL:CLRONCE
1873
Code for CLRLOOP
while (inData.IndexOf("  ") > -1)
{
    inData = inData.Replace("  ", " ");
}
return inData;
Code for CLRONCE:
StringBuilder result = new StringBuilder();
int pos = 0;
int nextWriteStart = -1;
int length = inData.Length;
bool lastWasWhiteSpace = true;
bool buffering = false;
while (pos < length)
{
    if (lastWasWhiteSpace && inData[pos] == ' ' && buffering)
    {
        result.Append(inData.Substring(nextWriteStart, pos - nextWriteStart));
        nextWriteStart = -1;
        buffering = false;
    }
    else if (nextWriteStart == -1 && (inData[pos] != ' ' || (!lastWasWhiteSpace && inData[pos] == ' ')))
    {
        nextWriteStart = pos;
        buffering = true;
    }
    lastWasWhiteSpace = (inData[pos] == ' ');
    pos++;
}
if (buffering)
{
    result.Append(inData.Substring(nextWriteStart, pos - nextWriteStart));
}
return result.ToString();
For the record, I purposely made it return a single space at the beginning and/or end, because the problem description really just says to remove more than one space, not leading and trailing spaces. But a trim at the return statement won't hurt anything.
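In the same spirit, the single-pass scan can be sketched in Python (my paraphrase of the logic described above — copy characters, skipping any space that immediately follows another space — not a line-for-line port of the C#):

```python
def clean_once(s):
    # One pass over the string: emit each character, but drop a space
    # that directly follows another space, collapsing each run to one.
    out = []
    last_was_space = False
    for ch in s:
        if ch == " " and last_was_space:
            continue  # inside a run of spaces; skip
        out.append(ch)
        last_was_space = (ch == " ")
    return "".join(out)

assert clean_once("a   b  c") == "a b c"
assert clean_once(" x  ") == " x "  # single leading/trailing space kept
```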
November 16, 2009 at 2:27 pm
Nadrek (11/16/2009)
I didn't see it later on, but my solution in the past has been very simple: nest REPLACE() of two spaces with one space as deep as required for the maximum number of contiguous spaces in the field, i.e. for VARCHAR(8000) I nest it 13 deep, as 2^13 = 8192. Note that the OX-style replacing, when X is multiple characters, can theoretically result in severe internal fragmentation as the strings become longer (and thus fewer rows fit per page; if the pages were already full, they have to split).
When I ran into Jeff's article, my first thought was to do a repeated replace of 32 blanks to 1 blank, followed by 16 blanks to 1 blank, etc., down to 2 blanks to 1 blank. This would be based on some maximum number of blanks I would consider reasonable.
BUT Jeff's code seems to do this even with 8000 blanks.
It's now up to the performance wizards to declare who is the winner on one million records with one column containing 8000 blanks.
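Nadrek's 13-deep nesting works because each two-spaces-to-one REPLACE halves (rounding up) the length of any run of spaces, so ceil(log2(8000)) = 13 passes cover VARCHAR(8000). A Python sketch of that reasoning (the function name is mine):

```python
import math

def collapse_nested(s, max_run=8000):
    # Nest the two-spaces-to-one replace deep enough that repeated
    # halving reduces the longest possible run to a single space.
    depth = math.ceil(math.log2(max_run)) if max_run > 1 else 1
    for _ in range(depth):
        s = s.replace("  ", " ")  # run of n spaces -> ceil(n/2) spaces
    return s

assert collapse_nested("a" + " " * 8000 + "b") == "a b"
```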
November 16, 2009 at 3:13 pm
Michael Meierruth (11/16/2009)
Nadrek (11/16/2009)
I didn't see it later on, but my solution in the past has been very simple; nest REPLACE() two spaces with one space as deep as required for the maximum number of contiguous spaces in the field, i.e. for VARCHAR(8000), I nest it 13 deep, as 2^13=8192.Note that the OX style replacing, when X is multiple characters, theoretically can result in severe internal fragmentation, as the strings become longer (and thus fewer rows per page are allows... if the pages were already full, they have to split).
When I ran into Jeff's article, my first thought was to do a recursive replace of 32 blanks to 1 blank, followed by 16 blank s to 1 blank, etc... followed by 2 blanks to 1 blank. This would be based on some maximum number of blanks I would consider reasonable.
BUT Jeff's code seems to do this even with 8000 blanks.
It's now up to the performance wizzards to declare who is the winner on one million records with one column containing 8000 blanks.
To test 1-million at 8,000, I ran the below:
if object_id(N'tempdb..#T') is null
create table #T (
ID int identity primary key,
Col1 varchar(8000));
if object_id(N'tempdb..#T2') is not null
drop table #T2;
if object_id(N'tempdb..#T3') is not null
drop table #T3;
set nocount on;
if not exists
(select *
from #T)
insert into #T (Col1)
select 'a' + replicate(' ', abs(checksum(newid()))%7999) + 'b'
from dbo.Numbers N1
cross join dbo.Numbers N2
where N1.Number between 1 and 1000
and N2.Number between 1 and 1000;
set statistics time on;
select replace(replace(replace(Col1, ' ', ' c'), 'c ', ''), 'c', '') as Col2
into #T2
from #T;
select
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(Col1, replicate(' ', 1000), ' '),
replicate(' ', 10), ' '),
replicate(' ', 9), ' '),
replicate(' ', 8), ' '),
replicate(' ', 7), ' '),
replicate(' ', 6), ' '),
replicate(' ', 5), ' '),
replicate(' ', 4), ' '),
replicate(' ', 3), ' '),
replicate(' ', 2), ' ') as Col3
into #T3
from #T;
set statistics time off;
if exists
(select *
from #T2
where Col2 like '%  %')
print 'Test 1 failed';
if exists
(select *
from #T3
where col3 like '%  %')
print 'Test 2 failed';
Only change from my prior test was adding a replicate 1000 into the nested replaces.
Results:
SQL Server Execution Times:
CPU time = 530859 ms, elapsed time = 559134 ms.
SQL Server Execution Times:
CPU time = 50828 ms, elapsed time = 134154 ms.
The Alternating Characters method took over 10 times as much CPU time, and over 4 times as much total time, as the Nested Replace method. (It's a Core2 Duo, so CPU being higher than total isn't unusual.)
Tested it with a replicate 100 between the 1000 and the 10, and that took the time to:
SQL Server Execution Times:
CPU time = 44484 ms, elapsed time = 120706 ms.
Not much of an improvement, but it is better. Can be fine-tuned more, most likely. But with the speed difference I'm seeing already, I'm not too worried about making it "perfect".
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
November 16, 2009 at 3:33 pm
GSquared: did you try hard-coded space strings instead of the pile of replicates? I would HOPE the optimizer expands those at compile time, but if it doesn't, that should shave off some CPU ticks.
newjcb: is StringBuilder the most efficient mechanism for doing this string manipulation in CLR? Last I heard (years ago) it was a pretty bloated structure.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 16, 2009 at 3:37 pm
Sorry, have been out for curry the last few hours... :hehe:
What are your system specs? Are you Express/Enterprise/Standard?
Vista, SQL Standard 2005 SP 3 (local laptop install),
Mem 2GB (1.5GB in use), Intel Core Duo
I still get the following using batch & C# CLR I posted earlier
Beginning execution loop
Batch execution completed 10000 times.
SQL:Looping
2520
SQL:
6040
SQL:CLR
440
C# Gnu
____________________________________________________
November 16, 2009 at 4:30 pm
I'm going to run out of 'stars' at this rate :-)
November 16, 2009 at 4:57 pm
Taking the ||*9*9|| technique down to C# is sweet!
This brings the C# down to just one line of code.
If the ||*9*9|| marker string were shorter, it could be even faster.
Here are the test results:
Beginning execution loop
Batch execution completed 10000 times.
SQL:Looping
2556
SQL:
6433
SQL:CLR looping
443
SQL:CLR Not looping using ||*9*9|| technique in C#
310
Here is the C#
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
[SqlFunction()]
public static string fn_CleanStringNotLooping_CLR(string CleanMeUpScottie)
{
return CleanMeUpScottie.Replace(" "," ||*9*9||")
.Replace("||*9*9|| ","").Replace("||*9*9||","");
}
[SqlFunction()]
public static string fn_CleanStringLooping_CLR(string CleanMeUpScottie)
{
while (CleanMeUpScottie.IndexOf("  ", 0) != -1)
{
CleanMeUpScottie = CleanMeUpScottie.Replace("  ", " ");
}
return CleanMeUpScottie;
}
};
SQL Batch
--PREPARE
SET NOCOUNT ON
go
CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX('  ',@FooString) > 0
SELECT @FooString = REPLACE(@FooString,'  ',' ')
RETURN @FooString
END
GO
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 10000 ROWS WITH RANDOM SPACES INTO EACH TABLE; SEPARATE TABLES ARE USED TO AVOID CACHING; THIS MAY TAKE QUITE A WHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 10000
--select * FROM #TEMP1
--select * FROM #TEMP2
--SELECTS
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)
FROM #TEMP2
PRINT 'SQL:Looping'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
go
DECLARE @TheTime DATETIME
SELECT @TheTime= GETDATE()
UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
FROM #TEMP2
PRINT 'SQL:'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
GO
DECLARE @TheTime DATETIME
SET @TheTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanStringLooping_CLR(COL2)
FROM #TEMP2
PRINT 'SQL:CLR looping '
PRINT DATEDIFF(ms,@TheTime,GETDATE())
go
DECLARE @TheTime DATETIME
SET @TheTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_CleanStringNotLooping_CLR(COL2)
FROM #TEMP2
PRINT 'SQL:CLR Not looping using ||*9*9|| technique in C#'
PRINT DATEDIFF(ms,@TheTime,GETDATE())
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp
DROP TABLE #TEMP1
DROP TABLE #TEMP2
go
C++ anyone?
C# Gnu
____________________________________________________
November 16, 2009 at 5:10 pm
brigzy (11/16/2009)
Taking the ||*9*9|| technique down to C# is sweet!
It seems pretty obvious that a .NET solution will usually be faster. What I liked about this article was the encouragement to think set-based and apply some old-school creative thinking to an old problem. No doubt a thorough treatment of which solution is 'best' (for some well-defined value of 'best') would make an interesting article.
Paul
November 16, 2009 at 10:44 pm
Paul White (11/16/2009)
brigzy (11/16/2009)
Taking the ||*9*9|| technique down to C# is sweet!
It seems pretty obvious that a .NET solution will usually be faster.
Yes, we did clarify that it was expected/obvious:
We might have expected c# to have the day, but the Query/Function result is surprising.
I think it is quite reasonable, and hopefully of general interest, to discuss the fastest C# technique on the subject. Seeing how simple the C# code is might also encourage pure T-SQL folk to venture into C#.
C# Gnu
____________________________________________________
November 17, 2009 at 12:47 am
Jeff: Neat. Whether it's the best on SQL Server or not is only partly the point (for me). I just like the different approach to the problem: don't attack it directly; transform it into a smaller set of problems that _do_ lend themselves to a direct solution.
To the CLR freaks: The OP stated it had to be SQL 2000.
November 17, 2009 at 1:13 am
SDM (11/17/2009)
To the people who can program in a .NET language and recognise when to use it appropriately to complement T-SQL:
Fixed that for ya! :-)
SDM (11/17/2009)
The OP stated it had to be SQL 2000.
OP? It's an article! And where does it specify 2000? Read it again carefully :-)
November 17, 2009 at 1:26 am
brigzy (11/16/2009)
We might have expected c# to have the day, but the Query/Function result is surprising.
Not really, no. Using the string "||*9*9||" as a replacement has a huge hidden cost, enough to make the normally less-efficient WHILE-loop-in-a-function approach seem faster. (Let us leave aside for a second the fact that the article stipulates that the string to be processed should not be made longer at any point.)
If we use "þ" as the replacement character, we don't change the overall length of the string. This allows SQL Server to do an in-place substitution. When using the long string, the source string has to be split at the replacement point, the "||*9*9||" inserted, and then the remainder of the original string tacked onto the end. The point is that there is an awful lot of string manipulation to be done, along with some allocations and copies, which just aren't needed with the "þ" method.
Using the single-character method, a quick test using your code produces a run time of around 4.9s for the function, and 3.2s for the article's method.
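For illustration, here is a Python sketch of what I take the single-character variant to be (marker 'þ' as above, the mark step swapping two spaces for space+marker so the length never grows; it assumes the marker never occurs in the data, and the function name is mine):

```python
def clean_inplace(s, marker="\u00fe"):
    # Pair-of-spaces -> space+marker keeps the string the same length,
    # so the engine can substitute without growing the buffer.
    assert marker not in s         # marker must not occur in the data
    s = s.replace("  ", " " + marker)  # mark: '  ' -> ' þ' (same length)
    s = s.replace(marker + " ", "")    # drop marker+space pairs
    return s.replace(marker, "")       # drop leftover markers

assert clean_inplace("a" + " " * 30 + "b") == "a b"
assert clean_inplace("a     b") == "a b"  # odd-length runs collapse too
```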
While I'm here, I'd just like to mention that the nested-replace method mentioned by Joe Celko and illustrated by G2 is very fast. Probably not as fast as a similarly constructed .NET method, but good nonetheless. The problem with it is complexity and verbosity, at least to my mind. Simply put, it's just not elegant. I like Jeff's method for the reasons mentioned previously.
Paul
Viewing 15 posts - 61 through 75 (of 425 total)