November 19, 2009 at 4:43 am
Paul White (11/19/2009)
Just to draw breath and summarize a bit:The fastest CLR solutions appear to be almost 20x faster than the fastest T-SQL implementation...
How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?
November 19, 2009 at 4:48 am
Michael Meierruth (11/19/2009)
Paul White (11/19/2009)
Just to draw breath and summarize a bit:The fastest CLR solutions appear to be almost 20x faster than the fastest T-SQL implementation...
How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?
Michael is right of course ... but I want I want to compare fastest SQL spaces & Tabs !
C# Gnu
____________________________________________________
November 19, 2009 at 5:35 am
Michael Meierruth (11/19/2009)
How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?
Hmmm - I missed yours when I scanned down the results list - sorry about that! :blush:
Ok, three times faster.
I must go back and try your code for myself!
November 19, 2009 at 6:06 am
Interesting. All the T-SQL methods (including Michael's) seem to suffer from the collation problem I described before. Running a 100K row test (using the script I posted before), if I use a binary collation like LATIN1_GENERAL_BIN for the Data column in the test table, I get this:
Fastest CLR: 563 ms
Michael's T-SQL: 2,172 ms
Using my default collation of LATIN1_GENERAL_CI_AS (this is my server, test database, and tempdb collation):
Fastest CLR: 563 ms (yes, exactly the same)
Michaels T-SQL: 38,859 ms
To be fair, I tried the function-wrapped version, a function-wrapped version WITH SCHEMABINDING, and in-lining Michael's code (the in-lined version was fastest, and follows):
SELECT
@Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)
replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data)),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' ')
FROM
#TestCleanSpaces;
Michael's routine went back to 2,187 ms if I added COLLATE LATIN1_GENERAL_BIN after the Data column name inside all those replaces. Clearly, something very funky is going on here. So, for the best performance using the T-SQL routines, it seems necessary to manually collate the input data to a binary collation...?!
Fast:
replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data COLLATE LATIN1_GENERAL_BIN))
Slow:
replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data COLLATE LATIN1_GENERAL_CI_AS))
I feel a Connect item coming on - this is a new one on me.
Paul
November 19, 2009 at 6:58 am
A quick stand-alone repro for the collation thing:
USE tempdb;
GO
CREATE TABLE dbo.SpaceTest
(
Data VARCHAR(4000) NOT NULL
);
GO
INSERT dbo.SpaceTest
SELECT TOP (10000)
SPACE(4000)
FROM master.sys.allocation_units A1,
master.sys.allocation_units A2,
master.sys.allocation_units A3,
master.sys.allocation_units A4;
GO
DECLARE
@Bitbucket VARCHAR(4000);
SET STATISTICS TIME ON
-- SLOW
SELECT
@Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Data COLLATE LATIN1_GENERAL_CI_AS,
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' ')
FROM
dbo.SpaceTest;
-- FAST
SELECT
@Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Data COLLATE LATIN1_GENERAL_BIN,
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' ')
FROM
dbo.SpaceTest;
SET STATISTICS TIME OFF;
DROP TABLE dbo.SpaceTest;
The other odd thing is that if I change the data type in the table from VARCHAR(4000) to CHAR(4000), the problem goes away. I'm convincing myself that this is either an optimizer bug, or some effect of the complex collation precedence rules (coercible-default and all that).
Paul
November 19, 2009 at 7:20 am
Paul, I seem to recall other topics from the years whereby BIN collation was the way to go for efficiency reasons if that was your most important criteria.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 19, 2009 at 7:44 am
Hi Paul
Would the script pick up collation sequence from master.sys.allocation_units ?
Just wondering if "sp_help SpaceTest" shows a bit collation..?
Why don't we standardise and all use the same collation sequence, life would be easier:-)
Screw
C# Gnu
____________________________________________________
November 19, 2009 at 8:42 am
Hi
I have added tab support to Michael's code by initially replacing tab with a space, which might not be best aproach.
Anyhow here is the code and the leader board:
CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV(@S VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
RETURN replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(replace(@s,CHAR(9),' '))),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' ')
END
It has however reduced performance, maybe there is a better way to achieve this in SQL.
SQL function: Replace Technique : but using replacement chars ||*9*9||
2785
SQL function :Brigzy (C#Screw) Looping
2430
SQL function: Jeff Original : single bell char
1958
SQL function: Michael Meierruth : with hard coded spaces and TAB support
484
CLR: Brigzy (C#Screw) Looping
393
CLR: Not looping using ||*9*9|| technique in C#
383
SQL function: Michael Meierruth : with hard coded spaces
324
SQL function: Michael Meierruth Original
314
SQL function: Michael Meierruth III - extra replace
312
CLR: using Jeff's single char technique in C#
282
CLR: C#Screw II - Spaces and TABS
229
CLR: Flo - Spaces and TABS - v.compact
175
CLR: Michael Meierruth - C#Screw conversion: hard coded spaces
171
CLR: Michael Meierruth - C#Screw conversion: with extra Replace
171
CLR :using JCBnew space replacer C#
167
CLR: using Flo s space replacer C#
140
CLR: using Paul White space replacer C#
114
CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS
112
CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232
112
CLR: Paul White's function with Pointers by Mark-101232
104
C# Gnu
____________________________________________________
November 19, 2009 at 9:56 am
Hang on a minute !!! :w00t:
I realised that doing all replacements / loops whatever in the real world could well be a waste of time.
This is because it is surely more lilely that the data doesn't include any double spaces! which might be true of say 95% of your data?
So we only want to go to each level of replace if we really neeed to ... so recursion seems good:
CREATE FUNCTION dbo.fn_CleanUp_Recursion_Brigzy(@Data varchar(max))
RETURNS VarChar(Max)
AS
BEGIN
IF CHARINDEX(' ',@Data) > 0
BEGIN
SET @data = REPLACE(@Data,' ',' ')
IF CHARINDEX(' ',@Data) > 0
SELECT @data = dbo.fn_CleanUp_Recursion_Brigzy(@Data)
END
RETURN @data
END
Unfortuanately our test data is FULL of double spaces so this doesn't appear high in the leader
board::sick:
SQL function: Jeff Original : single bell char
5279
SQL function: C#Screw : Recursive
2636
SQL function: Replace Technique : but using replacement chars ||*9*9||
2611
SQL function :Brigzy (C#Screw) Looping
2514
SQL function: Michael Meierruth : with hard coded spaces and TAB support
391
SQL function: Michael Meierruth Original
318
CLR: Brigzy (C#Screw) Looping
313
SQL function: Michael Meierruth III - extra replace
306
SQL function: Michael Meierruth : with hard coded spaces
305
CLR: Not looping using ||*9*9|| technique in C#
280
CLR: using Jeff's single char technique in C#
236
CLR: C#Screw II - Spaces and TABS
189
CLR: Flo - Spaces and TABS - v.compact
171
CLR :using JCBnew space replacer C#
167
CLR: Michael Meierruth - C#Screw conversion: with extra Replace
156
CLR: Michael Meierruth - C#Screw conversion: hard coded spaces
150
CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS
116
CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232
112
CLR: using Flo s space replacer C#
108
CLR: using Paul White space replacer C#
106
CLR: Paul White's function with Pointers by Mark-101232
102
However if test data contained allot less double spaces this could look very different?
:discuss:
What do you think?
Edit: Anyone got some real data to try it on?
C# Gnu
____________________________________________________
November 19, 2009 at 12:31 pm
Hi folks,
Well this is the 'last post' from me but this is what things look like if there are no double spaces in the data at all:
SQL function: Michael Meierruth : with hard coded spaces and TAB support
863
SQL function: Replace Technique : but using replacement chars ||*9*9||
858
SQL function: C#Screw : Recursive
837
SQL function: Michael Meierruth III - extra replace
728
SQL function: Michael Meierruth Original
716
SQL function: Michael Meierruth : with hard coded spaces
660
SQL function :Brigzy (C#Screw) Looping
503
SQL function: Jeff Original : single bell char
444
CLR: Michael Meierruth - C#Screw conversion: with extra Replace
317
CLR: Michael Meierruth - C#Screw conversion: hard coded spaces
307
CLR: using Jeff's single char technique in C#
298
CLR: Not looping using ||*9*9|| technique in C#
268
CLR: C#Screw II - Spaces and TABS
256
CLR: Brigzy (C#Screw) Looping
246
CLR: Flo - Spaces and TABS - v.compact
232
CLR :using JCBnew space replacer C#
230
CLR: using Paul White space replacer C#
224
CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232
221
CLR: Paul White's function with Pointers by Mark-101232
214
CLR: using Flo s space replacer C#
210
CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS
205
My own conclusions
-----------------------------------
I wouldn't like to explain some of the results above entirely - but it is nice to see Jeff has the fastest SQL, and interesting to see SQL looping just behind.
The SQL looping does very well when there are no double spaces as it drops out on the first line of code. How Jeff's SQL can be faster than that I cannot explain.
CLR should be at most least 2x faster; so only but considerably faster if there really is a hefty stack of looping/string work to do.
So there we are folks, almost full circle 🙂
C# Gnu
____________________________________________________
November 19, 2009 at 1:31 pm
Great stuff 'C# screw' - awesome work on this thread 😎
November 19, 2009 at 1:40 pm
TheSQLGuru (11/19/2009)
Paul, I seem to recall other topics from the years whereby BIN collation was the way to go for efficiency reasons if that was your most important criteria.
Oh absolutely - though I also seem to recall that the SQL collations are generally more efficient than the Windows collations on non-Unicode data - I think that's right.
BIN surely makes a difference; in fact when using CHAR (to avoid the problem in the first place) the BIN collation is 5-15% faster than my default collation. I must admit I am less than clear how sort order affects nested REPLACEs, but at least REPLACE is documented as being collation-aware so there is some sense to it.
The real issue though, is what the engine is doing when using VARCHAR with a non-BIN collation. Recall that with a BIN collation, Michael's method took 2 seconds, but with LATIN1_GENERAL_CI_AS the execution time blew out to 40 seconds! There is only a very small difference in the XML show plan (an extra non-implicit convert to VARCHAR(4000) directly on the column) but it is hard to see why that should make so much difference.
It's almost as if there are optimizations which allow in-place string replacement which are defeated when using VARCHAR with collation conversions. I can only imagine that the strings are being copied in memory for each row (maybe once per REPLACE) rather than being modified in place. The sheer number of allocations and copies might go some way to explain the difference.
I will take another quick look today, but unless something obvious turns up, or someone else out there has a brainwave, this is heading for Connect for sure.
Paul
November 19, 2009 at 1:46 pm
I am not sure that blowing out a varchar(4000) conversion for lord knows what is a 'minor plan difference'!! 🙂
I bet the memory crap is happening for every iteration of the REPLACE (perhaps with the vc4K thing thrown in for good measure). That really could be some significant CPU/memory hits.
Speaking of which, can the main tester for this give a report back on CPU utilization for the test runs? And if you feel frisky, check wait stats deltas for each run. I am wondering if we might not see something show up there indicating memory and/or cpu pressure.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 19, 2009 at 1:54 pm
TheSQLGuru (11/19/2009)
I am not sure that blowing out a varchar(4000) conversion for lord knows what is a 'minor plan difference'!! 🙂
Ha, yeah. No, I just meant that the XML show plan is visually very, very similar (the graphical plan appears identitical!) there's just one small extra node there...
Just so everyone knows, my timing results have always been CPU-only. I will admit to being lazy and using SET STATISTICS TIME instead of total_worker_time / execution_count from sys.dm_exec_query_stats or Profiler which I normally prefer.
Thanks for the other thoughts too!
Paul
November 19, 2009 at 3:05 pm
Paul White (11/19/2009)
I must admit I am less than clear how sort order affects nested REPLACEs, but at least REPLACE is documented as being collation-aware so there is some sense to it.
Of course - collation is important when comparing the string to replace. D'oh.
Viewing 15 posts - 166 through 180 (of 425 total)
You must be logged in to reply to this topic. Login to reply