November 18, 2009 at 7:18 am
Thought I'd tell you, you guys are just fun to watch. 😉
---------------------------------------------------------
How best to post your question[/url]
How to post performance problems[/url]
Tally Table:What it is and how it replaces a loop[/url]
"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."
November 18, 2009 at 8:01 am
Jeff Moden (11/17/2009)
You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.
Not likely, perhaps... but very common to have about that many if someone started out with a CHAR(8000), even worse if they did concatenation on a CHAR(8000)+ another VARCHAR or CHAR + LEFT() or RIGHT() to get back to 8000. Some forms of data import also lead to excessive space padding. Are they good SQL? No. Do they happen? I believe they do.
Also, yes, GSquared and I do apparently use some similar methods.
November 18, 2009 at 8:11 am
Nadrek (11/18/2009)
Jeff Moden (11/17/2009)
You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.
Not likely, perhaps... but very common to have about that many if someone started out with a CHAR(8000), even worse if they did concatenation on a CHAR(8000)+ another VARCHAR or CHAR + LEFT() or RIGHT() to get back to 8000. Some forms of data import also lead to excessive space padding. Are they good SQL? No. Do they happen? I believe they do.
Also, yes, GSquared and I do apparently use some similar methods.
I actually discovered this method almost 20 years ago, when I was building WordPerfect macros to clean up documents that had been typed up in an ASCII text editor. Used it to clean up spaces, underlines, even page-lines created by entering arbitrary numbers of hyphens. Came in very, very handy.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
November 18, 2009 at 8:15 am
Ok folks,
1/everyone's SQL is now wrapped in functions
2/modifed SQL batch to capture results to table
3/added Michael Meierruth version with hard coded spaces
4/converted Michael's version to C# CLR
Here are the results:
(you will like this Paul - some variation on execution)
Beginning execution loop
Batch execution completed 10000 times.
SQL function: Replace Technique : but using replacement chars ||*9*9||
2776
SQL function :Brigzy (C#Screw) Looping
2596
SQL function: Jeff Original : single bell char
1996
CLR: Brigzy (C#Screw) Looping
453
SQL function: Michael Meierruth Original
333
SQL function: Michael Meierruth : with hard coded spaces
316
CLR: Not looping using ||*9*9|| technique in C#
290
CLR: using Jeff's single char technique in C#
250
CLR :using JCBnew space replacer C#
166
CLR: Michael Meierruth - C#Screw conversion: hard coded spaces
160
CLR: using Flo s space replacer C#
116
CLR: using Paul White space replacer C#
113
Note that Jeffs SQL has moved up now it is an SQL function
C# Gnu
____________________________________________________
November 18, 2009 at 9:53 am
For interest I improved test:
1/ ramped up workload from 10k rows to 50k rows,
2/ each function tested 10 times in succession, average the result:
Average exec times (50krows):
SQL function: Replace Technique : but using replacement chars ||*9*9||
13490
SQL function :Brigzy (C#Screw) Looping
12337
SQL function: Jeff Original : single bell char
9340
CLR: Brigzy (C#Screw) Looping
1641
SQL function: Michael Meierruth : with hard coded spaces
1562
SQL function: Michael Meierruth Original
1559
CLR: Not looping using ||*9*9|| technique in C#
1449
CLR: using Jeff's single char technique in C#
1187
CLR :using JCBnew space replacer C#
818
CLR: Michael Meierruth - C#Screw conversion: hard coded spaces
753
CLR: using Flo s space replacer C#
556
CLR: using Paul White space replacer C#
532
SQL F.Y.I.
--PREPARE
SET NOCOUNT ON
go
CREATE FUNCTION dbo.fn_CleanUp_Brigzy(@S VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
WHILE CHARINDEX(' ',@S) > 0
SELECT @s-2 = REPLACE(@S,' ',' ')
RETURN @s-2
END
go
CREATE FUNCTION dbo.fn_CleanUp_JeffOriginal(@S VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
RETURN REPLACE(
REPLACE(
REPLACE(
LTRIM(RTRIM(@S)) ,' ',' '+CHAR(7)) --Changes 2 spaces to the OX model
,CHAR(7)+' ','') --Changes the XO model to nothing
,CHAR(7),'') --Changes the remaining X's to nothing
END
go
CREATE FUNCTION dbo.fn_CleanUp_MultiChars(@S VARCHAR(max))
RETURNS VARCHAR(max)
BEGIN
RETURN LTRIM(RTRIM(
REPLACE(REPLACE(REPLACE(@S,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')
))
END
GO
CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKII(@S VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
RETURN replace(replace(replace(replace(replace(replace(ltrim(rtrim(@s)),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' ')
END
go
CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_Original(@S VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
--declare @s-2 varchar(8000)
--set @s-2 = '*' + replicate(' ',7998) + '*'
--select len(@s)
set @s-2 = replace(replace(replace(replace(replace(replace(ltrim(rtrim(@s)),
replicate(' ',32+1),' '),replicate(' ',16+1),' '),replicate(' ',8+1),' '),replicate(' ',4+1),' '),replicate(' ',2+1),' '),replicate(' ',1+1),' ')
--select len(@s)
RETURN @s-2
END
go
CREATE TABLE #TEMP1 (COL1 VARCHAR(900))
CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))
go
--INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE
DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT
SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)
INSERT INTO #TEMP1 (COL1)
OUTPUT inserted.COL1 INTO #TEMP2 (COL2)
SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'
GO 50000
-- Test result table
CREATE TABLE #tResults(Tag VARCHAR(100),Result int)
go
-- Test 1 : Jeff technique : multi chars for replace char
DECLARE @StartTime DATETIME
SELECT @StartTime= GETDATE()
UPDATE #TEMP2 SET COL3 = dbo.fn_CleanUp_MultiChars(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES ('SQL function: Replace Technique : but using replacement chars ||*9*9||',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SELECT @StartTime= GETDATE()
UPDATE #TEMP2 SET COL3 = dbo.fn_CleanUp_JeffOriginal(Col2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES('SQL function: Jeff Original : single bell char',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SELECT @StartTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp_Brigzy(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES('SQL function :Brigzy (C#Screw) Looping',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SELECT @StartTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp_MichaelMeierruth_Original(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES('SQL function: Michael Meierruth Original',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SELECT @StartTime= GETDATE()
UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp_MichaelMeierruth_Original(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES('SQL function: Michael Meierruth : with hard coded spaces',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SET @StartTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_Looping_CLR(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES('CLR: Brigzy (C#Screw) Looping',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SET @StartTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_NotLooping_CLR(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES('CLR: Not looping using ||*9*9|| technique in C#',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SET @StartTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_SingleChar_CLR(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES ('CLR: using Jeff''s single char technique in C#',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SET @StartTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_JCB_CLR(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES('CLR :using JCBnew space replacer C#',DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SET @StartTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_PW_CLR(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES('CLR: using Paul White space replacer C#' ,DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SET @StartTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_FLO_CLR(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES ('CLR: using Flo s space replacer C#' ,DATEDIFF(ms,@StartTime,GETDATE()))
go 10
DECLARE @StartTime DATETIME
SET @StartTime=GETDATE()
UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_MMSharpScrew_CLR(COL2)
FROM #TEMP2
INSERT INTO #tResults(Tag,Result)VALUES ('CLR: Michael Meierruth - C#Screw conversion: hard coded spaces' ,DATEDIFF(ms,@StartTime,GETDATE()))
go 10
SELECT Tag + CHAR(13) + CAST(AVG(Result) AS VARCHAR(20)) FROM #tResults GROUP BY
Tag ORDER BY SUM(Result) DESC
--CLEANUP
DROP FUNCTION dbo.fn_CleanUp_Brigzy
DROP FUNCTION fn_CleanUp_JeffOriginal
DROP FUNCTION fn_CleanUp_MultiChars
DROP FUNCTION fn_CleanUp_MichaelMeierruth_Original
DROP FUNCTION fn_CleanUp_MichaelMeierruth_MKII
DROP TABLE #TEMP1
DROP TABLE #TEMP2
DROP TABLE #tResults
GO
--SET STATISTICS TIME OFF
C# Gnu
____________________________________________________
November 18, 2009 at 10:17 am
Excellent work so far, C# Screw. In particular, multiple runs with averages should allow much more consistent results, as later runs should end up with more consistent cache behavior.
If you're interested in spending more time on it, I'd suggest test set expansion in two dimensions, and additional tuning/choice information:
Add a test set for strings of size 30-60 (common sizes for names and addresses), and of size 6000-8000 (common "long string" sizes).
And for result sets of more rows - hundreds of thousands and millions (which will illustrate how each technique scales).
For additional tuning/choice information, from SQL Server Profiler, I always like to watch Reads, Writes, CPU, and Duration (usually, but not always, in that order). Reads and Writes in particular would let us get a view for how disk IO intensive each technique is.
As before, excellent work running the benchmarks over and over, adding more of them each time.
P.S. If you'd like to include the nested 2:1 (two spaces to one space) conversions at a depth of the Log base 2 of the length, I'd be very interested to see how that compares against the similar approaches (i.e. for length 30, 5 deep, for 60, 6 deep, for 900, 10 deep, for 8000, 13 deep).
P.P.S. For truly interesting results, comparing the functions against a data set where the meaningful data is dense and the maximum number of contiguous spaces is 3 (i.e. normal sentences with some typos), and against another data set that's almost all spaces with a few non-space words (i.e. an inefficient space padding data source) might also show some differences (many small space groups vs a couple large space groups).
November 18, 2009 at 10:24 am
I will have to pass and offer your suggestions to someone with a realistic networked SQL server ... SQL script is above if u fancy a go - my poor Laptop strains with 50k rows ..:-)
C# Gnu
____________________________________________________
November 18, 2009 at 12:13 pm
Hello Jeff,
Thank you so much for posting this solution! We will apply your method in a SSIS Derived Column Transformation on data coming from raw files. This data needs to be cleansed before it even reaches the SQL Server staging tables, so we can not use CLR functions or SQL functions.
I will let you know how it performs, I'll try to use the horizontal tab (ASCII code 9) as the special character because this character will be replaced by a single space in this same transformation step. I'm curious about the performance in this context, but I must admit that your solution is much more elegant than my initial version with the 65, 33, 17, 9, 5, 3 an 2 spaces nested REPLACE construction. Your method can be applied on very wide columns without a limit to the number of contiguous spaces, and all it needs is one special character.
We can use the same expression for every string column that needs to be cleansed and we only need to replace the column name once. This is an aspect of your method that comes in handy in real life situations where you need to cleans more than a few columns of a few tables.
Keep up the good work!
Dony.
November 18, 2009 at 12:22 pm
Hey everyone, just a quick note before I go to work this morning. I love the way this thread it going - huge thanks especially to C# Screw (lol!) and Flo for tidying my (still slightly amateur) C# as usual! I have a fairly busy morning ahead but will look in this afternoon - no doubt by then someone will have found an improvement which runs in negative time :laugh:
I hope Jeff pops by soon because I'm really interested in the huge performance difference (10x) I found working with the different collations with his cool method.
Paul
November 18, 2009 at 12:39 pm
Good morning Paul!
Paul White (11/18/2009)
... my (still slightly amateur) C# as usual!
Well, now I have to say this to you:
I don't know how much you are working with C#, but you are definitely not an amateur in .NET. Your solution really fine appears to be one of the two fastest solutions 🙂
Remember
(To all the others: This is an insider to joke to Paul)
Greets
Flo
November 18, 2009 at 12:44 pm
jcrawf02 (11/18/2009)
Thought I'd tell you, you guys are just fun to watch. 😉
And provide significant community benefit too!!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 18, 2009 at 12:45 pm
C# Screw (11/18/2009)
Jeff,as promised : all tests combined SQL & CLR
Beginning execution loop
Batch execution completed 10000 times.
SQL:Looping
2536
SQL: using ||*9*9||
6310
SQL: Jeffs single char
2916
CLR: looping
453
CLR: Not looping using ||*9*9|| technique in C#
296
CLR: Not looping using Jeff's single char technique in C#
416 <-- consistently c# performs slower single space bell character, surprising?
Correction : Noticed one space missing in c#
CLR: Not looping using Jeff's single char technique in C#
246
I am really sorry the c# input has caused some upset, I thought it was really useful/interesting.. so 10/10 article & productive discussion.
Oh no... don't be sorry at all about that. I just wanted to be sure what we were testing against. Thank you very much for taking that time. And thanks for the 10-by marks.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 18, 2009 at 12:47 pm
GSquared (11/18/2009)
Nadrek (11/18/2009)
Jeff Moden (11/17/2009)
You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.
Not likely, perhaps... but very common to have about that many if someone started out with a CHAR(8000), even worse if they did concatenation on a CHAR(8000)+ another VARCHAR or CHAR + LEFT() or RIGHT() to get back to 8000. Some forms of data import also lead to excessive space padding. Are they good SQL? No. Do they happen? I believe they do.
Also, yes, GSquared and I do apparently use some similar methods.
I actually discovered this method almost 20 years ago, when I was building WordPerfect macros to clean up documents that had been typed up in an ASCII text editor. Used it to clean up spaces, underlines, even page-lines created by entering arbitrary numbers of hyphens. Came in very, very handy.
I'll be damned - someone else who did macro-based document management using WordPerfect macros! How kewl is that!? I had 2.5M of macros that did a full-fledged work package management system that included change bar and version tracking, etc. Tied into a dbase IV database system too! 🙂
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 18, 2009 at 12:55 pm
Michael Meierruth (11/18/2009)
Earlier on in this post I mentioned the use of a binary approach. I finally found some time to try this out.
declare @s-2 varchar(8000)
set @s-2 = '*' + replicate(' ',7998) + '*'
select len(@s)
set @s-2 = replace(replace(replace(replace(replace(replace(ltrim(rtrim(@s)),
replicate(' ',32+1),' '),replicate(' ',16+1),' '),replicate(' ',8+1),' '),replicate(' ',4+1),' '),replicate(' ',2+1),' '),replicate(' ',1+1),' ')
select len(@s)
select @s-2
Needless to say I have tried this with Jeff's data as well as with some random data and it works just fine.
As for performance, I generated 100000 records and Jeff's code did in 110 seconds whereas my code did in 16 seconds.
I shaved off a second by replacing the calls to replicate (which I used for clarity) with hardcoded constants or variables.
Oh yes, Jeff's code ran at 100% CPU whereas my code never reached 100%.
There is a recursive effect going on which I'm still trying to understand from a methematical point of view, i.e. for a given N+1 spaces what is the maximum length of blanks that can be reduce to 1 blank using the above approach.
Of course, if someone can find a string with blanks where this doesn't work I will quietly shut up.
Whoa! Very cool... I'm going to have to try that one for sure. Thanks, Michael!
--Jeff Moden
Change is inevitable... Change for the better is not.
November 18, 2009 at 1:12 pm
TheSQLGuru (11/18/2009)
GSquared (11/18/2009)
Nadrek (11/18/2009)
Jeff Moden (11/17/2009)
You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.
Not likely, perhaps... but very common to have about that many if someone started out with a CHAR(8000), even worse if they did concatenation on a CHAR(8000)+ another VARCHAR or CHAR + LEFT() or RIGHT() to get back to 8000. Some forms of data import also lead to excessive space padding. Are they good SQL? No. Do they happen? I believe they do.
Also, yes, GSquared and I do apparently use some similar methods.
I actually discovered this method almost 20 years ago, when I was building WordPerfect macros to clean up documents that had been typed up in an ASCII text editor. Used it to clean up spaces, underlines, even page-lines created by entering arbitrary numbers of hyphens. Came in very, very handy.
I'll be damned - someone else who did macro-based document management using WordPerfect macros! How kewl is that!? I had 2.5M of macros that did a full-fledged work package management system that included change bar and version tracking, etc. Tied into a dbase IV database system too! 🙂
Pipe down, us younguns can't hear the discussion over the creaking of your old bones...:hehe:
---------------------------------------------------------
How best to post your question[/url]
How to post performance problems[/url]
Tally Table:What it is and how it replaces a loop[/url]
"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."
Viewing 15 posts - 121 through 135 (of 425 total)
You must be logged in to reply to this topic. Login to reply