REPLACE Multiple Spaces with One

  • Thought I'd tell you, you guys are just fun to watch. 😉

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • Jeff Moden (11/17/2009)

    You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.

    Not likely, perhaps... but very common to have about that many if someone started out with a CHAR(8000), even worse if they did concatenation on a CHAR(8000)+ another VARCHAR or CHAR + LEFT() or RIGHT() to get back to 8000. Some forms of data import also lead to excessive space padding. Are they good SQL? No. Do they happen? I believe they do.

    Also, yes, GSquared and I do apparently use some similar methods.

  • Nadrek (11/18/2009)


    Jeff Moden (11/17/2009)

    You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.

    Not likely, perhaps... but very common to have about that many if someone started out with a CHAR(8000), even worse if they did concatenation on a CHAR(8000)+ another VARCHAR or CHAR + LEFT() or RIGHT() to get back to 8000. Some forms of data import also lead to excessive space padding. Are they good SQL? No. Do they happen? I believe they do.

    Also, yes, GSquared and I do apparently use some similar methods.

    I actually discovered this method almost 20 years ago, when I was building WordPerfect macros to clean up documents that had been typed up in an ASCII text editor. Used it to clean up spaces, underlines, even page-lines created by entering arbitrary numbers of hyphens. Came in very, very handy.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Ok folks,

    1/everyone's SQL is now wrapped in functions

    2/modifed SQL batch to capture results to table

    3/added Michael Meierruth version with hard coded spaces

    4/converted Michael's version to C# CLR

    Here are the results:

    (you will like this Paul - some variation on execution)

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL function: Replace Technique : but using replacement chars ||*9*9||

    2776

    SQL function :Brigzy (C#Screw) Looping

    2596

    SQL function: Jeff Original : single bell char

    1996

    CLR: Brigzy (C#Screw) Looping

    453

    SQL function: Michael Meierruth Original

    333

    SQL function: Michael Meierruth : with hard coded spaces

    316

    CLR: Not looping using ||*9*9|| technique in C#

    290

    CLR: using Jeff's single char technique in C#

    250

    CLR :using JCBnew space replacer C#

    166

    CLR: Michael Meierruth - C#Screw conversion: hard coded spaces

    160

    CLR: using Flo s space replacer C#

    116

    CLR: using Paul White space replacer C#

    113

    Note that Jeffs SQL has moved up now it is an SQL function

  • For interest I improved test:

    1/ ramped up workload from 10k rows to 50k rows,

    2/ each function tested 10 times in succession, average the result:

    Average exec times (50krows):

    SQL function: Replace Technique : but using replacement chars ||*9*9||

    13490

    SQL function :Brigzy (C#Screw) Looping

    12337

    SQL function: Jeff Original : single bell char

    9340

    CLR: Brigzy (C#Screw) Looping

    1641

    SQL function: Michael Meierruth : with hard coded spaces

    1562

    SQL function: Michael Meierruth Original

    1559

    CLR: Not looping using ||*9*9|| technique in C#

    1449

    CLR: using Jeff's single char technique in C#

    1187

    CLR :using JCBnew space replacer C#

    818

    CLR: Michael Meierruth - C#Screw conversion: hard coded spaces

    753

    CLR: using Flo s space replacer C#

    556

    CLR: using Paul White space replacer C#

    532

    SQL F.Y.I.

    --PREPARE

    SET NOCOUNT ON

    go

    CREATE FUNCTION dbo.fn_CleanUp_Brigzy(@S VARCHAR(max))

    RETURNS VARCHAR(max)

    BEGIN

    WHILE CHARINDEX(' ',@S) > 0

    SELECT @s-2 = REPLACE(@S,' ',' ')

    RETURN @s-2

    END

    go

    CREATE FUNCTION dbo.fn_CleanUp_JeffOriginal(@S VARCHAR(max))

    RETURNS VARCHAR(max)

    BEGIN

    RETURN REPLACE(

    REPLACE(

    REPLACE(

    LTRIM(RTRIM(@S)) ,' ',' '+CHAR(7)) --Changes 2 spaces to the OX model

    ,CHAR(7)+' ','') --Changes the XO model to nothing

    ,CHAR(7),'') --Changes the remaining X's to nothing

    END

    go

    CREATE FUNCTION dbo.fn_CleanUp_MultiChars(@S VARCHAR(max))

    RETURNS VARCHAR(max)

    BEGIN

    RETURN LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(@S,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    END

    GO

    CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKII(@S VARCHAR(8000))

    RETURNS VARCHAR(8000)

    BEGIN

    RETURN replace(replace(replace(replace(replace(replace(ltrim(rtrim(@s)),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    END

    go

    CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_Original(@S VARCHAR(8000))

    RETURNS VARCHAR(8000)

    BEGIN

    --declare @s-2 varchar(8000)

    --set @s-2 = '*' + replicate(' ',7998) + '*'

    --select len(@s)

    set @s-2 = replace(replace(replace(replace(replace(replace(ltrim(rtrim(@s)),

    replicate(' ',32+1),' '),replicate(' ',16+1),' '),replicate(' ',8+1),' '),replicate(' ',4+1),' '),replicate(' ',2+1),' '),replicate(' ',1+1),' ')

    --select len(@s)

    RETURN @s-2

    END

    go

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 50000

    -- Test result table

    CREATE TABLE #tResults(Tag VARCHAR(100),Result int)

    go

    -- Test 1 : Jeff technique : multi chars for replace char

    DECLARE @StartTime DATETIME

    SELECT @StartTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = dbo.fn_CleanUp_MultiChars(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES ('SQL function: Replace Technique : but using replacement chars ||*9*9||',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SELECT @StartTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = dbo.fn_CleanUp_JeffOriginal(Col2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES('SQL function: Jeff Original : single bell char',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SELECT @StartTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp_Brigzy(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES('SQL function :Brigzy (C#Screw) Looping',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SELECT @StartTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp_MichaelMeierruth_Original(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES('SQL function: Michael Meierruth Original',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SELECT @StartTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp_MichaelMeierruth_Original(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES('SQL function: Michael Meierruth : with hard coded spaces',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SET @StartTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_Looping_CLR(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES('CLR: Brigzy (C#Screw) Looping',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SET @StartTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_NotLooping_CLR(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES('CLR: Not looping using ||*9*9|| technique in C#',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SET @StartTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_SingleChar_CLR(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES ('CLR: using Jeff''s single char technique in C#',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SET @StartTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_JCB_CLR(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES('CLR :using JCBnew space replacer C#',DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SET @StartTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_PW_CLR(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES('CLR: using Paul White space replacer C#' ,DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SET @StartTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_FLO_CLR(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES ('CLR: using Flo s space replacer C#' ,DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    DECLARE @StartTime DATETIME

    SET @StartTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_MMSharpScrew_CLR(COL2)

    FROM #TEMP2

    INSERT INTO #tResults(Tag,Result)VALUES ('CLR: Michael Meierruth - C#Screw conversion: hard coded spaces' ,DATEDIFF(ms,@StartTime,GETDATE()))

    go 10

    SELECT Tag + CHAR(13) + CAST(AVG(Result) AS VARCHAR(20)) FROM #tResults GROUP BY

    Tag ORDER BY SUM(Result) DESC

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp_Brigzy

    DROP FUNCTION fn_CleanUp_JeffOriginal

    DROP FUNCTION fn_CleanUp_MultiChars

    DROP FUNCTION fn_CleanUp_MichaelMeierruth_Original

    DROP FUNCTION fn_CleanUp_MichaelMeierruth_MKII

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    DROP TABLE #tResults

    GO

    --SET STATISTICS TIME OFF

  • Excellent work so far, C# Screw. In particular, multiple runs with averages should allow much more consistent results, as later runs should end up with more consistent cache behavior.

    If you're interested in spending more time on it, I'd suggest test set expansion in two dimensions, and additional tuning/choice information:

    Add a test set for strings of size 30-60 (common sizes for names and addresses), and of size 6000-8000 (common "long string" sizes).

    And for result sets of more rows - hundreds of thousands and millions (which will illustrate how each technique scales).

    For additional tuning/choice information, from SQL Server Profiler, I always like to watch Reads, Writes, CPU, and Duration (usually, but not always, in that order). Reads and Writes in particular would let us get a view for how disk IO intensive each technique is.

    As before, excellent work running the benchmarks over and over, adding more of them each time.

    P.S. If you'd like to include the nested 2:1 (two spaces to one space) conversions at a depth of the Log base 2 of the length, I'd be very interested to see how that compares against the similar approaches (i.e. for length 30, 5 deep, for 60, 6 deep, for 900, 10 deep, for 8000, 13 deep).

    P.P.S. For truly interesting results, comparing the functions against a data set where the meaningful data is dense and the maximum number of contiguous spaces is 3 (i.e. normal sentences with some typos), and against another data set that's almost all spaces with a few non-space words (i.e. an inefficient space padding data source) might also show some differences (many small space groups vs a couple large space groups).

  • I will have to pass and offer your suggestions to someone with a realistic networked SQL server ... SQL script is above if u fancy a go - my poor Laptop strains with 50k rows ..:-)

  • Hello Jeff,

    Thank you so much for posting this solution! We will apply your method in a SSIS Derived Column Transformation on data coming from raw files. This data needs to be cleansed before it even reaches the SQL Server staging tables, so we can not use CLR functions or SQL functions.

    I will let you know how it performs, I'll try to use the horizontal tab (ASCII code 9) as the special character because this character will be replaced by a single space in this same transformation step. I'm curious about the performance in this context, but I must admit that your solution is much more elegant than my initial version with the 65, 33, 17, 9, 5, 3 an 2 spaces nested REPLACE construction. Your method can be applied on very wide columns without a limit to the number of contiguous spaces, and all it needs is one special character.

    We can use the same expression for every string column that needs to be cleansed and we only need to replace the column name once. This is an aspect of your method that comes in handy in real life situations where you need to cleans more than a few columns of a few tables.

    Keep up the good work!

    Dony.

  • Hey everyone, just a quick note before I go to work this morning. I love the way this thread it going - huge thanks especially to C# Screw (lol!) and Flo for tidying my (still slightly amateur) C# as usual! I have a fairly busy morning ahead but will look in this afternoon - no doubt by then someone will have found an improvement which runs in negative time :laugh:

    I hope Jeff pops by soon because I'm really interested in the huge performance difference (10x) I found working with the different collations with his cool method.

    Paul

  • Good morning Paul!

    Paul White (11/18/2009)


    ... my (still slightly amateur) C# as usual!

    Well, now I have to say this to you:

    I don't know how much you are working with C#, but you are definitely not an amateur in .NET. Your solution really fine appears to be one of the two fastest solutions 🙂

    Remember

    (To all the others: This is an insider to joke to Paul)

    Greets

    Flo

  • jcrawf02 (11/18/2009)


    Thought I'd tell you, you guys are just fun to watch. 😉

    And provide significant community benefit too!!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • C# Screw (11/18/2009)


    Jeff,

    as promised : all tests combined SQL & CLR

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2536

    SQL: using ||*9*9||

    6310

    SQL: Jeffs single char

    2916

    CLR: looping

    453

    CLR: Not looping using ||*9*9|| technique in C#

    296

    CLR: Not looping using Jeff's single char technique in C#

    416 <-- consistently c# performs slower single space bell character, surprising?

    Correction : Noticed one space missing in c#

    CLR: Not looping using Jeff's single char technique in C#

    246

    I am really sorry the c# input has caused some upset, I thought it was really useful/interesting.. so 10/10 article & productive discussion.

    Oh no... don't be sorry at all about that. I just wanted to be sure what we were testing against. Thank you very much for taking that time. And thanks for the 10-by marks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • GSquared (11/18/2009)


    Nadrek (11/18/2009)


    Jeff Moden (11/17/2009)

    You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.

    Not likely, perhaps... but very common to have about that many if someone started out with a CHAR(8000), even worse if they did concatenation on a CHAR(8000)+ another VARCHAR or CHAR + LEFT() or RIGHT() to get back to 8000. Some forms of data import also lead to excessive space padding. Are they good SQL? No. Do they happen? I believe they do.

    Also, yes, GSquared and I do apparently use some similar methods.

    I actually discovered this method almost 20 years ago, when I was building WordPerfect macros to clean up documents that had been typed up in an ASCII text editor. Used it to clean up spaces, underlines, even page-lines created by entering arbitrary numbers of hyphens. Came in very, very handy.

    I'll be damned - someone else who did macro-based document management using WordPerfect macros! How kewl is that!? I had 2.5M of macros that did a full-fledged work package management system that included change bar and version tracking, etc. Tied into a dbase IV database system too! 🙂

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Michael Meierruth (11/18/2009)


    Earlier on in this post I mentioned the use of a binary approach. I finally found some time to try this out.

    declare @s-2 varchar(8000)

    set @s-2 = '*' + replicate(' ',7998) + '*'

    select len(@s)

    set @s-2 = replace(replace(replace(replace(replace(replace(ltrim(rtrim(@s)),

    replicate(' ',32+1),' '),replicate(' ',16+1),' '),replicate(' ',8+1),' '),replicate(' ',4+1),' '),replicate(' ',2+1),' '),replicate(' ',1+1),' ')

    select len(@s)

    select @s-2

    Needless to say I have tried this with Jeff's data as well as with some random data and it works just fine.

    As for performance, I generated 100000 records and Jeff's code did in 110 seconds whereas my code did in 16 seconds.

    I shaved off a second by replacing the calls to replicate (which I used for clarity) with hardcoded constants or variables.

    Oh yes, Jeff's code ran at 100% CPU whereas my code never reached 100%.

    There is a recursive effect going on which I'm still trying to understand from a methematical point of view, i.e. for a given N+1 spaces what is the maximum length of blanks that can be reduce to 1 blank using the above approach.

    Of course, if someone can find a string with blanks where this doesn't work I will quietly shut up.

    Whoa! Very cool... I'm going to have to try that one for sure. Thanks, Michael!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TheSQLGuru (11/18/2009)


    GSquared (11/18/2009)


    Nadrek (11/18/2009)


    Jeff Moden (11/17/2009)

    You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.

    Not likely, perhaps... but very common to have about that many if someone started out with a CHAR(8000), even worse if they did concatenation on a CHAR(8000)+ another VARCHAR or CHAR + LEFT() or RIGHT() to get back to 8000. Some forms of data import also lead to excessive space padding. Are they good SQL? No. Do they happen? I believe they do.

    Also, yes, GSquared and I do apparently use some similar methods.

    I actually discovered this method almost 20 years ago, when I was building WordPerfect macros to clean up documents that had been typed up in an ASCII text editor. Used it to clean up spaces, underlines, even page-lines created by entering arbitrary numbers of hyphens. Came in very, very handy.

    I'll be damned - someone else who did macro-based document management using WordPerfect macros! How kewl is that!? I had 2.5M of macros that did a full-fledged work package management system that included change bar and version tracking, etc. Tied into a dbase IV database system too! 🙂

    Pipe down, us younguns can't hear the discussion over the creaking of your old bones...:hehe:

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

Viewing 15 posts - 121 through 135 (of 425 total)

You must be logged in to reply to this topic. Login to reply