String Manipulation

  • I have received a requirement to write a trigger that accepts a string and then performs the following actions.

    1.Remove Script tags and contents

    2.Replace all HTML tags with spaces.

    3.Replace with a space

    4.Replace all entity codes (“&xxx;”) with X

    5.Replace all punctuation and math symbols (.,;:’”&()[]+/<>≥≤°÷) with a space (dashes are not replaced) Add percent, back slash, underscore, carat, asterisk, equal sign, curly braces, question and exclamation marks, pipes, dollar and cents signs, pound sign, tabs, crlf

    6.Replace all digits with a space

    7.Replace all one letter words with a space (pattern: space single-character-wildcard space)

    8.Remove redundant spaces.

    Of course, speedy performance is a must.

    I wrote the following UDF which I am now looking to optimize.

    Any thoughts?

    set ANSI_NULLS ON

    set QUOTED_IDENTIFIER ON

    go

    ALTER FUNCTION [dbo].[StripHTML]

    (@HTMLText varchar(MAX))

    RETURNS VARCHAR(MAX)

    AS

    BEGIN

    DECLARE @Start int

    DECLARE @End int

    DECLARE @Length int

    DECLARE @OriginalLength INT

    -- 0. prepend and append a space to the string

    -- this is so that single character words at the beginning of the string

    SELECT @HtmlText = ' ' + @HtmlText + ' '

    -- 1. Replace with an actual space

    SELECT @HtmlText = REPLACE(@HtmlText, ' ', ' ')

    -- 2. Replace all other character codes with an X

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    END

    -- 3. Remove tags and contents

    SET @Start = CHARINDEX(' ', @HTMLText)

    SET @End = CHARINDEX(' ', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 9

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')

    SET @Start = CHARINDEX(' ', @HTMLText)

    SET @End = CHARINDEX(' ', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 9

    END

    -- 4. Remove all HTML tags

    SET @Start = CHARINDEX('<', @HTMLText)

    SET @End = CHARINDEX('>', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')

    SET @Start = CHARINDEX('<', @HTMLText)

    SET @End = CHARINDEX('>', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    END

    -- 5. Remove all special characters

    DECLARE @CleanString VARCHAR(MAX)

    ;WITH CTE_TALLY AS

    (

    -- Create a Tally CTE from 1 to whatever the length of the parameter is

    SELECT TOP (LEN(@HTMLText)) ROW_NUMBER() OVER (ORDER BY t1.Object_ID) AS N

    FROM Master.sys.All_Columns t1

    CROSS JOIN Master.sys.All_Columns t2

    )

    SELECT @CleanString = ISNULL(@CleanString,'') + SUBSTRING(@HTMLText,N,1)

    FROM CTE_TALLY WITH (NOLOCK)

    WHERE N <= LEN(@HTMLText)

    AND SUBSTRING(@HTMLText,N,1) NOT LIKE ('[~`!@#$%^&*()¢<>+_={}|\/?".,''≥≤°÷

    0123456789]')

    AND SUBSTRING(@HTMLText,N,1) NOT LIKE ']'

    AND SUBSTRING(@HTMLText,N,1) NOT LIKE '['

    -- 6. Remove single character words

    WHILE PATINDEX('% _ %' , @CleanString) > 0

    SET @CleanString = STUFF(@CleanString, PATINDEX('% _ %',@CleanString), 3, ' ')

    -- 7. Remove all extra spaces

    WHILE CharIndex(' ',@CleanString) > 0

    SELECT @CleanString = Replace(@CleanString, ' ', ' ')

    RETURN LTRIM(RTRIM(@CleanString))

    END

  • I have to say that this sounds like the ideal place to create a CLR function so you can use Regular Expressions to clean up the data. I wish I could offer more advice than that, but I haven't done it so I can't provide any code do it.

  • Jack Corbett (3/13/2009)


    I have to say that this sounds like the ideal place to create a CLR function so you can use Regular Expressions to clean up the data. I wish I could offer more advice than that, but I haven't done it so I can't provide any code do it.

    Unfortunately, we can not use CLR 🙁

    I'm wondering if there is a smarter way to do any of the steps, or if multiple steps can be combined into one.

  • Thoughts, since you have some logic laid out already.

    First, I see you know how to use tally tables, so use them ... over and over again... get rid of all your while loops. Let me know if you need examples.

    Second, if you are going to be running this against entire sets of rows, turn it into an inline table valued function. Use CROSS APPLY to return the cleaned up values. It will run much faster.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Goldie Graber (3/13/2009)


    I have received a requirement to write a trigger that accepts a string and then performs the following actions.

    1.Remove Script tags and contents

    2.Replace all HTML tags with spaces.

    3.Replace with a space

    4.Replace all entity codes (“&xxx;”) with X

    5.Replace all punctuation and math symbols (.,;:’”&()[]+/<>==°÷) with a space (dashes are not replaced) Add percent, back slash, underscore, carat, asterisk, equal sign, curly braces, question and exclamation marks, pipes, dollar and cents signs, pound sign, tabs, crlf

    6.Replace all digits with a space

    7.Replace all one letter words with a space (pattern: space single-character-wildcard space)

    8.Remove redundant spaces.

    Of course, speedy performance is a must.

    Hi Goldie...

    Not sure I can help but it would make it easier to try if you attached a file that has an example HTML that you're trying to clean up.

    Also, just curious, why does this need to be done? I see your good "rules" above, but what is the purpose behind all of those rules?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bob Hovious (3/13/2009)


    Thoughts, since you have some logic laid out already.

    First, I see you know how to use tally tables, so use them ... over and over again... get rid of all your while loops. Let me know if you need examples.

    Second, if you are going to be running this against entire sets of rows, turn it into an inline table valued function. Use CROSS APPLY to return the cleaned up values. It will run much faster.

    I'm not sure how to use a tally table to get rid of the HTML tags.

    I'm also thinking there should be a way to combine some of my steps, but I can't seem to figure out how.

  • Jeff Moden (3/14/2009)

    Hi Goldie...

    Not sure I can help but it would make it easier to try if you attached a file that has an example HTML that you're trying to clean up.

    Also, just curious, why does this need to be done? I see your good "rules" above, but what is the purpose behind all of those rules?

    The purpose of this function is actually to get a word count. We have to first strip out anything that is not a a word and then count. The last requirement on the list is:

    9. Count words. If (length of new string) > 0 (number of words = (length of new string) – (length of new string with spaces removed) + 1) else 0

    I have attached a sample HTML snippet.

  • Goldie Graber (3/14/2009)


    Jeff Moden (3/14/2009)

    Hi Goldie...

    Not sure I can help but it would make it easier to try if you attached a file that has an example HTML that you're trying to clean up.

    Also, just curious, why does this need to be done? I see your good "rules" above, but what is the purpose behind all of those rules?

    The purpose of this function is actually to get a word count. We have to first strip out anything that is not a a word and then count. The last requirement on the list is:

    9. Count words. If (length of new string) > 0 (number of words = (length of new string) – (length of new string with spaces removed) + 1) else 0

    I have attached a sample HTML snippet.

    Something happened in the transalation... it looks like plain text rather than a sample of HTML. Could you attach a sample of the HTML you're trying to strip out?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Download the file and save as text.

    If you try to view it in your browser it will translate the HTML tags.

  • Goldie Graber (3/14/2009)


    Download the file and save as text.

    If you try to view it in your browser it will translate the HTML tags.

    Ack... of course... stupid me. Thanks, Goldie.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I know that this'll drive Jeff crazy, but I actually think that this would work better as a CLR Trigger.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • As much as I praise their use, there are some things that the Tally table won't beat a WHILE loop at... cleaning strings using STUFF is one of those things provided that you do it just right. Here's what I came up with...

    --===== Replace all HTML tags with nothing

    WHILE CHARINDEX(' CHARINDEX('<',@HTMLText)

    SELECT @HTMLText = STUFF(@HTMLText,

    CHARINDEX('<',@HTMLText),

    CHARINDEX('>',@HTMLText)-CHARINDEX('<',@HTMLText)+1,

    '')

    --===== Replace all EntityCodes with the actual ascii character (Just to be safe instead of dumping them)

    WHILE CHARINDEX('&',@HTMLText) > 0 AND CHARINDEX(';',@HTMLText) > CHARINDEX('&',@HTMLText)

    AND CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)=4

    SELECT @HTMLText = STUFF(@HTMLText,

    CHARINDEX('&',@HTMLText),

    CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)+1,

    CHAR(SUBSTRING(@HTMLText,CHARINDEX('&',@HTMLText)+1,3)))

    --===== Replace all special characters (except dashes and spaces) and digits with a space

    WHILE PATINDEX('%[^- A-Z]%',@HTMLText) > 0

    SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('%[^- A-Z]%',@HTMLText),1,' ')

    --===== Replace single letter words with a space

    WHILE PATINDEX('% _ %',@HTMLText) > 0

    SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('% _ %',@HTMLText),3,' ')

    --===== Replace multiple spaces with a single space

    WHILE CHARINDEX(' ',@HTMLText) > 0

    SELECT @HTMLText = REPLACE(@HTMLText,' ',' ')

    --===== Drop any leading or trailing spaces

    SELECT @HTMLText = LTRIM(RTRIM(@HTMLText))

    And, here's what the result looks like... (I let it wrap on this screen)...

    DOSAGE Adults-- to teaspoonfuls Children six to twelve-- to teaspoonful Children two to six years-- teaspoonful These doses may be repeated in four hours if necessary but not more than four times in twenty-four hours WARNING Persons with high fever or persistent cough should not use this preparation unless directed by physician Do not exceed recommended dosage

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • RBarryYoung (3/14/2009)


    I know that this'll drive Jeff crazy, but I actually think that this would work better as a CLR Trigger.

    It might... but Nah... "we don need no stinkin' CLR" 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sorry Goldie... I forgot the last step... I'll be right back...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/14/2009)


    RBarryYoung (3/14/2009)


    I know that this'll drive Jeff crazy, but I actually think that this would work better as a CLR Trigger.

    It might... but Nah... "we don need no stinkin' CLR" 😛

    Uh, oh. Now that sounds like a challenge! ... 😀

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 15 posts - 1 through 15 (of 60 total)

You must be logged in to reply to this topic. Login to reply