String Manipulation

  • [font="Verdana"]You might be better off writing something in C# or VB, and linking it in as an assembly. That way you can use the .Net regular expressions.[/font]

  • Hey Bruce,

    If they're going to go that far, why not just help write some front-end editing and leave the db out of it? Or is this a batch load of some sort?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thanks, got it.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • [font="Verdana"]Bob,

    That would be ideal. No idea if they have that option though.

    Another option would be to cast it to some form of XML. No idea if that would work, just "thinking outside the square".

    T-SQL is a bit limited for in-depth string manipulation, particularly around regular expressions. So many time I have wished for a function that's the equivalent of a REPLACE(), but based around a regular expression.

    [/font]

  • Goldie Graber (3/15/2009)


    Jeff, that looks awesome.

    I'm going to test it out at work tomorrow.

    One part I don't quite understand is this:

    --===== Replace all EntityCodes with the actual ascii character (Just to be safe instead of dumping them)

    WHILE CHARINDEX('&',@HTMLText) > 0 AND CHARINDEX(';',@HTMLText) > CHARINDEX('&',@HTMLText)

    AND CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)=4

    One of your requirements was to ....

    4. Replace all entity codes (“&xxx;”) with X

    If you "smear" that requirement across all the other requirements including that of replacing certain symbols and single letter words with spaces, all I did was to first detect the entity code and then change the 3 digits to an ASCII code. For example, & 0 4 1 ; (without the spaces) would be replaced by a right parenthesis. It also allows for an & to precede a ; in the actual text without being converted because it checks for the proper length of the entity code (which is actually 5 but that's the way this type of math works out).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/16/2009)


    One of your requirements was to ....

    4. Replace all entity codes (“&xxx;”) with X

    If you "smear" that requirement across all the other requirements including that of replacing certain symbols and single letter words with spaces, all I did was to first detect the entity code and then change the 3 digits to an ASCII code. For example, & 0 4 1 ; (without the spaces) would be replaced by a right parenthesis. It also allows for an & to precede a ; in the actual text without being converted because it checks for the proper length of the entity code (which is actually 5 but that's the way this type of math works out).

    That's what I thought. Unfortunately, we can't assume that entity codes are exactly 5 characters.

    The ones which represent characters in other languages are longer.

    Here's the code I'm using now.

    -- Replace all other entity codes with an X

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    IF CHARINDEX(' ',SUBSTRING(@HTMLText, @Start, @Length)) = 0

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')

    SET @Start = CHARINDEX('&', @HTMLText, @Start + 1)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    END

  • By the way, Goldie... no special test is required for the non-breaking space because of the way I detect characters that need to be replaced with a space.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, do you have a test set for this?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • I can't seem to get better performance than my original function.

    I'm pasting it here for future reference.

    [font="Courier New"]SET ANSI_NULLS ON

    SET QUOTED_IDENTIFIER ON

    GO

    ALTER FUNCTION [dbo].[GetWordCount]

    (@HTMLText VARCHAR(MAX))

    RETURNS INT

    AS

    BEGIN

    DECLARE @Start  INT

    DECLARE @End    INT

    DECLARE @Length INT

    DECLARE @NewLength INT

    DECLARE @LengthWithoutSpaces INT

    DECLARE @WordCount INT

    -- Remove <head>

    SET @Start = CHARINDEX('<head>', @HTMLText)

    SET @End = CHARINDEX('</head>', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 7

    WHILE (@Start > 0 AND @End > 0)

    BEGIN

       SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')

    SET @Start = CHARINDEX('<head>', @HTMLText)

    SET @End = CHARINDEX('</head>', @HTMLText, @Start)

       SET @Length = (@End - @Start) + 7

    END

    -- Remove <script> tags and contents

    SET @Start = CHARINDEX('<script', @HTMLText)

    SET @End = CHARINDEX('</script>', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 8

    WHILE (@Start > 0 AND @End > 0)

    BEGIN

       SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')

       SET @Start = CHARINDEX('<script', @HTMLText)

       SET @End = CHARINDEX('</script>', @HTMLText, @Start)

       SET @Length = (@End - @Start) + 8

    END

    -- Remove <style> tags and contents

    SET @Start = CHARINDEX('<style', @HTMLText)

    SET @End = CHARINDEX('</style>', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 7

    WHILE (@Start > 0 AND @End > 0)

    BEGIN

       SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')

       SET @Start = CHARINDEX('<style', @HTMLText)

       SET @End = CHARINDEX('</style>', @HTMLText, @Start)

       SET @Length = (@End - @Start) + 7

    END

    -- Replace all HTML tags with a space

    SET @Start = CHARINDEX('<', @HTMLText)

    SET @End = CHARINDEX('>', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0)

    BEGIN

       SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')

       SET @Start = CHARINDEX('<', @HTMLText)

       SET @End = CHARINDEX('>', @HTMLText, @Start)

       SET @Length = (@End - @Start) + 1

    END

    -- Replace   with an actual space

    SELECT @HtmlText = REPLACE(@HtmlText, ' ', ' ')

    -- Replace all other entity codes with an X

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 )

    BEGIN

       IF CHARINDEX(' ',SUBSTRING(@HTMLText, @Start, @Length)) = 0

           SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')

       SET @Start = CHARINDEX('&', @HTMLText, @Start + 1)

       SET @End = CHARINDEX(';', @HTMLText, @Start)

       SET @Length = (@End - @Start) + 1

    END

    -- Replace all special characters with a space

    DECLARE @CleanString VARCHAR(MAX)

    ;WITH CTE_TALLY AS

    (

       -- Create a Tally CTE from 1 to whatever the length of the parameter is

       SELECT TOP (LEN(@HTMLText)) ROW_NUMBER() OVER (ORDER BY t1.OBJECT_ID) AS N  

       FROM MASTER.sys.All_Columns t1  

       CROSS JOIN MASTER.sys.All_Columns t2

    )

    SELECT @CleanString = ISNULL(@CleanString,'') + CASE WHEN SUBSTRING(@HTMLText,N,1) LIKE '[~`!@#$%^&*:;()¢<>+_={}|\/?".,''==°÷

    0123456789]' OR SUBSTRING(@HTMLText,N,1) LIKE ']' OR SUBSTRING(@HTMLText,N,1) LIKE '[' THEN ' ' ELSE SUBSTRING(@HTMLText,N,1) END

    FROM CTE_TALLY WITH (NOLOCK)

    WHERE N <= LEN(@HTMLText)

    -- Remove single character words

    SELECT @CleanString = ' ' + @CleanString + ' '

    WHILE PATINDEX('% _ %' , @CleanString) > 0

       SET @CleanString = STUFF(@CleanString, PATINDEX('% _ %',@CleanString), 3, ' ')

    -- Remove all extra spaces

    WHILE CHARINDEX('  ',@CleanString) > 0

       SELECT @CleanString = REPLACE(@CleanString, '  ', ' ')

    SELECT @CleanString = LTRIM(RTRIM(@CleanString))

    -- Calculate word count

    SELECT @NewLength = LEN(@CleanString)

    SELECT @LengthWithoutSpaces = LEN(REPLACE(@CleanString, ' ', ''))

    IF @LengthWithoutSpaces > 0

       SET @WordCount = (@NewLength - @LengthWithoutSpaces + 1)

    ELSE

       SET @WordCount = 0

    RETURN @WordCount

    END

    [/font]

  • RBarryYoung (3/16/2009)


    Jeff, do you have a test set for this?

    Only the "single line" of HTML Goldie previously attached a couple of posts back...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Goldie Graber (3/16/2009)


    I can't seem to get better performance than my original function.

    I'm pasting it here for future reference.

    Do you have the code for the new function... I know it will likely be nearly identical to the code I posted, but it would be good for me to check.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/16/2009)


    Goldie Graber (3/16/2009)


    I can't seem to get better performance than my original function.

    I'm pasting it here for future reference.

    Do you have the code for the new function... I know it will likely be nearly identical to the code I posted, but it would be good for me to check.

    I posted it in my last post.

    Very similar to yours, but not identical.

  • Goldie Graber (3/16/2009)


    Jeff Moden (3/16/2009)


    Goldie Graber (3/16/2009)


    I can't seem to get better performance than my original function.

    I'm pasting it here for future reference.

    Do you have the code for the new function... I know it will likely be nearly identical to the code I posted, but it would be good for me to check.

    I posted it in my last post.

    Very similar to yours, but not identical.

    Heh... well try mine. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Goldie Graber (3/16/2009)


    Jeff Moden (3/16/2009)


    One of your requirements was to ....

    4. Replace all entity codes (“&xxx;”) with X

    If you "smear" that requirement across all the other requirements including that of replacing certain symbols and single letter words with spaces, all I did was to first detect the entity code and then change the 3 digits to an ASCII code. For example, & 0 4 1 ; (without the spaces) would be replaced by a right parenthesis. It also allows for an & to precede a ; in the actual text without being converted because it checks for the proper length of the entity code (which is actually 5 but that's the way this type of math works out).

    That's what I thought. Unfortunately, we can't assume that entity codes are exactly 5 characters.

    The ones which represent characters in other languages are longer.

    Here's the code I'm using now.

    -- Replace all other entity codes with an X

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    IF CHARINDEX(' ',SUBSTRING(@HTMLText, @Start, @Length)) = 0

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')

    SET @Start = CHARINDEX('&', @HTMLText, @Start + 1)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    END

    What's the maximum length they can be?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/16/2009)


    Heh... well try mine. 😛

    I did. Mine was faster 😉

Viewing 15 posts - 31 through 45 (of 60 total)

You must be logged in to reply to this topic. Login to reply