String Manipulation

  • Here's the code with the last step of counting the actual words...

    --===== Replace all HTML tags with nothing

    WHILE CHARINDEX(' CHARINDEX('<',@HTMLText)

    SELECT @HTMLText = STUFF(@HTMLText,

    CHARINDEX('<',@HTMLText),

    CHARINDEX('>',@HTMLText)-CHARINDEX('<',@HTMLText)+1,

    '')

    --===== Replace all EntityCodes with the actual ascii character (Just to be safe instead of dumping them)

    WHILE CHARINDEX('&',@HTMLText) > 0 AND CHARINDEX(';',@HTMLText) > CHARINDEX('&',@HTMLText)

    AND CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)=4

    SELECT @HTMLText = STUFF(@HTMLText,

    CHARINDEX('&',@HTMLText),

    CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)+1,

    CHAR(SUBSTRING(@HTMLText,CHARINDEX('&',@HTMLText)+1,3)))

    --===== Replace all special characters (except dashes and spaces) and digits with a space

    WHILE PATINDEX('%[^- A-Z]%',@HTMLText) > 0

    SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('%[^- A-Z]%',@HTMLText),1,' ')

    --===== Replace single letter words with a space

    WHILE PATINDEX('% _ %',@HTMLText) > 0

    SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('% _ %',@HTMLText),3,' ')

    --===== Replace multiple spaces with a single space

    WHILE CHARINDEX(' ',@HTMLText) > 0

    SELECT @HTMLText = REPLACE(@HTMLText,' ',' ')

    --===== Drop any leading or trailing spaces

    SELECT @HTMLText = LTRIM(RTRIM(@HTMLText))

    --===== Now, count the words

    SELECT ISNULL(NULLIF(LEN(@HTMLText),0)-LEN(REPLACE(@HTMLText,' ','')) + 1, 0)

    Since these are memory only single-step WHILE loops, as much as I hate to admit it, they'll beat the Tally table quite handily. They'll probably be fast enough to smoke everything except an honest to goodness RegEx CLR but they'll probably be fast enough to make that not worth it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • RBarryYoung (3/14/2009)


    Jeff Moden (3/14/2009)


    RBarryYoung (3/14/2009)


    I know that this'll drive Jeff crazy, but I actually think that this would work better as a CLR Trigger.

    It might... but Nah... "we don need no stinkin' CLR" 😛

    Uh, oh. Now that sounds like a challenge! ... 😀

    The gauntlet has, indeed, been thrown... go for it. But, heh... I'll have to take your word for it because I never intend to enable CLR's on my machines. 😀 For that matter, being the data troll that I am, I gave up "GUI" code a long time ago and can't even spell C#. :hehe:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/14/2009)


    The gauntlet has, indeed, been thrown... go for it. But, heh... I'll have to take your word for it because I never intend to enable CLR's on my machines. 😀 For that matter, being the data troll that I am, I gave up "GUI" code a long time ago and can't even spell C#. :hehe:

    Oh, don't worry, I prefer vb.Net anyway. Uh, for personal use, of course. Professionally, I don't care either way. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Goldie: in Rule 2 (Step 1) what character are you replacing with a space? All of my attempts to extract/examine it from your post show it as a space too.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Barry,

    Just making sure you noticed, but Goldie said that the CLR is not an option in this case. I had suggested it originally as well.

  • Jack Corbett (3/15/2009)


    Barry,

    Just making sure you noticed, but Goldie said that the CLR is not an option in this case. I had suggested it originally as well.

    Heh... yeah... but it's still a great opportunity to further compare CLR solutions with T-SQL. I, for one, would like to see someone write the CLR... two are possible, I believe... one that uses RegEx (which will likely be the fastest) and one that does not.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • RBarryYoung (3/15/2009)


    Goldie: in Rule 2 (Step 1) what character are you replacing with a space? All of my attempts to extract/examine it from your post show it as a space too.

    It is the character code for a non break space. ( nbsp )

    I didn't realize the forum decoded it.

  • Jeff, that looks awesome.

    I'm going to test it out at work tomorrow.

    One part I don't quite understand is this:

    --===== Replace all EntityCodes with the actual ascii character (Just to be safe instead of dumping them)

    WHILE CHARINDEX('&',@HTMLText) > 0 AND CHARINDEX(';',@HTMLText) > CHARINDEX('&',@HTMLText)

    AND CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)=4

  • Erp! :blush: Well that takes all the fun out of it...

    Oh well, maybe I can come up with an interesting alternate approach... 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Jeff Moden (3/15/2009)


    Jack Corbett (3/15/2009)


    Barry,

    Just making sure you noticed, but Goldie said that the CLR is not an option in this case. I had suggested it originally as well.

    Heh... yeah... but it's still a great opportunity to further compare CLR solutions with T-SQL. I, for one, would like to see someone write the CLR... two are possible, I believe... one that uses RegEx (which will likely be the fastest) and one that does not.

    Actually, the fastest way to do these is almost always a DFSA Transducer (deterministic finite-state automaton), but they are definitely not "ad-hoc" like RegEx is. They used to be easy enough to do in the old Visual Studio environment but .Net has immutable strings, which makes it a pain in the neck.

    Hmm, maybe I will give it a try... 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Jeff Moden (3/15/2009)


    Jack Corbett (3/15/2009)


    Barry,

    Just making sure you noticed, but Goldie said that the CLR is not an option in this case. I had suggested it originally as well.

    Heh... yeah... but it's still a great opportunity to further compare CLR solutions with T-SQL. I, for one, would like to see someone write the CLR... two are possible, I believe... one that uses RegEx (which will likely be the fastest) and one that does not.

    I agree, I just wanted Barry to be aware that this was not an instance where the OP could implement a CLR solution. So if he wanted to do it as a community service I'd enjoy it. I may actually give it a shot, if I can find time, even if my .NET skills aren't great. It might make for an article or at least a good blog post.

  • Goldie:

    What are the start and end search strings that you are using for the Script Tags (step 3)? I just have spaces...?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (3/15/2009)


    Goldie:

    What are the start and end search strings that you are using for the Script Tags (step 3)? I just have spaces...?

    I have to remove script tags <script </script>

    I have to remove style tags <style </style>

    I also have to remove all ascii encoded characters.

    The non-breaking spaces need to be replaced with a space.

  • Just the tags or everything that they contain also?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (3/15/2009)


    Just the tags or everything that they contain also?

    For script & style tags I need to remove everything they contain.

    For other HTML tags I can just remove the tags themselves.

Viewing 15 posts - 16 through 30 (of 60 total)

You must be logged in to reply to this topic. Login to reply