March 14, 2009 at 11:15 pm
Here's the code with the last step of counting the actual words...
--===== Replace all HTML tags with nothing
WHILE CHARINDEX(' CHARINDEX('<',@HTMLText)
SELECT @HTMLText = STUFF(@HTMLText,
CHARINDEX('<',@HTMLText),
CHARINDEX('>',@HTMLText)-CHARINDEX('<',@HTMLText)+1,
'')
--===== Replace all EntityCodes with the actual ascii character (Just to be safe instead of dumping them)
WHILE CHARINDEX('&',@HTMLText) > 0 AND CHARINDEX(';',@HTMLText) > CHARINDEX('&',@HTMLText)
AND CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)=4
SELECT @HTMLText = STUFF(@HTMLText,
CHARINDEX('&',@HTMLText),
CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)+1,
CHAR(SUBSTRING(@HTMLText,CHARINDEX('&',@HTMLText)+1,3)))
--===== Replace all special characters (except dashes and spaces) and digits with a space
WHILE PATINDEX('%[^- A-Z]%',@HTMLText) > 0
SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('%[^- A-Z]%',@HTMLText),1,' ')
--===== Replace single letter words with a space
WHILE PATINDEX('% _ %',@HTMLText) > 0
SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('% _ %',@HTMLText),3,' ')
--===== Replace multiple spaces with a single space
WHILE CHARINDEX(' ',@HTMLText) > 0
SELECT @HTMLText = REPLACE(@HTMLText,' ',' ')
--===== Drop any leading or trailing spaces
SELECT @HTMLText = LTRIM(RTRIM(@HTMLText))
--===== Now, count the words
SELECT ISNULL(NULLIF(LEN(@HTMLText),0)-LEN(REPLACE(@HTMLText,' ','')) + 1, 0)
Since these are memory only single-step WHILE loops, as much as I hate to admit it, they'll beat the Tally table quite handily. They'll probably be fast enough to smoke everything except an honest to goodness RegEx CLR but they'll probably be fast enough to make that not worth it.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2009 at 11:19 pm
RBarryYoung (3/14/2009)
Jeff Moden (3/14/2009)
RBarryYoung (3/14/2009)
I know that this'll drive Jeff crazy, but I actually think that this would work better as a CLR Trigger.It might... but Nah... "we don need no stinkin' CLR" 😛
Uh, oh. Now that sounds like a challenge! ... 😀
The gauntlet has, indeed, been thrown... go for it. But, heh... I'll have to take your word for it because I never intend to enable CLR's on my machines. 😀 For that matter, being the data troll that I am, I gave up "GUI" code a long time ago and can't even spell C#. :hehe:
--Jeff Moden
Change is inevitable... Change for the better is not.
March 15, 2009 at 6:08 am
Jeff Moden (3/14/2009)
The gauntlet has, indeed, been thrown... go for it. But, heh... I'll have to take your word for it because I never intend to enable CLR's on my machines. 😀 For that matter, being the data troll that I am, I gave up "GUI" code a long time ago and can't even spell C#. :hehe:
Oh, don't worry, I prefer vb.Net anyway. Uh, for personal use, of course. Professionally, I don't care either way. 🙂
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 15, 2009 at 6:23 am
Goldie: in Rule 2 (Step 1) what character are you replacing with a space? All of my attempts to extract/examine it from your post show it as a space too.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 15, 2009 at 6:48 am
Barry,
Just making sure you noticed, but Goldie said that the CLR is not an option in this case. I had suggested it originally as well.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
March 15, 2009 at 9:02 am
Jack Corbett (3/15/2009)
Barry,Just making sure you noticed, but Goldie said that the CLR is not an option in this case. I had suggested it originally as well.
Heh... yeah... but it's still a great opportunity to further compare CLR solutions with T-SQL. I, for one, would like to see someone write the CLR... two are possible, I believe... one that uses RegEx (which will likely be the fastest) and one that does not.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 15, 2009 at 9:37 am
RBarryYoung (3/15/2009)
Goldie: in Rule 2 (Step 1) what character are you replacing with a space? All of my attempts to extract/examine it from your post show it as a space too.
It is the character code for a non break space. ( nbsp )
I didn't realize the forum decoded it.
March 15, 2009 at 9:40 am
Jeff, that looks awesome.
I'm going to test it out at work tomorrow.
One part I don't quite understand is this:
--===== Replace all EntityCodes with the actual ascii character (Just to be safe instead of dumping them)
WHILE CHARINDEX('&',@HTMLText) > 0 AND CHARINDEX(';',@HTMLText) > CHARINDEX('&',@HTMLText)
AND CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)=4
March 15, 2009 at 10:29 am
Erp! :blush: Well that takes all the fun out of it...
Oh well, maybe I can come up with an interesting alternate approach... 🙂
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 15, 2009 at 11:19 am
Jeff Moden (3/15/2009)
Jack Corbett (3/15/2009)
Barry,Just making sure you noticed, but Goldie said that the CLR is not an option in this case. I had suggested it originally as well.
Heh... yeah... but it's still a great opportunity to further compare CLR solutions with T-SQL. I, for one, would like to see someone write the CLR... two are possible, I believe... one that uses RegEx (which will likely be the fastest) and one that does not.
Actually, the fastest way to do these is almost always a DFSA Transducer (deterministic finite-state automaton), but they are definitely not "ad-hoc" like RegEx is. They used to be easy enough to do in the old Visual Studio environment but .Net has immutable strings, which makes it a pain in the neck.
Hmm, maybe I will give it a try... 🙂
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 15, 2009 at 11:19 am
Jeff Moden (3/15/2009)
Jack Corbett (3/15/2009)
Barry,Just making sure you noticed, but Goldie said that the CLR is not an option in this case. I had suggested it originally as well.
Heh... yeah... but it's still a great opportunity to further compare CLR solutions with T-SQL. I, for one, would like to see someone write the CLR... two are possible, I believe... one that uses RegEx (which will likely be the fastest) and one that does not.
I agree, I just wanted Barry to be aware that this was not an instance where the OP could implement a CLR solution. So if he wanted to do it as a community service I'd enjoy it. I may actually give it a shot, if I can find time, even if my .NET skills aren't great. It might make for an article or at least a good blog post.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
March 15, 2009 at 12:29 pm
Goldie:
What are the start and end search strings that you are using for the Script Tags (step 3)? I just have spaces...?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 15, 2009 at 5:40 pm
RBarryYoung (3/15/2009)
Goldie:What are the start and end search strings that you are using for the Script Tags (step 3)? I just have spaces...?
I have to remove script tags <script </script>
I have to remove style tags <style </style>
I also have to remove all ascii encoded characters.
The non-breaking spaces need to be replaced with a space.
March 15, 2009 at 5:47 pm
Just the tags or everything that they contain also?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 15, 2009 at 9:06 pm
RBarryYoung (3/15/2009)
Just the tags or everything that they contain also?
For script & style tags I need to remove everything they contain.
For other HTML tags I can just remove the tags themselves.
Viewing 15 posts - 16 through 30 (of 60 total)
You must be logged in to reply to this topic. Login to reply