March 13, 2009 at 7:01 am
I have received a requirement to write a trigger that accepts a string and then performs the following actions.
1.Remove Script tags and contents
2.Replace all HTML tags with spaces.
3.Replace with a space
4.Replace all entity codes (“&xxx;”) with X
5.Replace all punctuation and math symbols (.,;:’”&()[]+/<>≥≤°÷) with a space (dashes are not replaced) Add percent, back slash, underscore, carat, asterisk, equal sign, curly braces, question and exclamation marks, pipes, dollar and cents signs, pound sign, tabs, crlf
6.Replace all digits with a space
7.Replace all one letter words with a space (pattern: space single-character-wildcard space)
8.Remove redundant spaces.
Of course, speedy performance is a must.
I wrote the following UDF which I am now looking to optimize.
Any thoughts?
set ANSI_NULLS ON
set QUOTED_IDENTIFIER ON
go
ALTER FUNCTION [dbo].[StripHTML]
(@HTMLText varchar(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int
DECLARE @OriginalLength INT
-- 0. prepend and append a space to the string
-- this is so that single character words at the beginning of the string
SELECT @HtmlText = ' ' + @HtmlText + ' '
-- 1. Replace with an actual space
SELECT @HtmlText = REPLACE(@HtmlText, ' ', ' ')
-- 2. Replace all other character codes with an X
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0)
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
END
-- 3. Remove tags and contents
SET @Start = CHARINDEX(' ', @HTMLText)
SET @End = CHARINDEX(' ', @HTMLText, @Start)
SET @Length = (@End - @Start) + 9
WHILE (@Start > 0 AND @End > 0 AND @Length > 0)
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX(' ', @HTMLText)
SET @End = CHARINDEX(' ', @HTMLText, @Start)
SET @Length = (@End - @Start) + 9
END
-- 4. Remove all HTML tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0)
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
END
-- 5. Remove all special characters
DECLARE @CleanString VARCHAR(MAX)
;WITH CTE_TALLY AS
(
-- Create a Tally CTE from 1 to whatever the length of the parameter is
SELECT TOP (LEN(@HTMLText)) ROW_NUMBER() OVER (ORDER BY t1.Object_ID) AS N
FROM Master.sys.All_Columns t1
CROSS JOIN Master.sys.All_Columns t2
)
SELECT @CleanString = ISNULL(@CleanString,'') + SUBSTRING(@HTMLText,N,1)
FROM CTE_TALLY WITH (NOLOCK)
WHERE N <= LEN(@HTMLText)
AND SUBSTRING(@HTMLText,N,1) NOT LIKE ('[~`!@#$%^&*()¢<>+_={}|\/?".,''≥≤°÷
0123456789]')
AND SUBSTRING(@HTMLText,N,1) NOT LIKE ']'
AND SUBSTRING(@HTMLText,N,1) NOT LIKE '['
-- 6. Remove single character words
WHILE PATINDEX('% _ %' , @CleanString) > 0
SET @CleanString = STUFF(@CleanString, PATINDEX('% _ %',@CleanString), 3, ' ')
-- 7. Remove all extra spaces
WHILE CharIndex(' ',@CleanString) > 0
SELECT @CleanString = Replace(@CleanString, ' ', ' ')
RETURN LTRIM(RTRIM(@CleanString))
END
March 13, 2009 at 7:45 am
I have to say that this sounds like the ideal place to create a CLR function so you can use Regular Expressions to clean up the data. I wish I could offer more advice than that, but I haven't done it so I can't provide any code do it.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
March 13, 2009 at 9:23 am
Jack Corbett (3/13/2009)
I have to say that this sounds like the ideal place to create a CLR function so you can use Regular Expressions to clean up the data. I wish I could offer more advice than that, but I haven't done it so I can't provide any code do it.
Unfortunately, we can not use CLR 🙁
I'm wondering if there is a smarter way to do any of the steps, or if multiple steps can be combined into one.
March 13, 2009 at 5:18 pm
Thoughts, since you have some logic laid out already.
First, I see you know how to use tally tables, so use them ... over and over again... get rid of all your while loops. Let me know if you need examples.
Second, if you are going to be running this against entire sets of rows, turn it into an inline table valued function. Use CROSS APPLY to return the cleaned up values. It will run much faster.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 14, 2009 at 11:08 am
Goldie Graber (3/13/2009)
I have received a requirement to write a trigger that accepts a string and then performs the following actions.1.Remove Script tags and contents
2.Replace all HTML tags with spaces.
3.Replace with a space
4.Replace all entity codes (“&xxx;”) with X
5.Replace all punctuation and math symbols (.,;:’”&()[]+/<>==°÷) with a space (dashes are not replaced) Add percent, back slash, underscore, carat, asterisk, equal sign, curly braces, question and exclamation marks, pipes, dollar and cents signs, pound sign, tabs, crlf
6.Replace all digits with a space
7.Replace all one letter words with a space (pattern: space single-character-wildcard space)
8.Remove redundant spaces.
Of course, speedy performance is a must.
Hi Goldie...
Not sure I can help but it would make it easier to try if you attached a file that has an example HTML that you're trying to clean up.
Also, just curious, why does this need to be done? I see your good "rules" above, but what is the purpose behind all of those rules?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2009 at 7:34 pm
Bob Hovious (3/13/2009)
Thoughts, since you have some logic laid out already.First, I see you know how to use tally tables, so use them ... over and over again... get rid of all your while loops. Let me know if you need examples.
Second, if you are going to be running this against entire sets of rows, turn it into an inline table valued function. Use CROSS APPLY to return the cleaned up values. It will run much faster.
I'm not sure how to use a tally table to get rid of the HTML tags.
I'm also thinking there should be a way to combine some of my steps, but I can't seem to figure out how.
March 14, 2009 at 7:39 pm
Jeff Moden (3/14/2009)
Hi Goldie...
Not sure I can help but it would make it easier to try if you attached a file that has an example HTML that you're trying to clean up.
Also, just curious, why does this need to be done? I see your good "rules" above, but what is the purpose behind all of those rules?
The purpose of this function is actually to get a word count. We have to first strip out anything that is not a a word and then count. The last requirement on the list is:
9. Count words. If (length of new string) > 0 (number of words = (length of new string) – (length of new string with spaces removed) + 1) else 0
I have attached a sample HTML snippet.
March 14, 2009 at 7:43 pm
Goldie Graber (3/14/2009)
Jeff Moden (3/14/2009)
Hi Goldie...
Not sure I can help but it would make it easier to try if you attached a file that has an example HTML that you're trying to clean up.
Also, just curious, why does this need to be done? I see your good "rules" above, but what is the purpose behind all of those rules?
The purpose of this function is actually to get a word count. We have to first strip out anything that is not a a word and then count. The last requirement on the list is:
9. Count words. If (length of new string) > 0 (number of words = (length of new string) – (length of new string with spaces removed) + 1) else 0
I have attached a sample HTML snippet.
Something happened in the transalation... it looks like plain text rather than a sample of HTML. Could you attach a sample of the HTML you're trying to strip out?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2009 at 8:11 pm
Download the file and save as text.
If you try to view it in your browser it will translate the HTML tags.
March 14, 2009 at 9:37 pm
Goldie Graber (3/14/2009)
Download the file and save as text.If you try to view it in your browser it will translate the HTML tags.
Ack... of course... stupid me. Thanks, Goldie.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2009 at 10:42 pm
I know that this'll drive Jeff crazy, but I actually think that this would work better as a CLR Trigger.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 14, 2009 at 10:56 pm
As much as I praise their use, there are some things that the Tally table won't beat a WHILE loop at... cleaning strings using STUFF is one of those things provided that you do it just right. Here's what I came up with...
--===== Replace all HTML tags with nothing
WHILE CHARINDEX(' CHARINDEX('<',@HTMLText)
SELECT @HTMLText = STUFF(@HTMLText,
CHARINDEX('<',@HTMLText),
CHARINDEX('>',@HTMLText)-CHARINDEX('<',@HTMLText)+1,
'')
--===== Replace all EntityCodes with the actual ascii character (Just to be safe instead of dumping them)
WHILE CHARINDEX('&',@HTMLText) > 0 AND CHARINDEX(';',@HTMLText) > CHARINDEX('&',@HTMLText)
AND CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)=4
SELECT @HTMLText = STUFF(@HTMLText,
CHARINDEX('&',@HTMLText),
CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)+1,
CHAR(SUBSTRING(@HTMLText,CHARINDEX('&',@HTMLText)+1,3)))
--===== Replace all special characters (except dashes and spaces) and digits with a space
WHILE PATINDEX('%[^- A-Z]%',@HTMLText) > 0
SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('%[^- A-Z]%',@HTMLText),1,' ')
--===== Replace single letter words with a space
WHILE PATINDEX('% _ %',@HTMLText) > 0
SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('% _ %',@HTMLText),3,' ')
--===== Replace multiple spaces with a single space
WHILE CHARINDEX(' ',@HTMLText) > 0
SELECT @HTMLText = REPLACE(@HTMLText,' ',' ')
--===== Drop any leading or trailing spaces
SELECT @HTMLText = LTRIM(RTRIM(@HTMLText))
And, here's what the result looks like... (I let it wrap on this screen)...
DOSAGE Adults-- to teaspoonfuls Children six to twelve-- to teaspoonful Children two to six years-- teaspoonful These doses may be repeated in four hours if necessary but not more than four times in twenty-four hours WARNING Persons with high fever or persistent cough should not use this preparation unless directed by physician Do not exceed recommended dosage
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2009 at 11:01 pm
RBarryYoung (3/14/2009)
I know that this'll drive Jeff crazy, but I actually think that this would work better as a CLR Trigger.
It might... but Nah... "we don need no stinkin' CLR" 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2009 at 11:02 pm
Sorry Goldie... I forgot the last step... I'll be right back...
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2009 at 11:07 pm
Jeff Moden (3/14/2009)
RBarryYoung (3/14/2009)
I know that this'll drive Jeff crazy, but I actually think that this would work better as a CLR Trigger.It might... but Nah... "we don need no stinkin' CLR" 😛
Uh, oh. Now that sounds like a challenge! ... 😀
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
Viewing 15 posts - 1 through 15 (of 60 total)
You must be logged in to reply to this topic. Login to reply