January 18, 2006 at 1:25 pm
I have a text field that contains HTML tags. I need to strip these off before reporting them in SQL reporting services.
I know I can write a code to strip off the HTML tags. But I would like to do it in a stored proceedure.
Anyone have any luck with this?
January 18, 2006 at 1:56 pm
January 18, 2006 at 3:33 pm
<OL>
<LI>
<DIV align=center>Store 1151 has a USB only Dell <BR>The <U><EM><STRONG><FONT color=#ff6633>New KVM is PS2 only</FONT></STRONG></EM></U> <BR><BR>They need a replacment KVM that is USB. <BR><BR><A href="http://www.yahoo.com"><FONT style="BACKGROUND-COLOR: #0066ff">Currently using two monitors</FONT></A> and keyboards. <BR><BR><BR></DIV></LI></OL>
January 18, 2006 at 3:33 pm
this is kind of hard to do; i use an extended stored procedure in order to strip RTF or HTML tags;
see this post for the details, other examples, etc:
http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=150&messageid=242853#bm243180
Lowell
January 19, 2006 at 1:33 am
If I understand you correctly, I have the following suggestion:
declare @STR varchar(8000)
declare @str2 varchar(8000)
select @STR =
'<OL>
<LI>
<DIV align=center>Store 1151 has a USB only Dell <BR>The <U><EM><STRONG><FONT color=#ff6633>New KVM is PS2 only</FONT></STRONG></EM></U> <BR><BR>They need a replacment KVM that is USB. <BR><BR><A href="http://www.yahoo.com"><FONT style="BACKGROUND-COLOR: #0066ff">Currently using two monitors</FONT></A> and keyboards. <BR><BR><BR></DIV></LI></OL>'
select @str2 = ''
select @str2 = @str2 + substring(Item, charindex('>', Item) + 1, 8000) from dbo.splitString(@str, '<') order by SeqNo
select @str2
This uses a user-defined function splitString, which can be implemented in many ways. One of the most efficient (taken from other posts here at SSC) uses a Numbers table. For completeness I have also included a definition of this table:
SELECT TOP 10000 n = IDENTITY(INT, 1, 1) INTO Numbers
FROM
sysobjects a1
CROSS JOIN
sysobjects a2
CROSS JOIN
sysobjects a3
ALTER TABLE Numbers
ADD CONSTRAINT Index_Numbers PRIMARY KEY CLUSTERED(n)
GO
CREATE FUNCTION dbo.SplitString
(
@vcrArray varchar(8000),
@chrDelim char(1)
)
RETURNS
@Results TABLE
(
SeqNo int identity(1, 1),
Item varchar(8000)
)
AS
BEGIN
INSERT INTO @Results (Item)
SELECT SUBSTRING(@vcrArray, n, CHARINDEX(@chrDelim, @vcrArray + @chrDelim, n) - n)
FROM Numbers
where
n <= LEN(@vcrArray)
AND SUBSTRING(@chrDelim + @vcrArray, n, 1) = @chrDelim
ORDER BY n
RETURN
END
go
January 19, 2006 at 4:52 am
Create Function dbo.udf_STRIP_HTML
(
@Dirty nvarchar(4000)
)
returns nvarchar(4000)
as
begin
declare @Start int,
@End int,
@Length int
while charIndex(' 0 And charIndex('>', @Dirty, charIndex(' 0
begin
select @Start = charIndex('', @Dirty, charIndex(' 0
begin
select @Dirty = stuff(@Dirty, @Start, @Length, '')
end
end
return @Dirty
end
January 19, 2006 at 5:10 am
Jesper and Rob,
I didn't study the SQL provided in your posts, but it seems very similar to the code I wrote this morning when I first saw this question. However, I saw that it has certain limitations and didn't post it... if the text contains the character "<" or ">" in a different meaning, like for example:
<br><b>When input < 0, display error.</b>Else continue.<br>
it considers the < (from < 0) as a start tag and result is "When input Else continue." That is obviously incorrect... As I see, Rob's SQL comes with the same incorrect result; not sure about Jesper's, but I suspect that it would be the same.
As I see it, there is no other way than to use something that really knows all the HTML tags and can differentiate between tags and less than/greater than comparisions - which is probably the extended procedure mentioned before in one of the first replies.
January 19, 2006 at 5:22 am
<br><b>When input < 0, display error.</b>Else continue.<br>
is incorrect HTML (or at least incorrect XHTML). You should use
<br><b>When input < 0, display error.</b>Else continue.<br>
Rob has presented a simple and straightforward solution, which is easier to read and maintain. However, I think my suggestion is better when it comes to performance....
I think we need some feedback from ms. Wright...
January 19, 2006 at 6:40 am
OK, I'm not that experienced in HTML so I better supposed that such possibility (< or > inside the text) exists. If that does not happen, everything is fine. I just checked my code with Rob's and it is very similar:
CREATE FUNCTION dbo.remove_tags (@text varchar(1024))
RETURNS VARCHAR(1024)
AS
BEGIN
WHILE CHARINDEX('<',@text) > 0 AND CHARINDEX('>',@text) > 1
BEGIN
SELECT @text = SUBSTRING(@text, 0, CHARINDEX('<',@text))+SUBSTRING(@text,CHARINDEX('>',@text)+1,LEN(@text))
END
RETURN @text
END
As I said, it ONLY works if the characters < and > are used exclusively to mark start and end of a tag. As soon as this can not be guaranteed, it will run into serious problems... Most of them can be avoided with the following check - everything with the exception of a text that contains both < and > not meant as tags, and only if it is in this order, will be either processed correctly or return the string 'ERROR'. Only texts like '<br>A < 21 and B > 99<br>' will return incorrect result, without any warning. I don't think this can be avoided in any way, unless you explicitly name the valid tags.
ALTER FUNCTION dbo.remove_tags (@text varchar(1024))
RETURNS VARCHAR(1024)
AS
BEGIN
WHILE CHARINDEX('<',@text) > 0 AND CHARINDEX('>',@text) > 1
BEGIN
IF CHARINDEX('>',@text)< CHARINDEX('<',@text)
OR CHARINDEX('<',SUBSTRING(@text,CHARINDEX('<',@text)+1,CHARINDEX('>',@text)-CHARINDEX('<',@text)))> 0
SELECT @text = 'ERROR'
ELSE SELECT @text = SUBSTRING(@text, 0, CHARINDEX('<',@text))+SUBSTRING(@text,CHARINDEX('>',@text)+1,LEN(@text))
END
RETURN @text
END
BTW Rob, I like your idea to use STUFF with empty string... it didn't occur to me, so I stuck to classical solution with substrings.
January 19, 2006 at 6:55 am
as Vladan identified, malformed HTML will result in an unwanted result from regular expressions.
here i used the extended stored proc i mentioned before:
declare @string varchar(8000),@regExpr varchar(500)
set @string='<OL>
<LI>
<DIV align=center>Store 1151 has a USB only Dell <BR>The <U><EM><STRONG><FONT color=#ff6633>New KVM is PS2 only</FONT></STRONG></EM></U> <BR><BR>They need a replacment KVM that is USB. <BR><BR><A href="http://www.yahoo.com"><FONT style="BACKGROUND-COLOR: #0066ff">Currently using two monitors</FONT></A> and keyboards. <BR><BR><BR></DIV></LI></OL>
'
set @regExpr='<(.|\n)*?>'
select dbo.fn_regex_replace(@string,@regExpr,'') as results
set @string='<br><b>When input < 0, display error.</b>Else continue.<br>'
select dbo.fn_regex_replace(@string,@regExpr,'') as results
results
Store 1151 has a USB only Dell The New KVM is PS2 only They need a replacment KVM that is USB. Currently using two monitors and keyboards.
results
When input Else continue.
Lowell
January 19, 2006 at 7:06 am
My suggestion transforms
'<br><b>When input < 0, display error.</b>Else continue.<br>'
to
'When input 0, display error.Else continue.'
And
'<br><b>When input > 0, display error.</b>Else continue.<br>'
is actually transformed to
'When input > 0, display error.Else continue.'
But this extended stored procedure thing is interesting... What do I need to get this fn_regex_replace thing working?
January 19, 2006 at 7:29 am
it's godawful easy; download the source fromt eh web link i posted.
copy the dll from the release folder to your c:\program files\microsoft sql server\$instancename\Binn directory.
run the .SQL script that is included in the source.
this is one of those tools that you only appreciate after you start using it; doing things like cleaning update for addresses, stripping RTF or HTML tags are just a few of the many uses i've used it for
Lowell
January 20, 2006 at 2:33 am
Thanks, I will give it a try...
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply