strip HTML tages

  • I have a text field that contains HTML tags. I need to strip these off before reporting them in SQL reporting services. 

     I know I can write a code to strip off the HTML tags.  But I would like to do it in a stored proceedure.

    Anyone have any luck with this?

     

     

  • Can you give an example of how one of your fields would appear with the tags?

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • <OL>

    <LI>

    <DIV align=center>Store 1151 has a USB only Dell <BR>The <U><EM><STRONG><FONT color=#ff6633>New KVM is PS2 only</FONT></STRONG></EM></U> <BR><BR>They need a replacment KVM that is USB. <BR><BR><A href="http://www.yahoo.com"><FONT style="BACKGROUND-COLOR: #0066ff">Currently using two monitors</FONT></A> and keyboards. <BR><BR><BR></DIV></LI></OL>

  • this is kind of hard to do; i use an extended stored procedure in order to strip RTF or HTML tags;

    see this post for the details, other examples, etc:

    http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=150&messageid=242853#bm243180

     

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • If I understand you correctly, I have the following suggestion:

     

    declare @STR varchar(8000)

    declare @str2 varchar(8000)

    select @STR =

    '<OL>

    <LI>

    <DIV align=center>Store 1151 has a USB only Dell <BR>The <U><EM><STRONG><FONT color=#ff6633>New KVM is PS2 only</FONT></STRONG></EM></U> <BR><BR>They need a replacment KVM that is USB. <BR><BR><A href="http://www.yahoo.com"><FONT style="BACKGROUND-COLOR: #0066ff">Currently using two monitors</FONT></A> and keyboards. <BR><BR><BR></DIV></LI></OL>'

    select @str2 = ''

    select @str2 = @str2 + substring(Item, charindex('>', Item) + 1, 8000) from dbo.splitString(@str, '<') order by SeqNo

    select @str2

     

    This uses a user-defined function splitString, which can be implemented in many ways. One of the most efficient (taken from other posts here at SSC) uses a Numbers table. For completeness I have also included a definition of this table:

     

    SELECT TOP 10000 n = IDENTITY(INT, 1, 1) INTO Numbers

    FROM

        sysobjects a1

      CROSS JOIN

        sysobjects a2

      CROSS JOIN

        sysobjects a3

    ALTER TABLE Numbers

            ADD CONSTRAINT Index_Numbers PRIMARY KEY CLUSTERED(n)

    GO

     

    CREATE FUNCTION dbo.SplitString

    (

     @vcrArray varchar(8000),

     @chrDelim char(1)

    )

    RETURNS

    @Results TABLE

    (

     SeqNo int identity(1, 1),

     Item varchar(8000)

    )

    AS

    BEGIN

     INSERT INTO @Results (Item)

     SELECT SUBSTRING(@vcrArray, n, CHARINDEX(@chrDelim, @vcrArray + @chrDelim, n) - n)

     FROM Numbers

     where

     n <= LEN(@vcrArray)

     AND SUBSTRING(@chrDelim + @vcrArray, n, 1) = @chrDelim

     ORDER BY n

     RETURN

    END

    go

  • Create Function dbo.udf_STRIP_HTML

    (

    @Dirty nvarchar(4000)

    )

    returns nvarchar(4000)

    as

    begin

    declare @Start int,

    @End int,

    @Length int

    while charIndex(' 0 And charIndex('>', @Dirty, charIndex(' 0

    begin

    select @Start = charIndex('', @Dirty, charIndex(' 0

    begin

    select @Dirty = stuff(@Dirty, @Start, @Length, '')

    end

    end

    return @Dirty

    end

  • Jesper and Rob,

    I didn't study the SQL provided in your posts, but it seems very similar to the code I wrote this morning when I first saw this question. However, I saw that it has certain limitations and didn't post it... if the text contains the character "<" or ">" in a different meaning, like for example:

    <br><b>When input < 0, display error.</b>Else continue.<br>

    it considers the < (from < 0) as a start tag and result is "When input Else continue." That is obviously incorrect... As I see, Rob's SQL comes with the same incorrect result; not sure about Jesper's, but I suspect that it would be the same.

    As I see it, there is no other way than to use something that really knows all the HTML tags and can differentiate between tags and less than/greater than comparisions - which is probably the extended procedure mentioned before in one of the first replies.

  • <br><b>When input < 0, display error.</b>Else continue.<br>

    is incorrect HTML (or at least incorrect XHTML). You should use

    <br><b>When input &lt; 0, display error.</b>Else continue.<br>

     

    Rob has presented a simple and straightforward solution, which is easier to read and maintain. However, I think my suggestion is better when it comes to performance....

    I think we need some feedback from ms. Wright...

  • OK, I'm not that experienced in HTML so I better supposed that such possibility (< or > inside the text) exists. If that does not happen, everything is fine. I just checked my code with Rob's and it is very similar:

    CREATE FUNCTION dbo.remove_tags (@text varchar(1024))

    RETURNS VARCHAR(1024)

    AS

    BEGIN

    WHILE CHARINDEX('<',@text) > 0 AND CHARINDEX('>',@text) > 1

     BEGIN

     SELECT @text = SUBSTRING(@text, 0, CHARINDEX('<',@text))+SUBSTRING(@text,CHARINDEX('>',@text)+1,LEN(@text))

     END

    RETURN @text

    END

    As I said, it ONLY works if the characters < and > are used exclusively to mark start and end of a tag. As soon as this can not be guaranteed, it will run into serious problems... Most of them can be avoided with the following check - everything with the exception of a text that contains both < and > not meant as tags, and only if it is in this order, will be either processed correctly or return the string 'ERROR'. Only texts like '<br>A < 21 and B > 99<br>' will return incorrect result, without any warning. I don't think this can be avoided in any way, unless you explicitly name the valid tags.

    ALTER FUNCTION dbo.remove_tags (@text varchar(1024))

    RETURNS VARCHAR(1024)

    AS

    BEGIN

    WHILE CHARINDEX('<',@text) > 0 AND CHARINDEX('>',@text) > 1

     BEGIN

     IF CHARINDEX('>',@text)< CHARINDEX('<',@text)

      OR CHARINDEX('<',SUBSTRING(@text,CHARINDEX('<',@text)+1,CHARINDEX('>',@text)-CHARINDEX('<',@text)))> 0

     SELECT @text = 'ERROR'

     ELSE SELECT @text = SUBSTRING(@text, 0, CHARINDEX('<',@text))+SUBSTRING(@text,CHARINDEX('>',@text)+1,LEN(@text))

     END

    RETURN @text

    END

    BTW Rob, I like your idea to use STUFF with empty string... it didn't occur to me, so I stuck to classical solution with substrings.

  • as Vladan identified, malformed HTML will result in an unwanted result from regular expressions.

    here i used the extended stored proc i mentioned before:

     

    declare @string varchar(8000),@regExpr varchar(500)

    set @string='<OL>

    <LI>

    <DIV align=center>Store 1151 has a USB only Dell <BR>The <U><EM><STRONG><FONT color=#ff6633>New KVM is PS2 only</FONT></STRONG></EM></U> <BR><BR>They need a replacment KVM that is USB. <BR><BR><A href="http://www.yahoo.com"><FONT style="BACKGROUND-COLOR: #0066ff">Currently using two monitors</FONT></A> and keyboards. <BR><BR><BR></DIV></LI></OL>

    '

    set @regExpr='<(.|\n)*?>'

    select dbo.fn_regex_replace(@string,@regExpr,'') as results

    set @string='<br><b>When input < 0, display error.</b>Else continue.<br>'

    select dbo.fn_regex_replace(@string,@regExpr,'') as results

    results

    Store 1151 has a USB only Dell The New KVM is PS2 only They need a replacment KVM that is USB. Currently using two monitors and keyboards.

    results

    When input Else continue.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • My suggestion transforms

    '<br><b>When input < 0, display error.</b>Else continue.<br>'

    to

    'When input  0, display error.Else continue.'

     

    And

    '<br><b>When input > 0, display error.</b>Else continue.<br>'

    is actually transformed to

    'When input > 0, display error.Else continue.'

     

    But this extended stored procedure thing is interesting... What do I need to get this fn_regex_replace thing working?

  • it's godawful easy; download the source fromt eh web link i posted.

    copy the dll from the release folder to your c:\program files\microsoft sql server\$instancename\Binn directory.

    run the .SQL script that is included in the source.

    this is one of those tools that you only appreciate after you start using it; doing things like cleaning update for addresses, stripping RTF or HTML tags are just a few of the many uses i've used it for

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks, I will give it a try...

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply