Stripping HTML tags from data

  • I have a text field that is written to by a propritary database. In addition to the data I want, it writes a lot of HTML tags as well, since it uses those to display this field in a pretty format in the application. I want to remove all HTML tags.

    I've been playing with this all morning, but my methods aren't working very well. Any ideas? Every single piece of text that I want to strip out is in <>s.

    Thanks!

  • This is one good way.  However there's a faster way if you have only 1 text information per row. 

     

    CREATE FUNCTION dbo.fnStripHtmlTags (@Text AS VARCHAR(8000))

    RETURNS VARCHAR(8000)

    AS

    BEGIN

     DECLARE @Return AS VARCHAR(8000)

     DECLARE @StartPos AS SMALLINT

     DECLARE @Loops AS SMALLINT

     SET @Loops = LEN(@Text) - LEN (REPLACE(@Text, '<', ''))

     WHILE @Loops > 0

     BEGIN

      SET @StartPos = CHARINDEX ('<', @Text, 1)

      SET @Text = STUFF(@Text, @StartPos, CHARINDEX ('>', @Text, @StartPos + 1) - @StartPos + 1, '') 

      SET @Loops = @Loops - 1

     END

     SET @Return = @Text

     RETURN @Return

    END

     

  • Wow - that did just what I needed and worked amazingly well.

    I am in awe!

    And even better, I think I understand it.  (I'm a total n00b).

    Thanks much!

  • Heh... my turn Remi... how come no Tally table here?  "I'm surprised you didn't think of it"

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Old code... didn't have time to rethink and rewrite.  But I'm sure you already have the "good" answer ready so what not just post it already !

  • I now remember why I don't have a set based version for this one... I just can't beat the loop into submission .  This is the closest I'm come to matching the speeds, but this version takes about 30% longer to run than the loop version!

     

    IF EXISTS (SELECT * FROM dbo.SysObjects WHERE name = 'fnStripHtmlTags_SET4' AND XType = 'fn' AND USER_NAME(uid) = 'dbo')

     DROP FUNCTION dbo.fnStripHtmlTags_SET4

    GO

    --this code assumes that more than one section of text can be found for each line.

    CREATE FUNCTION dbo.fnStripHtmlTags_SET4 (@Text AS VARCHAR(8000))

    RETURNS VARCHAR(8000)

    AS

    BEGIN

     DECLARE @Loops AS SMALLINT

     SET @Loops = LEN(@Text) - LEN (REPLACE(@Text, '<', ''))

     --SELECT  @Text = CASE WHEN CHARINDEX ('<', @Text, 1) * CHARINDEX ('>', @Text, 1) > 0

     SELECT   @Text = STUFF(@Text, CHARINDEX ('<', @Text, 1), CHARINDEX ('>', @Text, 1) - CHARINDEX ('<', @Text, 1) + 1, '')

     FROM   dbo.Numbers

     WHERE   PkNumber <= @Loops

      

     RETURN @Text

    END

    GO

     

    Those just don't even come close to it (was just having fun at that point ) :

    IF EXISTS (SELECT * FROM dbo.SysObjects WHERE name = 'fnStripHtmlTags_SET' AND XType = 'fn' AND USER_NAME(uid) = 'dbo')

     DROP FUNCTION dbo.fnStripHtmlTags_SET

    GO

    --this code assumes that more than one section of text can be found for each line.

    CREATE FUNCTION dbo.fnStripHtmlTags_SET (@Text AS VARCHAR(8000))

    RETURNS VARCHAR(8000)

    AS

    BEGIN

     DECLARE @Length AS SMALLINT

     SET @Length = LEN(@Text)

     SELECT

       @Text = STUFF(@Text, dtMap.StartPos, dtMap.EndPos - dtMap.StartPos, '')

     FROM

       (

       SELECT  TOP 100 PERCENT

           MAX(dtStart.PkNumber) AS StartPos

         , dtEnd.PkNumber AS EndPos

       FROM   (

         SELECT   PkNumber

         FROM   dbo.Numbers

         WHERE   CHARINDEX ('<', @Text, PkNumber) = PkNumber

           AND PkNumber <= @Length

        &nbsp dtStart

         INNER JOIN

         (

         SELECT   PkNumber + 1 AS PkNumber

         FROM   dbo.Numbers

         WHERE   CHARINDEX ('>', @Text, PkNumber) = PkNumber

           AND PkNumber <= @Length

        &nbsp dtEnd

          ON dtStart.PkNumber < dtEnd.PkNumber

       GROUP BY   dtEnd.PkNumber

       ORDER BY  EndPos DESC

      &nbsp dtMap

     RETURN @Text

    END

    IF EXISTS (SELECT * FROM dbo.SysObjects WHERE name = 'fnStripHtmlTags_SET2' AND XType = 'fn' AND USER_NAME(uid) = 'dbo')

     DROP FUNCTION dbo.fnStripHtmlTags_SET2

    GO

    --this code assumes that more than one section of text can be found for each line.

    CREATE FUNCTION dbo.fnStripHtmlTags_SET2 (@Text AS VARCHAR(8000))

    RETURNS VARCHAR(8000)

    AS

    BEGIN

     DECLARE @Length AS SMALLINT

     SET @Length = LEN(@Text)

     SELECT

       @Text = STUFF(@Text, dtMap.StartPos, dtMap.EndPos - dtMap.StartPos, '')

     FROM

       (

       SELECT    TOP 100 PERCENT

           PkNumber AS StartPos

         , CHARINDEX ('>', @Text, PkNumber) AS EndPos

       FROM   dbo.Numbers

       WHERE   CHARINDEX ('<', @Text, PkNumber) = PkNumber

         AND PkNumber <= @Length

       ORDER BY  EndPos DESC

      &nbsp dtMap

     RETURN @Text

    END

    GO

    IF EXISTS (SELECT * FROM dbo.SysObjects WHERE name = 'fnStripHtmlTags_SET3' AND XType = 'fn' AND USER_NAME(uid) = 'dbo')

     DROP FUNCTION dbo.fnStripHtmlTags_SET3

    GO

    --this code assumes that more than one section of text can be found for each line.

    CREATE FUNCTION dbo.fnStripHtmlTags_SET3 (@Text AS VARCHAR(8000))

    RETURNS VARCHAR(8000)

    AS

    BEGIN

     DECLARE @Length AS SMALLINT

     SET @Length = LEN(@Text)

     SELECT

       @Text = STUFF(@Text, dtMap.StartPos, dtMap.EndPos - dtMap.StartPos, '')

     FROM

       (

       SELECT    TOP 100 PERCENT

           PkNumber AS StartPos

         , CASE WHEN CHARINDEX ('<', @Text, PkNumber) = PkNumber THEN CHARINDEX ('>', @Text, PkNumber) ELSE NULL END AS EndPos

       FROM   dbo.Numbers

       WHERE   CHARINDEX ('<', @Text, PkNumber) = PkNumber

         AND PkNumber <= @Length

       ORDER BY  EndPos DESC

      &nbsp dtMap

     RETURN @Text

    END

    GO

     

     

    PS I don't have a create testtable script because I loaded a big html page into a table.  I assume you'll be able to find one of those .

  • should u choose regular expressions route for this task, then match <..> substrings with

    <[^<>]*>

    and then replace the matches with

    [empty string]

    thus eliminating them from the original string

  • I have no idea of what you are trying to say... can you show a working example with this data?

     

    DECLARE @x AS VARCHAR(8000)

    SET @x = '  <table>test3<font>sometext</font>tex2</table>'

    SELECT PATINDEX ('<[^<>]*>', @x)

  • I think that's just meant to be some strange gremlin emoticon.

  • Maybe but I still don't get his idea.

    I can manually replace the occurance of "<" to ">" to something else but I need to know how to do it faster.  Is there any function that acts as a combinaison of patindex and stuff where the whole pattern could be replaced to something else??

     

    If noone can offer something more, then I guess this is one of those exceptions where set based can't even be as fast as procedural.

  • Try this:

    DECLARE @Text varchar(8000)

    SET @Text = '"X<5"'

    SELECT dbo.fnStripHtmlTags (@Text)

    Remove space in "< Condition".

    _____________
    Code for TallyGenerator

  • David I:

     

    It is not a gremlin emoticon. It is a regular expression pattern saying:

     

    - match *<* then a series of characters [0 to N in count] which are NOT *<* or *>*, then match *>*

     

    i.e. exactly what original poster had in mind.

  • Remi,

     

    With <[^<>]*>, I just gave a generic regular expression for matching a generic <…>  tag.

     

    Actual implementation of the Regex will be different in SQL Server 2005 and SQL Server 2000.

    In SQL Server 2000 that doesn’t have native Regex support, one can resort to Server 2000 Extended Stored Procedure

     

    http://www.codeproject.com/managedcpp/xpregex.asp

     

    to be able do match/replace operation using  xp_regex_replace procedure (cf. the link)

     

    In SQL Server 2005 u can create (to be cont.).....

     

  • I'm pretty sure I was kidding when I posted that but it was so long ago that I forget

  • For normal simple mind people (like me, for example) everyone who understands regular expressions so well is kind of gremlin.

    And you are outrageous in it.

    _____________
    Code for TallyGenerator

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply