Stripping HTML tags from data

  • I just realised that a task like this has already been discussed:

    http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=8&messageid=320575&p=2

    and I posted some code for SQL Server 2005 there.

    Once the CLR function dbo.Regex_Replace is compiled in .NET, all you need is one function call to do the cleaning:

    select mystring, dbo.Regex_Replace(mystring, '<[^<>]*>', '')  as myCleanString from dbo.X
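
For readers without the CLR function at hand, the behavior of that pattern can be tried outside SQL Server. The following is a minimal sketch in Python, assuming that Python's re module treats this simple pattern the same way the .NET regex engine does; the name strip_tags is hypothetical and only stands in for the dbo.Regex_Replace call.

```python
import re

# Same pattern as in the T-SQL call above: "<", any run of non-bracket characters, ">".
TAG_PATTERN = re.compile(r"<[^<>]*>")


def strip_tags(text: str) -> str:
    """Remove every <...> span, mirroring Regex_Replace(mystring, '<[^<>]*>', '')."""
    return TAG_PATTERN.sub("", text)


print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world
```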

  • Sergiy,

    *normal simple mind people* is the last notion that comes to my mind when I read your posts. I enjoy them a lot though.

    Stop by the regex site where I hang around:

    http://regexadvice.com/forums/68/ShowForum.aspx

    It could be helpful.

  • I don't have SQL 2005, so I cannot test your function against my example.

    What I'm trying to say is that this task is not just about replacing the text between "<" and ">", including these symbols, with an empty string.

    It's much more complex.

    You must find an opening tag, make sure it really is an opening tag, find the nearest corresponding closing tag, and only after that can you remove it.

    Otherwise your function will work like this site - removing all words surrounded by "<" and ">" even when they have nothing to do with HTML tags.
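
The objection above is easy to reproduce. This is a small sketch in Python (used here only as a stand-in for the CLR function, on the assumption that both regex engines handle this pattern identically); naive_strip is a hypothetical name. The input contains no HTML at all, yet the simple pattern still deletes part of it.

```python
import re


def naive_strip(text: str) -> str:
    # The simple pattern: anything between "<" and ">" is treated as a tag.
    return re.sub(r"<[^<>]*>", "", text)


# Plain text with comparison operators, no HTML tags anywhere.
sample = "keep rows where a < b and c > d"
print(repr(naive_strip(sample)))  # 'keep rows where a  d'
```

Everything from "<" to ">" is removed, so the words "b and c" are lost along with the two operators.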


  • You are absolutely right, Sergiy: that's exactly the procedure that must be followed. More complex regular expression patterns exist for that purpose. For example:

    match the opening tag <u>, then the text in between, then the closing tag </u>; then remove the matched tags.

    Moreover, there is a way to specify a generic tag in a pattern like the one above.
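
One way to express such a generic paired-tag pattern is with a backreference, so the closing tag must carry the same name as the opening one. The sketch below is in Python and is only an illustration, not the pattern the poster had in mind; the names PAIRED_TAG and unwrap_paired_tags are hypothetical. The same pattern should also be valid for the .NET engine, where the replacement would be written '$2' instead of '\2'.

```python
import re

# (\w+) captures the tag name and \b stops it from matching only a prefix of a
# longer name; \1 requires the closing tag to repeat that exact name.
PAIRED_TAG = re.compile(r"<(\w+)\b[^<>]*>([^<>]*)</\1>")


def unwrap_paired_tags(text: str) -> str:
    """Replace each matched <tag ...>text</tag> pair with just its inner text."""
    return PAIRED_TAG.sub(r"\2", text)


print(unwrap_paired_tags("<u>underlined</u> and a < b"))  # underlined and a < b
print(unwrap_paired_tags("<b>x</i>"))                     # <b>x</i>  (names differ, left alone)
```

Because the inner text may not contain "<" or ">", a single pass only unwraps the innermost pair of nested tags; repeated passes would be needed for deeper nesting.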

  • As an example, it's possible to turn the original text [from, say, field "mystring" in table dbo.X]

    <a href="my_target_texthttp://www.sergiy.org">my_target_text</a><some_other_not_targeted_tagged_text>

    to

    my_target_text<some_other_not_targeted_tagged_text>

    by using this CLR-based Regex function call:

    select mystring, dbo.Regex_Replace(mystring, '(<a[^<>]*&gt([^<>]*)(</a&gt', '$2')  as myCleanString from dbo.X

    In this call, only group 2 ($2) of the regex match is kept in the original text; the opening and closing tags around it are dropped.

  • Sorry, the pattern got messed up in my previous post [hopefully it works now]:

    select mystring, dbo.Regex_Replace(mystring, '(<a[^<>]*>)([^<>]*)(</a>)', '$2')  as myCleanString from dbo.X
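
The three-group pattern can be checked against the sample text from the earlier post. This is a sketch in Python rather than T-SQL, with the pattern written with no whitespace inside it, and it assumes the two regex engines agree on this pattern; ANCHOR and original are hypothetical names.

```python
import re

# Group 1: opening <a ...> tag, group 2: link text, group 3: closing </a>.
ANCHOR = re.compile(r"(<a[^<>]*>)([^<>]*)(</a>)")

original = ('<a href="my_target_texthttp://www.sergiy.org">my_target_text</a>'
            '<some_other_not_targeted_tagged_text>')

# .NET writes the replacement as '$2'; Python's re module writes it as r'\2'.
print(ANCHOR.sub(r"\2", original))
# my_target_text<some_other_not_targeted_tagged_text>
```

The untargeted <some_other_not_targeted_tagged_text> survives because it has no matching </a> to complete the pattern.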

  • Remi,

    Thought you might want to see the rest of the thread. Thanks.

    Sergei

  • I've been following it. I just had nothing to add. Regex is the only [simple] solution for this problem, and it is far from my area of expertise, so I have nothing else to say.
