Extract text

  • Hi

    I have a feild in which the data is being saved along with the HTML tags.Now I have to extract the text from this field.

    The data will be like this

    <STRONG><FONT color=#ff0000>Test update</FONT></STRONG>

    Now i need to extract "Test Update" from this.

     

     

     

     

     

  • What makes you think that T-SQL is the best tool to read HTML?

    _____________
    Code for TallyGenerator

  • Well the developers have saved the data like this.Now they just need to extract the text out of this.

  • It's a problem of those developers.

    Let them read the text as they saved it and do with it whatever they want in their application.

    Database is DATA-BASE.

    It's for saving data, not rubbish.

    HTML text should not be allowed in database at all.

    _____________
    Code for TallyGenerator

  • Arun

    Sergiy is quite right, of course.  Now, to answer your question... the following should work.  Somebody might be able to think of something more concise.  It should work no matter how many HTML tags you have, although I've only tried it on the example you gave.

    John

    declare

    @htmlstring varchar(256)

    set

    @htmlstring = '<STRONG><FONT color=#ff0000>Test update</FONT></STRONG>'

    set

    @htmlstring = left(@htmlstring, charindex('</', @htmlstring) - 1)

    set

    @htmlstring = reverse(@htmlstring)

    set

    @htmlstring = left(@htmlstring, charindex('>', @htmlstring) - 1)

    set

    @htmlstring = reverse(@htmlstring)

    select

    @htmlstring

  • What happens if there are more than one line of text to extract in a single row?!?!

  • Ok, this ain't set based... but I don't have time to rewrite it at the moment.  It'll at least give you another option if needed.

    CREATE FUNCTION dbo.fnStripHtmlTags (@Text AS VARCHAR(8000))

    RETURNS VARCHAR(8000)

    AS

    BEGIN

     DECLARE @Return AS VARCHAR(8000)

     DECLARE @StartPos AS SMALLINT

     DECLARE @Loops AS SMALLINT

     SET @Loops = LEN(@Text) - LEN (REPLACE(@Text, '<', ''))

     WHILE @Loops > 0

     BEGIN

      SET @StartPos = CHARINDEX ('<', @Text, 1)

      SET @Text = STUFF(@Text, @StartPos, CHARINDEX ('>', @Text, @StartPos + 1) - @StartPos + 1, '') 

      SET @Loops = @Loops - 1

     END

     SET @Return = @Text

     RETURN @Return

    END

     

    But as everyone else said and will say.  save the data correctly in the database, or let 'em take the trash out.

     

    Garbage IN, Garbage OUT!

  • i'd just delete everything that is in between the tags <.....>, ending up with the targeted text. One function call would do it.

  • Isn't it what my function do?

  • sorry, wasn't attentive enough..yes your function does the same thing..

  • ... but if I had to rebuild me dunction it would use John Mitchell's approach... would be much more efficiant that way.

  • being the official RegEx Junkie

    of the forum, i'll post regular expression's solution for

    this cleaning task

    (for SQL 2005):

    using pre

    -compiled CLR scalar-valued function dbo.Regex_Replace, which takes 2 params: 1) string to clean

    2

    ) regular expression pattern

    and outputs the cleaned string

     

    Create

    table X

    (

    myid int identity,

    mystring

    varchar(1000) not null)

    GO

    Insert

    into dbo.X (mystring)

    Select

    '<STRONG><FONT color=#ff0000>Test update</FONT></STRONG><STRONG><FONT color=#ff0000>Test update</FONT></STRONG>'

    --do cleaning here

    select

    mystring, dean.dbo.Regex_Replace(mystring, '<[^<>]*>', '') from dbo.X

    --rendering the input:

    <

    STRONG><FONT color=#ff0000>Test update</FONT></STRONG><STRONG><FONT color=#ff0000>Test update</FONT></STRONG>

    --to the cleaned form:

    Test updateTest

    update

     

    C# code

    for the CLR function: [compiled in NET 2.0]

    using System

    ;

    using System

    .Data;

    using System

    .Data.SqlClient;

    using System

    .Data.SqlTypes;

    using Microsoft

    .SqlServer.Server;

    using System

    .Text.RegularExpressions;

    public

    partial class UserDefinedFunctions

    {

    [Microsoft.SqlServer.Server.SqlFunction]

    public static SqlString Regex_Replace(String s, String match_pattern, String replace_pattern)

    {

    // Put your code here

    Regex r

    = new Regex(match_pattern, RegexOptions.Multiline|RegexOptions.IgnoreCase);

    return r.Replace(s, replace_pattern).ToString();

    }

    };

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply