Extracting text from html stored in SQL table!

  • Hi All,

    I have a column in one of SQL2000 table which has dhtml data stored froom an old ASP application. As we are migrating to n SQL and .net we want to extract the current data from Production table. Any one of you aware how I can extract the text from the dhtml code stored as text in the SQL.

    Please refer to sample data attached.

    Thanks,

    Nikhil

  • Nikhil,

    What text exactly do you want to 'extract' from this? The entire database column? If that is the case, you could simply read the column data as text and write it out to your new database. Or are you looking to extract certain pieces of what would be the text displayed in the browser when this 'page' was viewed? If so, which pieces?

    Rob Schripsema
    Propack, Inc.

  • Please follow the following steps:

    1. Get data from the database using the ASP code

    2. REmove the HTML tags from the Data retrieved

    Refer: http://www.codeproject.com/KB/asp/removehtml.aspx

  • Looks like you have some info from the CodeProject site on using Regular Expressions or straight parsing algorithms to extract the text you're looking for. My recommendation would be to write a CLR function (using VB or C#) in SQL that will strip off the HTML tags (per the CodeProject article) and return just the text. Then you can call that function from a SELECT statement, pass the DHTML column through the function, and write the output to a new table or wherever you need to go with it.

    Not sure if that's the kind of help you were looking for here, but it looks like you're most of the way there already.;-)

    Rob Schripsema
    Propack, Inc.

  • I dont think CLR could be used here as an approach. Though it is one of the options.

    1. The enviornment currently used in legacy system where ASP and Sql server 2000 is used.

    2. SQL server 2000 as per my understanding doesnt all SQLCLR programming , besides it will require .Net scripting skills I am not sure if resources could be available.

    3. Besides to use SQLCLR you have to evaluate in terms of memory usage and performance as it works directly with SQL Server. In case of any memory leak your database could also crash. So before using it compartively lot of testing is needed

  • Just give it a try...

    -- To get the list of Tags removed....

    Declare @vTagNameTable Table(Tag varchar(max))

    -- My HTML

    Declare @v-2 varchar(max)

    Set @v-2 = '<html DIR="LTR" xmlns:tool="http://www.microsoft.com/tooltip" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ddue="http://ddue.schemas.microsoft.com/authoring/2003/5" xmlns:MSHelp="http://msdn.microsoft.com/mshelp"><A>Atif Sheikh < Atif

    <Bold>Sheikh</Bold> </HTML> Atif <><><>>'

    Declare @vCharI1 int -- Used to locate '<'

    Declare @vCharI2 int -- Used to locate '>' to get Tag Name

    Declare @vCharICheck int -- Used to Check for Tag Name

    Declare @vEndTagName varchar(1000)

    Declare @vStartPos int

    Set @vStartPos = 0

    While 1=1

    begin

    Set @vCharI1 = CharIndex('<',@v,@vStartPos)

    if @vCharI1 > 0

    begin

    -- Get the Tag Name, if it is a HTML Tag

    Set @vCharI2 = CharIndex('>',@v,@vCharI1 + 1)

    if @vCharI2 > 0

    begin

    -- Check <>

    if @vCharI2 = @vCharI1 + 1

    begin

    Set @vStartPos = @vCharI2 + 1

    end

    else

    begin

    Set @vCharICheck = CharIndex('<',@v,@vCharI1 + 1)

    if (@vCharICheck < @vCharI2)

    begin

    -- Last < found was not start of tag. Just < character / sign

    if @vCharICheck > 0

    Set @vStartPos = @vCharICheck

    else

    begin

    -- @vCharICheck = 0

    -- Its a Last TAG...

    Set @vEndTagName = SubString(@v,@vCharI1,(@vCharI2-@vCharI1) + 1)

    if Not Exists(Select 1 from @vTagNameTable where Tag = @vEndTagName)

    begin

    Insert into @vTagNameTable

    Select @vEndTagName

    Set @v-2 = Replace(@v,@vEndTagName,'')

    end

    Set @vStartPos = @vCharI2

    end

    if @vStartPos = Len(@v)

    begin

    Select @vStartPos , Len(@v),@vCharICheck,@vCharI1

    print 'Over Here....'

    BREAK

    end

    end

    else

    begin

    -- ITS A TAG....

    Set @vEndTagName = SubString(@v,@vCharI1,(@vCharI2-@vCharI1) + 1)

    if Not Exists(Select 1 from @vTagNameTable where Tag = @vEndTagName)

    begin

    Insert into @vTagNameTable

    Select @vEndTagName

    Set @v-2 = Replace(@v,@vEndTagName,'')

    end

    --Set @vStartPos = @vCharI2 + 1

    Set @vStartPos = @vCharI1

    if @vStartPos > Len(@v)

    begin

    print 'Here....'

    BREAK

    end

    end

    end

    end

    else

    BREAK -- No Tag

    end

    else

    BREAK

    end

    Select * from @vTagNameTable

    Select @v-2 as StringWithoutTags

    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sometimes, winning is not an issue but trying.
    You can check my BLOG
    [font="Arial Black"]here[/font][/url][/right]

  • I used the below script and I am not getting clean text.

    I want to extract the whole text which a browser would display for my dhtml code. The below script is giving some unrecognised characters. I am running short of time for other things, so was not able to look into the issue and posts. Thanks for all the replies Guys!

    USE [OPINIKHIL]

    GO

    /****** Object: StoredProcedure [dbo].[udf_StripHTML] Script Date: 06/10/2010 10:22:42 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE procedure [dbo].[udf_StripHTML]

    (@HTMLText varchar(8000))

    --RETURNS varchar(8000)

    AS

    BEGIN

    DECLARE @Start int

    DECLARE @End int

    DECLARE @Length int

    -- Replace the HTML entity & with the '&' character (this needs to be done first, as

    -- '&' might be double encoded as '&')

    SET @Start = CHARINDEX('&am;p;', @HTMLText)

    SET @End = @Start + 4

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = @Start + 4

    SET @Length = (@End - @Start) + 1

    END

    -- Replace the HTML entity < with the '<' character

    SET @Start = CHARINDEX('<', @HTMLText)

    SET @End = @Start + 3

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')

    SET @Start = CHARINDEX('<', @HTMLText)

    SET @End = @Start + 3

    SET @Length = (@End - @Start) + 1

    END

    -- Replace the HTML entity > with the '>' character

    SET @Start = CHARINDEX('>', @HTMLText)

    SET @End = @Start + 3

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')

    SET @Start = CHARINDEX('>', @HTMLText)

    SET @End = @Start + 3

    SET @Length = (@End - @Start) + 1

    END

    -- Replace the HTML entity & with the '&' character

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = @Start + 4

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = @Start + 4

    SET @Length = (@End - @Start) + 1

    END

    -- Replace the HTML entity with the ' ' character

    SET @Start = CHARINDEX('', @HTMLText)

    SET @End = @Start + 5

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')

    SET @Start = CHARINDEX('', @HTMLText)

    SET @End = @Start + 5

    SET @Length = (@End - @Start) + 1

    END

    -- Replace any

    tags with a newline

    SET @Start = CHARINDEX('

    ', @HTMLText)

    SET @End = @Start + 3

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))

    SET @Start = CHARINDEX('

    ', @HTMLText)

    SET @End = @Start + 3

    SET @Length = (@End - @Start) + 1

    END

    -- Replace any

    tags with a newline

    SET @Start = CHARINDEX('

    ', @HTMLText)

    SET @End = @Start + 4

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')

    SET @Start = CHARINDEX('

    ', @HTMLText)

    SET @End = @Start + 4

    SET @Length = (@End - @Start) + 1

    END

    -- Replace any

    tags with a newline

    SET @Start = CHARINDEX('

    ', @HTMLText)

    SET @End = @Start + 5

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')

    SET @Start = CHARINDEX('

    ', @HTMLText)

    SET @End = @Start + 5

    SET @Length = (@End - @Start) + 1

    END

    -- Remove anything between <whatever> tags

    SET @Start = CHARINDEX('<', @HTMLText)

    SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')

    SET @Start = CHARINDEX('<', @HTMLText)

    SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))

    SET @Length = (@End - @Start) + 1

    END

    --select LTRIM(RTRIM(@HTMLText))

    insert into nikhil values( LTRIM(RTRIM(@HTMLText)))

    --RETURN LTRIM(RTRIM(@HTMLText))

    END

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply