HTML Parsing

  • Hi

    I have a column that stores raw html for display in an applet. I would like to parse the contents of this column for entry into another column, so that I can full text index the data and not the html tags. A routine that stripped all content from an '<' char to a '>' char, replacing it with a space, should work as the data shouldn't contain those characters. The data is stored in a text column.

    Could someone point me in the right direction?

    Thanks

  • Hello Greg,

    the easiest way to make sure the full-text-indexed column is up to date is a trigger that is executed on every insert or update operation of your table. As I don't know the table structure you are using, I assume that we have a table called "test" with a primary key (id) which is needed for the update within the trigger. Plus we have the column with the original HTML code (htmlorig) and a column to take the results free of HTML tags (content). You will have to rename these according to your needs.

    This is how to create the trigger:

    ~~~~~~~~~~~~~~~~~~

    CREATE TRIGGER TR_INS_UPD_TEST ON test FOR INSERT, UPDATE

    AS

      -- only "listen" to updates of column htmlorig

      IF UPDATE(htmlorig)

        BEGIN

          DECLARE @htmlorig nvarchar(255)

          DECLARE @lpos_start int

          DECLARE @lpos_finish int

          DECLARE @id int

          SELECT @htmlorig = htmlorig,

                 @id = id

            FROM inserted

          -- find HTML tag by getting position of next '<' and subsequent '>' char

          SELECT @lpos_start = charindex(N'<', @htmlorig)

          WHILE (@lpos_start > 0) 

            BEGIN

              IF (@lpos_start > 0)

                BEGIN

                  SELECT @lpos_finish = charindex(N'>', @htmlorig, @lpos_start + 1)

                  IF (@lpos_finish > 0)

                    -- found an opening and a closing bracket. Remove the whole tag

                    -- and replace it with a space

                    BEGIN

                      SELECT @htmlorig = substring(@htmlorig, 1, @lpos_start - 1)

                                       + N' '

                                       + substring(@htmlorig, @lpos_finish + 1, len(@htmlorig) - @lpos_finish)

                    END

                END

              -- get next position of '<' char, if any

              SELECT @lpos_start = charindex('<', @htmlorig)

            END

          -- done, @htmlorig now contains the tag-free expression.

          -- update the table's content column with the expression

          UPDATE test SET content = @htmlorig WHERE id = @id

        END

    ~~~~~~~~~~~~~~~~~~

    Please keep in mind that this might kill your application's performance, especially if you update several rows at a time. The trigger will be fired automatically by insert or update commands, wherever they come from. You can comfortably check the operation with the Enterprise Manager by setting new values on the "htmlorig" column. You will see that on leaving the row, the "content" column is updated presently.

  • Hi Johannes

    Thanks for your reply.

    I am not sure this will work though, as the html column is a text column and can contain more than 8000 characters, so a cast to varchar, or nvarchar won't work.

    The data is being converted from another db in an overnight batch process, so performance is not a big worry.

    Greg

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply