November 8, 2004 at 1:25 pm
Hi
I have a column that stores raw html for display in an applet. I would like to parse the contents of this column for entry into another column, so that I can full text index the data and not the html tags. A routine that stripped all content from an '<' char to a '>' char, replacing it with a space, should work as the data shouldn't contain those characters. The data is stored in a text column.
Could someone point me in the right direction?
Thanks
November 9, 2004 at 3:25 am
Hello Greg,
the easiest way to make sure the full-text-indexed column is up to date is a trigger that is executed on every insert or update operation of your table. As I don't know the table structure you are using, I assume that we have a table called "test" with a primary key (id) which is needed for the update within the trigger. Plus we have the column with the original HTML code (htmlorig) and a column to take the results free of HTML tags (content). You will have to rename these according to your needs.
This is how to create the trigger:
~~~~~~~~~~~~~~~~~~
CREATE TRIGGER TR_INS_UPD_TEST ON test FOR INSERT, UPDATE
AS
-- only "listen" to updates of column htmlorig
IF UPDATE(htmlorig)
BEGIN
DECLARE @htmlorig nvarchar(255)
DECLARE @lpos_start int
DECLARE @lpos_finish int
DECLARE @id int
SELECT @htmlorig = htmlorig,
@id = id
FROM inserted
-- find HTML tag by getting position of next '<' and subsequent '>' char
SELECT @lpos_start = charindex(N'<', @htmlorig)
WHILE (@lpos_start > 0)
BEGIN
IF (@lpos_start > 0)
BEGIN
SELECT @lpos_finish = charindex(N'>', @htmlorig, @lpos_start + 1)
IF (@lpos_finish > 0)
-- found an opening and a closing bracket. Remove the whole tag
-- and replace it with a space
BEGIN
SELECT @htmlorig = substring(@htmlorig, 1, @lpos_start - 1)
+ N' '
+ substring(@htmlorig, @lpos_finish + 1, len(@htmlorig) - @lpos_finish)
END
END
-- get next position of '<' char, if any
SELECT @lpos_start = charindex('<', @htmlorig)
END
-- done, @htmlorig now contains the tag-free expression.
-- update the table's content column with the expression
UPDATE test SET content = @htmlorig WHERE id = @id
END
~~~~~~~~~~~~~~~~~~
Please keep in mind that this might kill your application's performance, especially if you update several rows at a time. The trigger will be fired automatically by insert or update commands, wherever they come from. You can comfortably check the operation with the Enterprise Manager by setting new values on the "htmlorig" column. You will see that on leaving the row, the "content" column is updated presently.
November 9, 2004 at 12:25 pm
Hi Johannes
Thanks for your reply.
I am not sure this will work though, as the html column is a text column and can contain more than 8000 characters, so a cast to varchar, or nvarchar won't work.
The data is being converted from another db in an overnight batch process, so performance is not a big worry.
Greg
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply