Stop HTML matches with CONTAINSTABLE

  • Hi

    I have tried to find solution for my CONTAINSTABLE problem from Google without any luck, maybe you guys could help me out a bit.

    I have a Query which uses CONTAINSTABLE(tblA(Field1, Field2),'somevalue')

    Field1 is VARCHAR(150) and Field2 is TEXT

    Problem is that Field2 often contains HTML which should not match with search criteria.

    For example, if searched value is 'times', query will match all HTML tags which are setting font to Times.

    Is there a way to tell to FTS that field content is HTML and avoid matches with html tag attributes??

    Thanks.

  • can you add another column of data which contains only the non-HTML data, and fulltext against that instead?

    I've got this old 2000 proc to strip html out via TSQL, but it's faster to use regualr expressions. you could change to varcharmax and add a new column with the nom-html data instead.

    CREATE FUNCTION dbo.fnStripHtmlTags (@Text AS VARCHAR(8000))

    RETURNS VARCHAR(8000)

    AS

    BEGIN

     DECLARE @Return AS VARCHAR(8000)

     DECLARE @StartPos AS SMALLINT

     DECLARE @Loops AS SMALLINT

     SET @Loops = LEN(@Text) - LEN (REPLACE(@Text, '<', ''))

     WHILE @Loops > 0

     BEGIN

      SET @StartPos = CHARINDEX ('<', @Text, 1)

      SET @Text = STUFF(@Text, @StartPos, CHARINDEX ('>', @Text, @StartPos + 1) - @StartPos + 1, '')

      SET @Loops = @Loops - 1

     END

     SET @Return = @Text

     RETURN @Return

    END

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Incorrect solution.

    To full-text index data that is NOT "plain text" (e.g., HTML, XML, Word, PowerPoint, PDF, etc.) you have to configure the indexer to use a filter.  The available filters, with their associated file (content type) extensions, are contained in the sys.fulltext_document_types system view.  You will see that entries for HTML are already present.

    The data to be indexed needs to be stored in a VARBINARY(MAX), formally IMAGE, column with a companion column that has the document type.  The configuration can be done via the older sp_fulltext_column procedure or the newer ALTER FULLTEXT INDEX command.  See the BOL for command details as well as a number of topics related to this.

    Also see Hilary Cotter's extensive posing on full-text in SQL Server at: http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features/

    When the data is being indexed, the filter will remove all of the surrounding stuff and just index the words.

    Works just fine as we're indexing Microsoft Word, Microsoft PowerPoint and PDF documents.

     

     


    [font="Arial Narrow"](PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.[/font]

  • Well our problem is that iFilter can't be changed for TEXT column types.

    As Hilary Cotter's web page states:

    For columns of the CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, and NTEXT data types the indexing engine applies the text iFilter. You can't override this iFilter.

    I wonder if there's some way to configure text iFilter to handle html tags

  • Suggestion: Store a copy of the HTML in a VARBINARY(MAX) column and then full-text index that.  Use this new column for the search operations (to find the row) and then use the text column for the results.


    [font="Arial Narrow"](PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.[/font]

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply