Infuriating search problem

  • I'm developing a web-based application to search a SQL Server database containing keyworded data. The keywords for each row are stored in a nvarchar field simply as a list of words, but I need a search which will match up the keyword "natural" to searches for "nature" for example.

    I've tried: SELECT * FROM table WHERE FREETEXT(keywords, 'nature')

    And I've tried SELECT * FROM table WHERE CONTAINS(keywords, 'FORMSOF(INFLECTIONAL, nature)')

    Neither works and a search for nature returns 0 results. So what's the deal with this full-text indexing? Everything I see online says this should be possible, but I'm getting nowhere! Am I doing something wrong?

    Yes, I have tried other words than nature, by the way

  • This syntaxe seems to accomplish what you ar trying to do :

    Select *

    FROM dbo.Problemes

    WHERE CONTAINS(TitreProbleme, ' "te*" ')

  • But that's no use if I'm expecting the user to enter the search term.

  • Just use it like this :

    Declare @Search as varchar(50)

    set @Search = '"te*"'

    Select *

    FROM dbo.Problemes

    WHERE CONTAINS(TitreProbleme, @Search)

    Tell your users that * is the wildcard (can't be used on the front of the search). Then simply add a doublequote around their search text and you're set.

  • That's not gonna cut it. The freetext thing is an absolute requirement of the system. I think the conclusion here is that full text indexing really doesn't do what it claims to do and I will have to look elsewhere for a solution.

  • See if this helps: http://sqljunkies.com/WebLog/JT_Kane/

    or search here for John T. Kane. He is very specialised in all varieties of full text questions.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Sorry, this link might be better http://spaces.msn.com/members/jtkane/

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • I guess I'm not understanding your requirements because I don't see how this could not work for you. Can you send me some sample data from the data, the search criteria and the search results you expect?

  • It's a web site, so I need non-technical users to be able to just tap in a keyword and get results. It's impractical to ask the data-entry guys to come up with every single variation of a word they can think of when keywording the data (i.e. biology, biological, nature, natural) so I need a search that can figure out that these things are the same.

    What bothers me is that every website I've seen about full text indexing claims that FREETEXT and FORMSOF perform stemming on the search term, which would bring the word down to a common root and search for other words derived from that. Essentially doing what you describe with te* but without expecting the users, who are just casual web surfers, to do it for themselves.

    However, practical experimentation shows that this is clearly not the case. Quite what the purpose is of these predicates if they don't perform proper stemming I don't know. What also bothers me is why there is such a wealth of websites out there which all claim that it can do exactly that.

  • In my test case I was searching for the word "test" hence the criteria "te*". The only solution I see for this problem would be to chop the last 2-3 letters (or X percent) of your word and replace it with the wildcard. But I've never implemented what you are trying to acomplish so I can't help you any further than this.

    Good luck.

  • I found the Porter stemming algorithm (http://www.ils.unc.edu/keyes/java/porter/index.html) to be ideal for doing exactly that. It just infuriates me (hence the thread title) that I have to implement stemming in my code when Microsoft's documentation on the FREETEXT predicate claims that it performs stemming itself.

  • Hi Rob,

    I'm a bit late in this tread, and I thank Frank for pointing you to my blog, as I was able to trace back the referal to this tread and I hope I can answer some of your questions, or perhaps have you ask more questions...

    I see that you've found the Porter stemmer algorithm and while FREETEXT does do stemming, it is unknown at this time, if Microsoft uses the Porter stemmer, a variation or some other stemming algorthm. I've posted on this issue a couple of time in the fulltext newsgroup, and it would be helpful to know the version of SQL Server and what OS Platform you have installed it on. Could you post the full output of -- SELECT @@version -- as this would be most helpful.

    As the "devil is in the details", you can use an Indexing Service Platform SDK utiltity (LRTest.exe) to determine how exactly the OS-supplied and language-specific word breaker dll (infosoft.dll for US English) will stem and return varaitions on specific words. Using your example of two words - "nature" and "natural" on Windows Server 2003, you get the following results:

    Original text: 'nature'

    IWordFormSink:utAltWord: cwc 6, 'nature'

    IWordFormSink:utAltWord: cwc 8, 'nature's

    IWordFormSink:utAltWord: cwc 7, 'natures'

    IWordFormSink:utWord: cwc 8, 'natures''

    Original text: 'natural'

    IWordFormSink:utAltWord: cwc 7, 'natural'

    IWordFormSink:utAltWord: cwc 9, 'natural's'

    IWordFormSink:utAltWord: cwc 8, 'naturals'

    IWordFormSink:utWord: cwc 9, 'naturals''

    So, for Microsoft's OS-supplied stemming, the above two words are not related. However, as the linguistic stemming of English words is not an exact science, I also tested these two words with the Google synonym function "~" for 'nature'

    (http://www.google.com/search?&q=~nature) and got results for both

    nature and natural. However, when I subtracted out natural

    (http://www.google.com/search?&q=~nature+-natural), I still found a hit for Naturalist on the second page, so Google is most likely using a different stemming method for its synonym function.

    Furthermore, You might want to consider using phonetic matching techniques, such as SOUNDEX and DIFFERENCE, see SQL Server 2000 BOL title "Comparing SOUNDEX and DIFFERENCE" for more info. Additionally, you can combining SOUNDEX and N-grams and this *might* be a solution and you should review the following paper "Improving Precision and Recall for SOUNDEX Retrieval" at: http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf as it is helpful in improving the SOUNDEX algorithm. Still there are other methods for this problem as I suspect Google's "Did you mean" spelling suggestion is not using a spelling dictionary and instead they using Text Mining techniques against their Search query logs for word suggestions.

    Finally, and as I'm no lingualst, it may be that nature is not a form of natural, but rather a "related word" or not exactly a synonyn, see http://www.hyperdictionary.com/search.aspx?define=natural.

    Hope this helps!

    John

    SQL Full Text Search Blog

    http://spaces.msn.com/members/jtkane/


    John T. Kane

  • The version returns:

    Microsoft SQL Server  2000 - 8.00.760 (Intel X86)

     Dec 17 2002 14:22:05

     Copyright (c) 1988-2003 Microsoft Corporation

     Standard Edition on Windows NT 5.0 (Build 2195: Service Pack 4)

    NT 5.0 is Windows 2000, right?

    Anyway, this is my second attempt to write this reply (the first one inexplicably vanished after I hit "post"). You are correct in saying that natural is not strictly a synonym of nature, but stemming is not intended to return synonyms (a synonym list for nature includes words like "earth" and "forest", for example).

    According to http://www.comp.lancs.ac.uk/computing/research/stemming/general/ stemming is supposed to return "morphological variants" of a search term by reducing a word to its root form. The two search words you have tried in your test have not been reduced at all, instead only "s" and an apostrophe have been added in order to return the plural forms of the words.

    In any case the Porter Stemmer algorithm, implemented in JSP, gives pretty useful results when used in conjunction with the CONTAINS predicate. However, anyone interested in implementing a search engine style feature with SQL Server should be aware that it does NOT implement stemming itself, no matter what Microsoft and a dozen other websites seem to claim.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply