Counting inflectional forms.. this should be easy surely?!

  • Hello to all and glad to say I've already learnt a lot since joining the site about 15 minutes ago.

    I'm looking for help on what at first I thought was a simple problem. I have a set of data containing information for a website directory which contains columns for website_url, page_description, page_topic etc which I would like users to be able to search by keyword. In order to be useful it is necessary to rank the results by the number of separate occurrences of the users chosen search terms, but also for inflected terms (eg, using microsoft's imaginitive example 'run' would also match 'running' etc).

    The table is full text indexed on the page_description and page_topic columns and returning rows using the CONTAINS (*,'FORMSOF (INFLECTIONAL,run)') predicate works fine, however I would like to count the number of occurances of the search terms and then rank the results according to this measure. I've experimented with CONTAINSTABLE and FREETEXTTABLE and the results were (if we're being kind) a little weak. I have written a procedure that will count the occurrances if (and here is the problem!) they occur in distinct columns, ie if the inflected form of the search term appears in the page_description and page_topic columns. What I can't do is to count how many times the inflected form occurs in the page description..

    So.. is there a function or procedure that counts the occurrances of an inflectional form? Or failing that is there a way to access the inflectional forms produced by the generator term (eg 'run') so I can count them by another method?

    Any help on this one is greatly appreciated as I'm about to give up hope.

    Thanks in advance and keep up all the good work,

    Jeff

  • Haven't the foggiest really, since I don't use fulltext services I can't say I'm particularly versed in it's do's and dont's..

    However, here's a generic description of a method to count the number of occurences of a charachter or pattern given an input string. Perhaps you can adapt the idea in some way.

    -- Sample on how to count # of occurences of a single char

    -- or pattern from a text source

    declare @pattern varchar(8000), @source varchar(8000)

    select @pattern = 'some', @source = 'some string with some text in it'

    select (datalength(@source) - datalength(replace(@source, @pattern, ''))) / datalength(@pattern) as 'patCount'

    /Kenneth

  • Thanks for the swift reply Kenneth,

    The code you've written is very slick but doesn't really get to the root of the problem in that I don't actually know the exact term(s) I'm searching for as they will have been generated from the root word submitted by the user. What I need to do is to find mathces of the inflected terms, so if the user has entered 'run' then I need to count the occurrancecs of the generated terms: 'runner', 'running' etc. The crux of the matter is that I need in some way to access the inflectional forms generated by sql to perform it's expanded searches. I've seen suggestions on other sites that this resides in the thesaurus file however this is only for USER generated alternatives to a given term. What I need are the ones that ship with SQL..

    I'm surprised nobody has had this problem before, unless that is, I'm missing something fairly fundamental.

    Thanks again,

    Jeff

  • Not exactly what you want - and not exact(!) - but have you considered using SOUNDEX()?

  • To follow your example of 'run, runner, running' there are three occurances of run and the search will only find words starting with run. So we can put a space in front of run and then do the count

    so a slight variation of Kenneth's solution

    declare @pattern varchar(8000), @source varchar(8000)

    set @source = ' ' + 'some people run while some are runners running but not runs are not overrun'

    set @pattern = ' ' + 'run'

    select (datalength(@source) - datalength(replace(@source, @pattern, ''))) / datalength(@pattern) as 'patCount'

    Answer: 4

     

    Far away is close at hand in the images of elsewhere.
    Anon.

  • I have no idea how efficient that is but its an amazingly innovative solution!

  • Thanks David, and nice code.

    It takes me part of the way there I think and should be reasonably effiecient. It still leaves me a problem of users who enter 'runner' as their search term.. if we know that the root of 'runner' is 'run' then we're on to a winner but that really takes us back to the original problem of inflected forms. I could specify that users enter the root terms but.. the problems with that one don't really need explaining!

    Thanks for all the answers for this one, I'll post again if I find the solution.

    Cheers,

    Jeff

  • You could try incorporating DIIFERENCE with a minimum return of 3 into your code...

    SELECT DIFFERENCE( 'run', 'runner')

     

    I wasn't born stupid - I had to study.

  • Thanks for the idea Farrel,

    I've played around a bit and, as you say the DIFFERENCE function does indeed yield good results for the inflected forms of most words. I'm not sure that soundex is the way to go though as if you run the same query for SELECT DIFFERENCE( 'honey', 'funny') then you'd also get a 'match'. I think the possibility of mismatches using this means that it's a non-starter although I'm glad to say I've learnt yet another new SQL function.

    I'm currently attacking the problem from another angle shich is to run separate CONTAINSTABLE queries for each of the search terms individually and performing a sum on the ranks of the results sets. This seems to give more sensible results than those offered by FREETEXTTABLE queries but doesn't take me any closer to my ultimate goal which was to be able to highlight the users search terms, whether inflected ornot, in the results returned. I'll keep you all posted on the progress of this particular venture.

    Thanks again for all the tips, they are much appreciated.

    Cheers,

    Jeff

  • Ok, here's a possible ray of hope for anyone following this thread..

    http://www.dotnet-news.com/lien.aspx?ID=3958

    is an article that (apparently) captures the output from the microsoft stemmer component.. however, for my sins (which are many!) I've got no clue about C# so if anyone can make any sense of it then please let me know.

    I'm currently working with aspx VB and it's slow progress converting but I'll let you know how I get on.

    Cheers,

    Jeff

    (message to moderator.. not sure if you allow external link so apologies in advance if you don't)

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply