A Google-like Full Text Search

  • This has been a most useful application. I have learned quite a bit about Microsoft technologies, assemblies, strong naming, COM enabling, using assemblies as part of the web server, and as part of SQL server itself.

    I recently added Irony to SQL server 2005, installed in the master database under assemblies I have

    System.Core, Irony_Local, and SQLFTSGoogle, and a wrapper function called PARSEQUERY.

    So enabling full text search on one our databases, we have better granuliarty in our searches without forcing our community of users to figure out the syntax.

    Here are some examples assembled for our users

    USE [Marketing]

    DECLARE @ftsQuery AS NVARCHAR(4000)

    -- Example A: Search for the inflectional forms of the word fail

    --SET @ftsQuery = master.dbo.PARSEQUERY('fail')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example B: Searches for the word fail without generating inflectional forms

    --SET @ftsQuery = master.dbo.PARSEQUERY('+fail')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example C: Search for the thesaurus forms of the word various

    --SET @ftsQuery = master.dbo.PARSEQUERY('~various')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example D: Search for documents containing inflectional forms of the words tyler or karl

    --SET @ftsQuery = master.dbo.PARSEQUERY('tyler OR karl')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example E: Search for documents containing inflectional forms of the words tyler AND karl

    -- the keyword AND is optional

    --SET @ftsQuery = master.dbo.PARSEQUERY('karl snow')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example F: Performs a phrase search for the phrase "Delta Air"

    --SET @ftsQuery = master.dbo.PARSEQUERY('"delta air"')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example G: Search for documents containing inflectional forms of the word tyler

    -- but not the word will

    --SET @ftsQuery = master.dbo.PARSEQUERY('tyler -LLP')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example H: Searches for words that begin with the prefix DEL

    --SET @ftsQuery = master.dbo.PARSEQUERY('DEL*')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example I: Searches for documents that contain the word asbestos in close proximity to the

    -- word liability

    --SET @ftsQuery = master.dbo.PARSEQUERY('<asbestos liability>')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example J: Searches for documents that contain the word asbestos in close proximity to the

    -- word injury

    --SET @ftsQuery = master.dbo.PARSEQUERY('<asbestos injury>')

    --SELECT @ftsQuery AS FTSQUERY

    -- Example K: The AND operator takes precedence over the OR operator in queries. You can change the

    -- operator precedence by using parentheses. Consider the following query:

    --SET @ftsQuery = master.dbo.PARSEQUERY('asbestos injury OR liability')

    --SELECT @ftsQuery AS FTSQUERY

    -- asbestos AND injury OR liability (67 hits)

    -- Example L: Changing the order of precedence using parenthesis

    SET @ftsQuery = master.dbo.PARSEQUERY('asbestos (injury OR liability)')

    SELECT @ftsQuery AS FTSQUERY

    -- asbestos AND (injury OR liability) (31 hits)

    -- The table [Marketing].[dbo].[Conflict Check Requests FTS] has the Full Text Search feature enabled

    -- The CONTAINS predicate only searches within columns specified for FTS search

    SELECT [ID_Number],

    [Adverse_Parties],

    [Allegations],

    [Attorneys_Adverse_To],

    [Conflict_Addtional_Comments],

    [Corporate_Client],

    [Hiring_Law_Firms],

    [Originator],

    [Other_Interested_Parties],

    [Referral_Source_Other],

    [Services_Provided],

    [TSP_Billing_Category]

    FROM dbo.[Conflict Check Requests FTS]

    WHERE CONTAINS (*, @ftsQuery)

    ORDER BY ID_NUMBER

    Like I said I learned a lot due this article, I have the standalone version implemented as a DLL, an assembly used by the intranet, and a SQL server assembly, which is slightly different than the intranet version.

    I will post the code for the differing assemblies along with a detailed write up of what I have learned.

    Thank you

  • Hi Mike, great code!

    It works fine for me in general but I´m having problems when a word begins with "and" or with "or" like "orthopedics".

    I´ve change, in SearchGrammar.cs OrOperator.Rule = Symbol("or") for OrOperator.Rule = Symbol("or ") (note the blank space after the "or") and it works, but I´m not sure its a good solution.

    I´m also having problems is with one letter words, if you search "when I grow up" doesn´t return any result, but if you search "when grow up" it returns matching records including the one who has "when I grow up". FORMSOF (INFLECTIONAL, I) returns nothing.

    Thanks,

    Marcelo.

  • Hi Marcelo,

    The "or" issue was dealt with previously in this thread. If you go back a few pages you'll see the fix. The reason your query's returning nothing with "I" is because "I" is a stopword. If you create a custom stoplist for your full-text index you can remove "I" from the stoplist, or you could modify the code to remove stopwords before the fts query is run. (Option 1 would be easier.)

    Thanks

    Mike C

  • Wonderful.

    Even more useful is the idea that your technique can bridge grammars between SQL and other external apps and not just search.

  • "The description of the article seems to fit and there are plenty of people that would like to use Google syntax to build search on FTS. "

    SQLServerCentral.com should implement this first in this site. Google syntax is default 'AND' not 'OR'.

    I search for 'faceted' i get 123 results

    I search for 'faceted search' i get 14401 results obviously the server is doing a OR search instead of narrowing the search it is expanding it. Boo boo

    I search for 'faceted and search' i still get 14401 results obviously the server is blind to 'AND'.

    I search for 'faceted search navigation' I get 15523 results more of the OR logic at work.

    Still I want to ask anyone here do you know of any article that implements faceted navigation search in SQL Server?

    Faceted search is like on download.com like when you search for 'clinic' it will show you the results with a left pane indicating what else you can add to your search to narrow down your search with a count of results to show how well an additional term will narrow your search, like information scent.

  • Wow. What are the odds? I was just upgrading my websites "SQL search engine" to Full Text Search with a Google style search query syntax. Only this article came a week too late, so I had to figure it out on my own!!! :sick:

    Still though, being the head (and only) programmer at my job, I always appreciate a much much better programmers insight on how to solve the accomplish the same task in a much better way. THANKS!!!

    FYI, I ended up splitting the query in to an array, and then using a combination of instr, regex, and other arrays to keep track of what "words" would be added to the SQL as a phrase, not like, FREETEXTTABLE, etc., etc. It works, but needless to say my methodology is a bit more archaic.

  • I love this bit of code.

    One thing, I found the exclude operator worked a bit inconsistently for me. For example, searching for words that contained hyphens didn't work correctly, often excluding the second part of the hyphenated word.

    I also had the same problem with starting search with an exclusion.

    I've made a few adjustments to remedy these issues, to operate in a slightly more google-like fashion.

    These are not rigorously tested and I know that compilers have a nasty habit of doing unexpected things, so please test and post any bugs.

    Firstly, I attempted to add "-" to the token, but this led to the - characters being consumed by the Term and not being processed as operators, so have just added the "-" to the Term constructor for "All characters" not "All first characters".

    This should allow

    -Happy-Birthday to be parsed as two tokens: - (ExcludeOperator) and Happy-Birthday(terminal)

    The other issue was the placement of the exclude operator. Since the exact operator works in the way that we might want to run an exclude, I added an ExcludeExpression production rule following the same pattern and removed the ExcludeOperator from the "And" production rule.

    To ensure that the whole thing was appropriately encapsulated, I just added brackets to the NOT output expression, which seems to do the job.

    Beneath is my version of the production rules

    public SearchGrammar()

    {

    // Terminals

    IdentifierTerminal Term = new IdentifierTerminal

    (

    "Term",

    "!@#$%^*_-'.?",

    "!@#$%^*_'.?0123456789"

    );

    //Term.Priority = Terminal.LowestPriority;

    var Phrase = new StringLiteral("Phrase");

    // NonTerminals

    var OrExpression = new NonTerminal("OrExpression");

    var OrOperator = new NonTerminal("OrOperator");

    var AndExpression = new NonTerminal("AndExpression");

    var AndOperator = new NonTerminal("AndOperator");

    var ExcludeOperator = new NonTerminal("ExcludeOperator");

    var ExcludeExpression = new NonTerminal("ExcludeExpression");

    var PrimaryExpression = new NonTerminal("PrimaryExpression");

    var ThesaurusExpression = new NonTerminal("ThesaurusExpression");

    var ThesaurusOperator = new NonTerminal("ThesaurusOperator");

    var ExactOperator = new NonTerminal("ExactOperator");

    var ExactExpression = new NonTerminal("ExactExpression");

    var ParenthesizedExpression = new NonTerminal("ParenthesizedExpression");

    var ProximityExpression = new NonTerminal("ProximityExpression");

    var ProximityList = new NonTerminal("ProximityList");

    this.Root = OrExpression;

    OrExpression.Rule = AndExpression

    | OrExpression + OrOperator + AndExpression;

    OrOperator.Rule = Symbol("or") | "|";

    AndExpression.Rule = PrimaryExpression

    | AndExpression + AndOperator + PrimaryExpression;

    AndOperator.Rule = Empty

    | "and"

    | "&";

    ExcludeOperator.Rule = Symbol("-");

    ExcludeExpression.Rule = ExcludeOperator + Term | ExcludeOperator + Phrase;

    PrimaryExpression.Rule = Term

    | ThesaurusExpression

    | ExactExpression

    | ExcludeExpression

    | ParenthesizedExpression

    | Phrase

    | ProximityExpression;

    ThesaurusExpression.Rule = ThesaurusOperator + Term;

    ThesaurusOperator.Rule = Symbol("~");

    ExactExpression.Rule = ExactOperator + Term

    | ExactOperator + Phrase;

    ExactOperator.Rule = Symbol("+");

    ParenthesizedExpression.Rule = "(" + OrExpression + ")";

    ProximityExpression.Rule = "<" + ProximityList + ">";

    MakePlusRule(ProximityList, Term);

    RegisterPunctuation("<", ">", "(", ")");

    }

    and the change to the ConvertQuery function

    The new AndExpression case

    case "AndExpression":

    AstNode tmp2 = node.ChildNodes[1];

    string opName = tmp2.Term.Name;

    string andop = "";

    andop = " AND ";

    type = TermType.Inflectional;

    result = "(" + ConvertQuery(node.ChildNodes[0], type) + andop +

    ConvertQuery(node.ChildNodes[2], type) + ")";

    type = TermType.Inflectional;

    break;

    The new ExcludeExpression case

    case "ExcludeExpression":

    result = " (NOT (" + ((Token)node.ChildNodes[1]).ValueString + ")) ";

    break;

    This is 90% of the way there and should correctly search for examples like the following

    Testing -Happy-birthday

    Testing Happy-birthday

    There may be some strangeness when using quotes e.g. -"Happy-Birthday Test" - I've not tested fully but some thought is required as to the desired behaviour in this case, e.g. do we want to include

    NOT Happy-Birthday AND NOT Test --or--

    NOT Happy-Birthday OR NOT Test

    Because the current production rules would exclude results that had either word in them, rather than the intention of excluding where both words are present.

    Another limitation is if you are searching for something that starts with a dash.

    Anyway, hope this helps,

    Mike Renwick

  • Scrub that, just realised the foibles of using NOT in an FTS query hence the strange production rules in the original article. I will post my working version when complete.

  • Hi Mike

    Nice change! I'll have to play around with it later. A couple of things to note:

    * the - operator maps directly to the ifts AND NOT operator.

    * the double quotes indicate an ifts phrase search, and they map directly to the ifts "term1 term2 ..." syntax, including double quotes. So I'd expect -"happy-birthday cake" to convert to AND NOT "happy-birthday cake".

    If the default ifts phrase search behavior is not what's wanted you'd have to generate a different ifts output from the double quote input.

    Thanks

    Mike C

  • mike.renwick-894639 (6/4/2010)


    Scrub that, just realised the foibles of using NOT in an FTS query hence the strange production rules in the original article.

    I originally had two solutions to this problem of leading not operator:

    1) Ignore any leading NOT operator, or throw an exception that it's not supported. This is the easier way to deal with it.

    2) Rearrange the AST to ensure NOT operators come last. This would be a more elegant solution but more complex to implement.

    For the article I decided to keep it as simple as possible, but it might be time to revisit the issue.

    Thanks

    Mike C

  • Hi, thanks for the wonderful code!! I have a quick question regarding error recovery when parsing the query. I'm wondering if its possible to gracefully recover so that whatever is parsed so far is retained while the parts that's causing the error is ignored.

    For example:

    A user use the following as the query string:

    " space " ; astronaut

    This causes an error on the parser and I'm not sure how to set the ErrrorRule so that the query will at least generate two phrase search with '" space " and astronaut'.

  • Great article, been looking for something along these lines.

    I'm wanting to display the rank as a % and order by this to move relevent results to the top, how does the following look?

    DECLARE @topRank int set @topRank=(SELECT MAX(RANK)

    FROM CONTAINSTABLE([Support_Calls], Problem, @ftsQuery, 1))

    SELECT [ID] AS [Call No],Company_Name,Call_Opened_Date, Problem, Solution,

    CONVERT(VARCHAR(20),CAST((CAST(KEY_TBL.RANK as DECIMAL)/@topRank * 100) AS DECIMAL(13,0))) + '%' as Match FROM [Support_Calls] AS FT_TBL INNER JOIN

    CONTAINSTABLE([Support_Calls], Problem, @ftsQuery) AS KEY_TBL ON FT_TBL.[ID] = KEY_TBL. ORDER BY KEY_TBL.RANK DESC;

    Is there a cleaner way of doing this with CONTAINS?

  • madlan (11/18/2010)


    Great article, been looking for something along these lines.

    I'm wanting to display the rank as a % and order by this to move relevent results to the top, how does the following look?

    DECLARE @topRank int set @topRank=(SELECT MAX(RANK)

    FROM CONTAINSTABLE([Support_Calls], Problem, @ftsQuery, 1))

    SELECT [ID] AS [Call No],Company_Name,Call_Opened_Date, Problem, Solution,

    CONVERT(VARCHAR(20),CAST((CAST(KEY_TBL.RANK as DECIMAL)/@topRank * 100) AS DECIMAL(13,0))) + '%' as Match FROM [Support_Calls] AS FT_TBL INNER JOIN

    CONTAINSTABLE([Support_Calls], Problem, @ftsQuery) AS KEY_TBL ON FT_TBL.[ID] = KEY_TBL. ORDER BY KEY_TBL.RANK DESC;

    Is there a cleaner way of doing this with CONTAINS?

    Hi madlan,

    It looks good from here, although I can't test it to be sure without the tables and sample data. I would account for @topRank possibly being 0 (avoid a div by zero error with a CASE expression and substitute NULL). You can't get a RANK back with CONTAINS, only with CONTAINSTABLE or FREETEXTTABLE so this is your only option with a rank. You might be able to clean your code up a touch with a CTE, something like this:

    ;WITH CTE

    AS

    (

    SELECT [ID] AS [Call No],Company_Name,Call_Opened_Date, Problem, Solution, CAST(KEY_TBL.RANK AS DECIMAL(13, 0)) AS RANK

    FROM [Support_Calls] AS FT_TBL

    INNER JOIN CONTAINSTABLE([Support_Calls], Problem, @ftsQuery) AS KEY_TBL

    ON FT_TBL.[ID] = KEY_TBL.

    )

    SELECT *, CONVERT(VARCHAR(20), RANK / NULLIF((SELECT MAX(RANK) FROM CTE), 0.0) * 100) + '%' as Match

    FROM CTE

    ORDER BY RANK DESC;

    NOTE: Code untested.

  • Thanks Mike!

    How are results sorted when using CONTAINS? The only reason I'm using CONTAINSTABLE is so I can order by relevance.

  • Thanks, good article.

    But there is one thing which is a bit confusing me.

    When I am trying to do a search like 'test (' or like 'test and'

    I am getting a error. But I could do a search in Google using that without any error.

    So, the idea is that user has to be able to write anything he likes and search has to be done in any case.

    What you could suggest?

Viewing 15 posts - 106 through 120 (of 166 total)

You must be logged in to reply to this topic. Login to reply