Advice on how to return results by matching score

  • Hi Gurus,

    I would like to have your advice on the following "idea"...

    Background:

    In order to ease the understanding, let's consider a site dedicated to house renting for example.

    Owners would define their house characteristics like (price, location, environment...).

    In order to enter all data, the owner is invited to use dropdown lists for some special fields (environment, house type, ...).

    Some characteristics are mandatory, some others are not. Also, the owner can enter description as free text.

    In order to record a house entry, I use one dedicated table (called tbl_houses) which makes links to other tables (list of characteristics).

    Also free text as well as "translated characteristics" are stored in a table called tbl_description.

    Objectives:

    I want to let users the possibility to query the database as follows:

    * they can fill in any criteria (linked to house characteristics) but they are not obliged to

    * they can enter some keywords at the same time.

    Situation:

    Everything works ok today except that once a characteristics is set by the user and does not correspond to data previously entered by the owner, records are not retrieved,

    which is normal.

    Ideas:

    I would like to give a score to criteria matching and sort the results by score descending.

    In other words, let's take an example:

    tbl_houses:

    house_idbigint identity(1,1),

    owner_idbigint,-- refers to tbl_owners

    data_0int,

    data_1int,

    data_2int

    tbl_description:

    description_idbigint identity(1,1),

    house_idbigint, -- refers to tbl_houses

    description_textvarbinary(MAX)-- stores the descriptions in HTML format due to other constraints I have, as multilinguism

    Today, queries are dynamically built, based on the query criteria entered by the users but they are of the following format:

    SELECT H.house_id, O.owner_name, H.data_0, H.data_1, H.data_2

    FROM tbl_houses H

    INNER JOIN tbl_owners O ON H.owner_id = O.owner_id

    INNER JOIN (

    SELECT DISTINCT TD.house_id

    FROM tbl_description TD

    INNER JOIN CONTAINSTABLE(dbo.tbl_description, [description_text], @freetext) AS ftt

    ON TD.description_id = ftt.

    ) S

    ON H.house_id = S.house_id

    WHERE

    H.data_0 = @data_0_value

    AND

    H.data_1 = @data_1_value;

    Ideal objective:

    I would like the query to return the "top ranked" results at first, followed by "lower ranked" ones.

    I was thinking about the following:

    1) If a criteria is set, to give it a High score if matching but a lower if not

    2) If free text keywords are entered, to give a High rank for descriptions matching the keywords, a lower rank for the other ones.

    3) Not exclude any results but give them a lower rank.

    Therefore my idea was (since I am dynamically building the SQL statement):

    1) If a criteria is defined by the user to check if the data is defined at the record level.

    If yes, give a score depending on the matching, otherwise if not defined by the owner, give a score = 1 (for doubt)

    2) Same would apply to keywords

    3) Some exclusion criteria could exist

    This could be translated as follows (considering the fact that if a criteria is not defined by the owner, the data = -1 at the database level)

    DECLARE @topRankdecimal;

    SET @topRank = (SELECT MAX(RANK) FROM CONTAINSTABLE(dbo.tbl_description, [description], @in_freetext, LANGUAGE @in_language_LCID));

    WITH query_fts AS (

    SELECT DISTINCT TD.house_id, ftt.[RANK] AS FT_RANK

    FROM tbl_description TD

    INNER JOIN CONTAINSTABLE(dbo.tbl_description, [description_text], @freetext) AS ftt

    ON TD.description_id = ftt.

    )

    SELECT H.house_id, 0.owner_name, H.data_0, H.data_1, H.data_2,

    Score = (

    CASE WHEN H.data_0 = -1 THEN 1

    ELSE 100 * (CASE WHEN H.data_0 = @data_0_value THEN 1 ELSE 0 END)

    END

    +CASE WHEN H.data_1 = -1 THEN 1

    ELSE 100 * (CASE WHEN H.data_1 = @data_1_value THEN 1 ELSE 0 END)

    +CASE WHEN FTS_FT_RANK IS NULL THEN 0

    ELSE 100 * CAST(FTS.FT_RANK as DECIMAL)/@topRank) END

    )

    FROM tbl_houses H

    LEFT JOIN query_fts FTS

    ON FTS.house_id = H.house_id

    INNER JOIN tbl_owners O ON H.owner_id = O.owner_id

    WHERE H.exclusion_criteria = @exclusion_criteria_value

    ORDER BY Score DESC;

    Do you know any other technique that would offer me this?

    Would you have any advice?

    Many thanks in advance,

  • If I've understood you correctly, I think you could go about your problem in the following way. I'm talking at a conceptual level, as it would take too much time to put this into your example - but I'm sure you can do that yourself. Also, you'd obviously want to replace all the "SELECT *" references.

    Example: keywords are 'detached', 'centralheating'

    SELECT tbl_houses.*,CASE WHEN tkw1.house_id IS NULL THEN 0 ELSE 1 END + CASE WHEN tkw2.house_id IS NULL THEN 0 ELSE 1 END AS score

    FROM tbl_houses th

    LEFT JOIN (SELECT * FROM tbl_keywords WHERE keyword='detached') tkw1 ON tkw1.house_id=th.house_id

    LEFT JOIN (SELECT * FROM tbl_keywords WHERE keyword='centralheating') tkw2 ON tkw2.house_id=th.house_id

    ORDER BY score DESC

    Does that get you some of the way?

  • Thanks for your reply but the example you gave does not give any score if keywords are present or not and, I would like to weight the results, also by the presence of the keywords...

  • Hi again.

    It would give you a score if the keywords were not present - the LEFT JOIN means that you get a row returned, and the CASE statement allows you to set the score for when the keyword isn't present (I've used 0, but you don't have to).

    I'm not sure what else you're after - could you be clearer?

  • That solution would require a large amount of dynamic sql with potentially hundreds of joins to the same table. I have an idea that would work using a string splitter. I will see if I can put it together.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Can you put together some ddl and sample data? It sure make life a lot easier without having to create your structures and data. 😉

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (12/13/2011)


    Can you put together some ddl and sample data? It sure make life a lot easier without having to create your structures and data. 😉

    A list of the most frequently encountered keywords would be tremendously useful too. It would be a lot simpler to match '3' to a column [Bedroom_count] than to hope to find the string "three bedrooms" in a huge pile of text 😉


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • Sean Lange (12/13/2011)


    That solution would require a large amount of dynamic sql with potentially hundreds of joins to the same table.

    Sure, my solution isn't necessarily the best by any means. But hundreds of joins? How many keywords are you going to be searching on for a house? Rightmove, the leading UK property search site, searches on fewer than 10 criteria, and many of those are non-keyword fields like price. As to the large amount of dynamic SQL: if you want to wrap this in a Stored Proc then yes, that's an issue, but it would seem to make more sense to generate this SQL in the webform, in which case there's nothing too complex about it at all.

    As I say, my solution may not be brilliant - but knock it for the right reasons! 😉

  • integritec (12/14/2011)


    Sean Lange (12/13/2011)


    That solution would require a large amount of dynamic sql with potentially hundreds of joins to the same table.

    Sure, my solution isn't necessarily the best by any means. But hundreds of joins? How many keywords are you going to be searching on for a house? Rightmove, the leading UK property search site, searches on fewer than 10 criteria, and many of those are non-keyword fields like price. As to the large amount of dynamic SQL: if you want to wrap this in a Stored Proc then yes, that's an issue, but it would seem to make more sense to generate this SQL in the webform, in which case there's nothing too complex about it at all.

    As I say, my solution may not be brilliant - but knock it for the right reasons! 😉

    LOL ok so up to 10 dynamic joins. Not putting this in a stored proc is an invitation to sql injection. Pass through queries from a web form are incredibly dangerous. If you do this as a pass through you have to parameterize your query at the very least.

    I don't know anything about Rightmove or what the leading UK property companies experience for the typical number of search criteria. I also don't see how that is relevant to the OP here.

    I don't know anything about the webform the OP is using but if you read the OP it is entered as freetext. That means the user can type in whatever they want. If the UI allows that much text then the code has to be able to deal with it. The only way to search for all of them is to first parse their entry into words.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (12/14/2011)I don't know anything about Rightmove or what the leading UK property companies experience for the typical number of search criteria. I also don't see how that is relevant to the OP here.

    Because he was talking about searching a property database. Seemed a logical comparison to draw.


    Agreed on SQL injection - needs to be guarded against. Doesn't make the solution bad - you just have to sanitise your inputs.

    As to doing 10 JOINs - maybe less than ideal, but usually far more efficient than a single function call.

  • integritec (12/14/2011)


    Agreed on SQL injection - needs to be guarded against. Doesn't make the solution bad - you just have to sanitise your inputs.

    As to doing 10 JOINs - maybe less than ideal, but usually far more efficient than a single function call.

    True enough. "Sanitizing" inputs is a slippery slope though. Much better to parameterize your queries. No matter the level of cleansing you just can't possibly clean it to the point that executing code from an end user is safe. It is easy to do things like inject hex or otherwise obfuscate input to circumvent any kind of cleansing mechanism.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply