Search Engine on Multiple Words

  • Hmmm. will think a bit more. Do you have some sample data please (as insert statements if possible)?

    The reason I suggested Full Text is that it does automatically what you're trying to do by hand. Create lists of words and occurences and allows quick and easy searching

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • In this example I have 1 x url http://www.anywebsite.com and on that page it has two words, 'hello' and 'goodbye'

    If I do a SQL query for both words i.e. AND, it should return just this page, not any other pages that may be listed with either of these pages! Thanks.

    CREATE

    USE

    [SearchEngine]

    GO

    /****** Object: Table [dbo].[tblOccurrence] Script Date: 07/16/2007 14:05:54 ******/

    SET

    ANSI_NULLS ON

    GO

    SET

    QUOTED_IDENTIFIER ON

    GO

    CREATE

    TABLE [dbo].[tblOccurrence](

    [occurrence_id] [int]

    IDENTITY(1,1) NOT NULL,

    [word_id] [int]

    NOT NULL,

    [page_id] [int]

    NOT NULL,

    [datetime] [datetime]

    NOT NULL,

    CONSTRAINT [PK_tblOccurrence] PRIMARY KEY CLUSTERED

    (

    [occurrence_id]

    ASC

    )

    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    )

    ON [PRIMARY]

    CREATE

    TABLE [dbo].[tblWord](

    [word_id] [int]

    IDENTITY(1,1) NOT NULL,

    [word] [nvarchar]

    (50) COLLATE Latin1_General_CI_AS NOT NULL,

    [datetime] [datetime]

    NOT NULL,

    CONSTRAINT [PK_tblWord] PRIMARY KEY CLUSTERED

    (

    [word_id]

    ASC

    )

    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    )

    ON [PRIMARY]

    CREATE

    TABLE [dbo].[tblPage](

    [page_id] [int]

    IDENTITY(1,1) NOT NULL,

    [nvarchar]

    (50) COLLATE Latin1_General_CI_AS NOT NULL,

    [datetime] [datetime]

    NOT NULL,

    CONSTRAINT [PK_tblPage] PRIMARY KEY CLUSTERED

    (

    [page_id]

    ASC

    )

    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    )

    ON [PRIMARY]

    DATA

    INSERT INTO [SearchEngine].[dbo].[tblWord]

    ([word])

    VALUES

    (<'hello', nvarchar(50),&gt

    INSERT

    INTO [SearchEngine].[dbo].[tblWord]

    ([word])

    VALUES

    (<'goodbye', nvarchar(50),&gt

    INSERT

    INTO [SearchEngine].[dbo].[tblOccurrence]

    ([word_id]

    ,[page_id])

    VALUES

    (<1, int,>

    ,<1, int,&gt

    INSERT

    INTO [SearchEngine].[dbo].[tblOccurrence]

    ([word_id]

    ,[page_id])

    VALUES

    (<2, int,>

    ,<1, int,&gt

    INSERT

    INTO [SearchEngine].[dbo].[tblPage]

    ()

    VALUES

    (<'http://www.anywebsite.com', nvarchar(50),&gt

  • I had to write something once that had to search either for words in any order inside a string, or the exact string.  What I had to search was the entire library of tapes for one of the largest Radio companies in the US....  i.e. millions of records, many millions of words to search.

    (there are good reasons why I did not use full text search for this which I am not going into)

    What I ended up doing was create a table of words, one record for each word in each broadcast....  With that could do an index on each word so no need for a like search.

    To find the record(s) I needed in the search would search the list of words, then count that search result to see if it matched the count of how many words the user had entered to search for.  Eliminate those that had a count that was less than it, and voila had the "anywhere" search result.  If they wanted to search the exact string, I simply ran a where clause on the potential resultset from an "anywhere" search with the exact string. 

    In SQL 2005 this would have been a lot easier since I could have used varchar(max) to store the string instead of text.

    I put some limits on the resultset, i.e. max of 200 returned, but they could further refine the query after it ran the first time to eliminate it.  The query could also limit to what type of tapes, what year the recording was made and a host of other information.

    The search never took more than 4 seconds no matter how complicated a search they wanted to do.

  • sorry mate if i didn't clearly understood the problem..
    i guess i was barking at the wrong tree.. heheheh
    
    btw,that solution that i proposed is well tested..
    it is used for some of our search functionality with multiple
    parameters
    
    anyway, i think  Anders Pedersen' solution is suited for 
    your prob.. though you must ensure that the records(words)
    stored in the db are static..
    it seems reasonable to me, and i have some
    thoughts of presenting that to my tech lead.. 
    
    
    
    
  • Hi Christoper,

    No problem what so ever, I probably explained myself badly!

    I have read Anders reply many times and still trying to get a picture of how it will work!!

    Scenario

    If I had a web page that has three words on it for example:

    http://www.mywebsite.com with words: 'race' 'london' '2007'

    I need to record the web address in my table:

    tblPage

    page_id = 1 (Auto increment)

    url = http://www.mywebsite.com

    date = Date added

    I then need to record the three unique words in my table:

    tblWord

    word_id = 1 (Auto increment)

    word = 'race' (Indexed NO duplicates allowed)

    date = Date added

    word_id = 2

    word = 'london'

    date = Date added

    word_id = 3

    word = '2007'

    date = Date added

    I then need to record the occurrence of each word:

    tblOccurrence

    occurrence_id = 1 (Auto Increment)

    word_id = 1

    page_id = 1

    date = Date added

    occurrence_id = 2

    word_id = 2

    page_id = 1

    date = Date added

    occurrence_id = 3

    word_id = 3

    page_id = 1

    date = Date added

    The dates are to be used to purge old indexed pages.

    In this example the words only occurred once on this page, however, in reality they could appear several times.

    QUERY

    So if the user searches for 'race london 2007' as a complete string, I need to perform a logical AND to pull back this web address, however, doing a logical AND does not and willl not work as it is will never be true, this is my puzzle!

    WILL NOT WORK: WHERE (w.word='race') AND (w.word='london') AND (w.word='2007')

    Doing a logical OR will work but it will return the page even if the user just searches for teh word 'race'

    Does that make sence, or am I coming about this from the wrong angle?

  • Actualy the words are not unique.  They would be unique for a mathing parent ID.  And yes my solution was for fairly static data, new records where added, so I would after a new record was added add the words to the words table.

    In pseudo code.... my tables looked something like this:

    Table: Main Record

    RecordID

    StringToSearchIn

    Bunch of other columns used to limit the search (for example we had TypeOfRecording, Source, TapeType, Author, Actors etc.  Most of these where FK'ed out).

    Table: Words

    RecordID (FK'ed to above)

    Word

    (I filtered out common words like 'a', 'the', etc, both in the Words table and from the search string passed to me)

    If I was searching for records that had the following words in it in any order: "London, Race, 2008" I would search for RecordIDs in the Words table that had those words in it.  If all 3 where found it would return that recordID to the front end, with the entire StringToSearchIn.

    If I was searching for "The London 10K Race in 2008" I would search for records that had "London, 10K, Race, 2008", whatever records that had those 4 words in it would then be returned tothe stored procedure, which would then run select from Main Record table where StringToSearchIn like "%The London 10K Race in 2008%" to only limit the search to records that matched the exact string asked for.

    As I said this worked in about 3-4 seconds, and the "server" it ran on was a normal desktop computer with enough hard drive space.

     

  • I got you now, LOL!!

    I have now modified my tblPage to include a keyword field ntext(), added the words in here also as well as maintaining a unique list of words.

    QUERY1

    select word_id, word from tblWord where word='race' or word='london'

    go

    Get my unique word_id's and pass to next query...

    QUERY2

    select url, count(*) as occurrences from tblPage inner join tbloccurrence on tblPage.page_id = tblOccurrence.page_id

    where ((word_id=2262) or (word_id=2950)) and ((keywords like '%london%') and (keywords like '%race%'))

    group by url

    order by occurrences desc

    go

    Works great, many thanks, much appreciated

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply