July 14, 2007 at 10:08 am
Hi all,
I'm currently putting together a simple search engine and having a little problem searching using multiple keywords!
I have three tables:
tblPage
page_id
url
tblWord
word_id
word
tblOccurrence
occurrence_id
page_id
word_id
I have an ASP page that reads a URL, gathers all of the words from the page then populates the tblWord table with the word it has found and then the tblOccurrence table with the page and word id's it has found.
The search ASP file then queries these tables returning the url in the order of occurrence.
The query I have is:
SELECT tblPage.url AS url,
COUNT(*) AS occurrences
FROM tblPage p, tblWord w, tblOccurrence o
WHERE (p.page_id = o.page_id) AND
(w.word_id = o.word_id) AND
(w.word='<%= sKeyword %>')
GROUP BY p.url
ORDER BY occurrences DESC
This works great for one keyword, however, I need to adapt this query to cater for multiple keywords!
Any ideas????
I could do a logical OR, which works but does not return a page if two different words appear on a page!
Many thanks...
Mike
July 15, 2007 at 6:57 am
you need to seperate words in a where clause
where w.word like'<%= sKeyword %> and w.word like'<%= sKeyword %>
for sepeate words
for example vb,asp
where w.word like '<%= vb %> and/or w.word like '<%= asp %>
<a href="http://www.websolsoftware.com"> For IT jobs click here</a>
*Sukhoi*[font="Arial Narrow"][/font]
July 15, 2007 at 7:46 am
Hi,
Many thanks for replying!
I thought that should work, however, I'd already tried this and it returns no records!
The logicl OR is fine, i.e. give me a page if the word 'London' OR 'Race' appears within the page.
If I have in:
tblPage
page_id = 1
url = 'http://www.londonrace.com'
tblWord
word_id = 1
word = 'london'
word_id = 2
word = 'race'
tblOccurrence
occurrence_id = 1
page_id = 1
word_id = 1
occurrence_id = 2
page_id = 1
word_id = 2
With an AND should it still return a valid record?
Thanks.
Mike
July 15, 2007 at 8:11 am
what is the db are you using, and show me the exact query you tried, you need to use like instead of "="
<a href="http://www.websolsoftware.com"> For IT jobs click here</a>
*Sukhoi*[font="Arial Narrow"][/font]
July 15, 2007 at 10:10 am
It is SQL Server 2005.
I'm thinking that the way I'm storing the keywords is probably not the best way to achieve this!
Even if you do a basic search on the tblWord with an AND you will not get the result as it is never true:
NO - select
* from tblWord where word = 'race' and word = 'racer'
OK - select * from tblWord where word = 'race' or word = 'racer'
OK - select * from tblWord where word like 'race%' or word like 'looks%'
I need something like the mySQL TSQL command REGEXP!
!!!!!!!
July 16, 2007 at 1:37 am
Have you considered using the fulltext search functionality for this?
With fulltext you probably won't even need the keywords tables. Put a full text index on the documents (assuming they're stored in a database) and you can use the CONTAINS keyword pull out actaul or near matches.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 16, 2007 at 2:54 am
create procedure usp_SimpleSearchEngine ( @xmlKeywords nvarchar(max) ) as --<root> --<item word=""/> --</root> begin declare @hdoc INT declare @xml nvarchar(max) declare @Keywords table(word nvarchar(200)) --parse the xml and save them in the variable table exec sp_xml_preparedocument @hdoc output, @xmlKeywords insert into @Keywords(word) select tmp.word from openxml (@hdoc, '/root/item',2) with(word nvarchar(128) '@word')tmp exec sp_xml_removedocument @hdoc --select all the records that matches our criteria selecttblPage.url as 'url' , count(*) as 'occurrences' fromtblPage p jointblOccurrence o on((o.page_id = p.page_id)) jointblWord w on (w.word_id = o.word_id) whereexists(select 1 from @Keywords k where w.word like '%' + k.word + '%' ESCAPE '\') order by p.url asc , occurrences DESC end go /*usage exec usp_SimpleSearchEngine @xmlKeywords=N'<root> <item word="race"/><item word="racer"/> </root>' */ ------------------------------------------------------------------------------------ or just simply use this simple sproc to do the job.. its only logic is that it accept a list of keywords, store them in a variable table(don't use temp table) and check for a match if any of them exists in the database (by using the EXISTS keyword in the WHERE clause) if there is any question, suggestion, or anything you may want to ask, send me an email . it will be highly appreciated..
July 16, 2007 at 3:04 am
Hi Gail & Christopher,
Many thanks for your replies, greatly appreciated!
The production server I have is a remote SQL Server 7 that I do not have a major control over (ISP offered service)!
The SQL Server 2005, is on my laptop and used for development.
Not played around with the FULL TEXT within SQL so need to have a read up on this, the procedure above looks interesting, I will give it a go and let you know how I get on.
Many many thanks for all of your help.
Mike
July 16, 2007 at 4:31 am
Hi Christopher,
When I added your stored procedure, I get the following error:
Msg 4104, Level 16, State 1, Procedure usp_SimpleSearchEngine, Line 19
The multi-part identifier "tblPage.url" could not be bound.
Have I missed something?? Thanks Mike
use
SearchEngine
go
create
procedure usp_SimpleSearchEngine
(
@xmlKeywords
nvarchar(max)
)
as
--<root>
-- <item word=""/>
--</root>
begin
declare @hdoc INT
declare @xml nvarchar(max)
declare @Keywords table(word nvarchar(200))
-- parse the xml and save them in the variable table
exec sp_xml_preparedocument @hdoc output, @xmlKeywords
insert into @Keywords(word)
select tmp.word from openxml (@hdoc, '/root/item',2) with(word nvarchar(128) '@word')tmp
exec sp_xml_removedocument @hdoc
-- select all the records that matches our criteria
select tblPage.url as url, count(*) as 'occurrences'
from tblPage p
join tblOccurrence o on (o.page_id = p.page_id)
join tblWord w on (w.word_id = o.word_id)
where exists(select 1 from @Keywords k where w.word like '%' + k.word + '%' ESCAPE '\')
order by p.url asc, occurrences DESC
end
go
July 16, 2007 at 4:37 am
---------------------------------------------------------- -- select all the records that matches our criteria selectp.url as 'url' , count(*) as 'occurrences' fromtblPage p jointblOccurrence o on((o.page_id = p.page_id)) jointblWord w on (w.word_id = o.word_id) whereexists(select 1 from @Keywords k where w.word like '%' + k.word + '%' ESCAPE '\') order by p.url asc , occurrences DESC ---------------------------------------------------------- i misspelled the Alias on the tblPage table.. just replace the query with the one above.. sorry bout that mate.. just pure negligence... hehehhe
July 16, 2007 at 4:41 am
I had changed the alias name to p.url, except I get the following error, so I poped it back !
Msg 8120, Level 16, State 1, Procedure usp_SimpleSearchEngine, Line 21
Column 'tblPage.url' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
July 16, 2007 at 5:07 am
May of sorted it !!!!
I changed to group by p.url order by occurrances DESC
will test
July 16, 2007 at 5:29 am
Hi again,
I managed to get it to work and execute fine now, however, don't you just hate it when somebody says that!
exec usp_SimpleSearchEngine @xmlKeywords
='<root><item word="mjd1"/><item word="mjd2"/></root>'
go
SELECT
url, COUNT(*) AS occurrences FROM tblOccurrence o1
INNER JOIN tblWord w1 ON w1.word_id = o1.word_id
INNER JOIN tblPage p ON p.page_id = o1.page_id
WHERE (w1.word like '%mjd1%') or (w1.word like '%mjd2%')
GROUP BY p.url
ORDER BY occurrences DESC
In these example I have two words that appear on one page 'mjd1' and 'mjd2', I do not want a page to be returned if just one word is present as it is a logical AND that is needed.
The results both look as if they are logically OR'ing.
Hope I have explained myself ok??????
Can the stored procedure by modified?
Thanks.
Mike
July 16, 2007 at 6:14 am
Try adding the following after the group by, before the order by. The filter there will depend on the no on keywords. No promices, I'm guessing here.
HAVING COUNT(distinct w1.word) > 1 -- the count to look for depends on the number of keywords specified.
Also, a word of warning. Get more than a couple thousand words and your search is going to run like molasses. Because of the wildcard on the word, SQL will not be able to use any index on the word table. With small amounts of data you won't notice a difference. Once the tables get large, you may end up with a rather slow running query.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 16, 2007 at 6:23 am
Thanks tried that, still looks like a logical OR!
Beginning to think I'm going about this the wrong way!
Word table is approx 6000, occurrence is approx 35,000
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply