December 13, 2007 at 8:40 pm
I'm after some guidance on how the addition of a wildcard character to a word in a CONTAINS clause affects the results being returned by a client's application.
Firstly, some technicals:
SQL Server 2005 Standard Edition v9.00.2153.00 SP running on
Windows Server 2003 Standard edition SP1
Secondly, the search is using the noiseENU.txt noise word file which contains the word "No" which I assume has a major bearing on the examples below. Note that all single numeric and alphabet characters have been removed from the file.
When the client application runs the following T-SQL:
SELECT
dmename
FROM DOCUMENT
WHERE CONTAINS(dmename,'"gordon" & "machine" & "no.3"')
.. the result set returns 800 records with results such as:
O-47832 SYSTEM - STEELWORK DETAILS, GORDON PS, NO. 3 MACHINE
O-47831 SUPPORTS AND HANGERS, GORDON PS, NO.3 MACHINE EXCITATION BUSDUCTS
O-47678 GORDON POWER STATION - NO.3 MACHINE PROCEDURE FOR DISMANTLING TURBINE
However, when a wildcard character is added to "No.3", ie:
SELECT
dmename
FROM DOCUMENT
WHERE CONTAINS(dmename,'"gordon" & "machine" & "no.3*"')
.. only 11 records are returned, such as:
F-01039 Gordon Power Station, No. 3 Machine Generator, Drawing Index
A3-03116 DRAFT TUBE AIR ADMISSION LEAKAGE, GORDON POWER STATION, No. 3 Machine
A1-06405 GORDON POWER STATION, MACHINE NO. 3 DRAFT TUBE CROWN DOOR SEAL DETAILS
In the latter query all the results return document names where there is a space between "No." and "3" but no results where the document name contains the substring "No.3". As I mentioned at the start, I have a strong hunch that the inclusion of the "No" word in the noise file is related to why this occurs but I'm struggling to get my head around what exactly is happening.
Any illumination will be greatly appreciated.
🙂
December 14, 2007 at 7:08 am
A few things:
1. Punctuation characters, e.g. the period (.) are removed from indexing and/or search.
1.1 So this results in a search of "no 3"
2. Single digits, e.g., the numeral 3 is considered a "noise" word.
2.1 So now the search term is just "no".
Therefore, the resulting search term ends up being just "no".
Comment and suggestion: The "noise" word list is going to cause all sorts of grief for your end users. You'll get lots of questions of "why didn't it find " ". You'll almost need to have the list of terms available at the UI. Try replacing the "noise" word list with an empty list. Microsoft has documentation on how to do this is. Note that you'll have to reindex the data.
December 16, 2007 at 2:53 pm
Thanks JohnG
As I mentioned towards the top of my post all the single numeric and alphabet characters have been removed from the noise file. Would "3" still be treated as a noise word in this case?
December 17, 2007 at 7:12 am
Sorry, my apology - I didn't read your posting thoroughly.
The term "no" is also considered a "noise word". It is in the noiseENU.txt file. Therefore, it is also removed from the query.
I think the difference in behavior is due to the way SQL Server functions as it pertains to the wildcard character (along with punctuation and "noise" word handling). Key note: When the wildcard character (*) is applied to a phrase ALL terms in the phrase are wildcarded, not just the terms with the asterisk. From the BOL:
"When is a phrase, each word contained in the phrase is considered to be a separate prefix. Therefore, a query specifying a prefix term of "local wine*" matches any rows with the text of "local winery", "locally wined and dined", and so on."
Our application also uses the full-text feature. We've had to go through a lot of experimentation and testing in order to pin down the exact behavior. Followed up with a companion "searching best practices" document for our end users.
December 17, 2007 at 3:21 pm
Looks like we may have to do the same.
Thanks for your advice
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply