May 25, 2004 at 3:01 pm
On a Win 2K machine this query works, on a 2003 same sql build the
query returns way too many rows.
select ID, DATETIMESTAMP, DIGITALID, CREATIONDATETIME FROM tblStories
AS C (NOLOCK)
INNER JOIN CONTAINSTABLE(tblStories,HEADLINE_EXROUTING,'("NJCTC" or
"APALL") AND "<CATEGORY>V</CATEGORY>"') AS K
ON C.ID = K.
John Zacharkan
May 25, 2004 at 5:34 pm
Asking the obvious here...Other than the OS, what is different. SP Level? SQL Server Configuration? Rows in the Table?
Matt
May 25, 2004 at 6:27 pm
Same sql build and server config. The databases are identical, the row counts are the same.
Think it's the O/S in particular index server
John Zacharkan
May 25, 2004 at 8:02 pm
Ok. The problem that you're running into is a change in the way that Microsoft implemented the Word Breaker functionality for FTS. In Windows 2000, Microsoft implemented the Word Breaker functionality in the infosoft.dll. In Windows 2003, this has been moved (and *improved*) to the langwrbk.dll. The new dll has a different set of rules that define how to break a word apart.
Just looking at your query, I would guess that the problem with your query on Windows 2003 is that the "<CATEGORY>V</CATEGORY>" part of your query is not searching on what you think it is. The Windows 2003 dll defines the rules for text ("V" in your case) near punctuation (">" or "<" in your case).
I don't know the work around...but at least maybe this can point you in the right direction.
May 27, 2004 at 9:44 am
Matt I'm hearing similiar from other sources and appreciate your response. I have quite bit of developement work if we need to adjust to this new behaviour, obviously not please with Microsoft for this change. And it is a change with no recourse other than man hours on our behalf.
I have opened a ticket with Microsoft and will follow up here with their reply.
thank you
John Zacharkan
From John Kane at
http://www.developersdex.com/sql/message.asp?r=4219890&p=1922
Zach,
This is less an issue with XML tags than it is with the OS platform (Win2K
vs. Win2003) and the OS-supplied wordbreaker dll. Specifically, the Windows
2000 Server (Win2K) wordbreaker - infosoft.dll - indexes the "<" and ">" and
they are not removed when in contact with your search word, while under
Windows Server 2003 (Win2003) the are removed
A work around for this on Win2K is to drop and re-create your FT Catalog and
use the Neutral "Language for Word Breaker" for the Item column. However, I
don't *believe* this is true for Win2003 as this is the default behavior.
I'll test this and let you know in a future posting...
Regards,
John
John Zacharkan
June 24, 2004 at 1:39 pm
Reply back from Microsoft
Hello John,
This is Rob with Microsoft SQL Server Support. This is a follow up to the voice mail that my manager and I left for you today regarding your issue where Full Text Search on Windows 2003 does not break on words in the same manner as Windows 2000.
We did hear back from the MSSearch developers today. They are the development team responsible for the changes to the wordbreaker functionality. They reinforced this change in behavior for Windows 2003 is by design in order to support HTML Tags as breaking words and allow users the ability to search on a particular HTML Tag. We hadn't sent you what exactly the wordbreaker is doing, I've included at the bottom of this email an example that shows how the wordbreaker is separating '<Category>v</Category>' between Windows 2003 and Windows 2000.
Going forward, we can do a formal submission to request a change be made so that the previous functionality would be supported. Regretably, even if the change is approved, it will likely be some time before this would be available for use and would likely not change the fact you all would have to use one of the following workarounds in the interim given the severity of impact this current behavior is having on your business. Also, if approved, submitting a change request means you all would be willing to put in place the patched file(s). It's important to understand that these will have minimal testing as you all would be doing the testing as part of the fix to help ensure the problem is resolved. Please let us know if you would like us to move forward with submitting this change request. For the change request process, we will need detailed explaination for why none of the current workarounds can be used. From our discussion, I already have an idea of what some are, but we need it in writing for the request.
POSSIBLE WORKAROUNDS:
==========================
Although none of these are ideal, it is a list of ones we've come up with to date for working around this issue. You all may have thought of others. As mentioned before, even after submitting a change request, one of the following (or one you all have thought of) would need to be used in the interim while the change request goes through its process for approval and if approved, created.
1. Remove all Single Letter noise words from the Noise.enu file. This is the current recommended solution even though this causes significant increase in the Full Text Catalog and there are concerns regarding the performance. Other than the amount of time it takes to do a full population, since we canceled the full population, it is unknown as to how much of a performance hit is encountered. If possible, we would like to try this workaround in a test environment in order to get a better feel for the performance impact. Also, we would like to know how often a full population is done on this full text catalog. The upside to this is it is the least impact as far as code change or OS change and can be implemented right away. The time hit is the amount of time for a full population and the testing for performance impact on a test server.
OR
2. A rewrite of the application to not use a single letter for the <Category></Category>. I know this is costly and time consuming, but the upside is this would be a long term fix for this problem. Another benefit is that with new installs (for example, you set up a new server), no one would have to recall that they'd have to edit the noise file or make any other changes to get the desired behavior.
OR
3. Use the Wordbreaker from Windows 2000 on Windows 2003 (ie. change the registry to use the infosort.dll). This will give the same functionality as you experience today on Windows 2000. However, there are some caveats as this would put this software in an unsupported configuration. It is very likely that any future change in a security fix, hotfix, service pack, upgrade, etc from any application that uses search functionality (SQL Server, Sharepoint, index server,etc). could potentially break this workaround. So, this workaround is seen as a short term fix verses a long term fix and would likely result in you hitting this problem again in the future.
OR
4. Revert back to Windows 2000. Although we'd like to find a way to resolve this issue for you and still use Windows 2003, at this time the critical problem is to reduce the impact to your production environment. In taking that in consideration, it may be best to role your production environment back to using Windows 2000. The upside to this is that there is no code change, testing, or file modifications needed to make this work. The impact is the amount of time and effort to put back in place Windows 2000.
The following is the example I promised to include which demonstrates the current functionality of the wordbreaker on '<CATEGORY>v</CATEGORY>' for both Windows 2003 and 2000. You can see that the Windows 2003 is breaking on the html tag leaving 'v' as its own word which is considered a noise word.
WINDOWS 2003:
=================
U:\sdnt\inetsrv>lrtest /b /c:{80A3E9B0-A246-11D3-BB8C-0090272FA362} "<CATEGORY>v</CATEGORY>
dll loaded: U:\WINDOWS\system32\LangWrbk.dll
dll version 5.2.3790.0
file create time: 11-03-2003 11:35a
file last write time: 3-25-2003 5:00a
FileVersion: '5.2.3790.0 (srv03_rtm.030324-2048)'
FileDescription: 'English wordbreaker'
CompanyName: 'Microsoft Corporation'
ProductName: 'Microsoft« Windows« Operating System'
Wordbreaker requires license: No
Wordbreaker license: 'Copyright Microsoft Inc.'
Original text: '<CATEGORY>v</CATEGORY>'
IWordSink:utWord: cwcSrcLen 8, cwcSrcPos 1, cwc 8, 'CATEGORY'
IWordSink:utWord: cwcSrcLen 1, cwcSrcPos 10, cwc 1, 'v'
IWordSink:utWord: cwcSrcLen 8, cwcSrcPos 13, cwc 8, 'CATEGORY'
NOTE: As you can see, the word breaker is using the "<" and ">" as characters for breaking words on.
WINDOWS 2000:
============
U:\sdnt\inetsrv>lrtest /b /c:{59e09848-8099-101b-8df3-00000b65c3b5} "<CATEGORY>v</CATEGORY>
dll loaded: U:\WINDOWS\system32\infosoft.dll
dll version 5.2.3790.0
file create time: 11-03-2003 11:34a
file last write time: 3-25-2003 5:00a
FileVersion: '5.2.3790.0 (srv03_rtm.030324-2048)'
FileDescription: 'Wordbreaker and stemmer dll'
CompanyName: 'Microsoft Corporation'
ProductName: 'Microsoft« Windows« Operating System'
Wordbreaker requires license: Yes
Wordbreaker license: 'Copyright Infosoft International Inc.'
Original text: '<CATEGORY>v</CATEGORY>'
IWordSink:utAltWord: cwcSrcLen 21, cwcSrcPos 0, cwc 21, '<CATEGORY>v</CATEGORY'
IWordSink:utWord: cwcSrcLen 21, cwcSrcPos 0, cwc 12, '<CATEGORY>v<'
IWordSink:utAltWord: cwcSrcLen 21, cwcSrcPos 0, cwc 1, '1'
IWordSink:utWord: cwcSrcLen 21, cwcSrcPos 0, cwc 8, 'CATEGORY'
NOTE: As you can see, the < and > are not word separators and therefore <Category>v</Category> is treated as one word.
Let me know what questions you have.
Sincerely,
Rob Beene, MCSD, MCDBA
VOICE: (980) 776-8613
EMAIL: RBeene@Microsoft.com <mailto:RBeene@Microsoft.com> <mailto:RBeene@Microsoft.com><mailto:RBeene@Microsoft.com>
HOURS: MondayThrough Friday to 7:00 AM to 4:00 PM Eastern Time
John Zacharkan
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply