September 28, 2018 at 11:52 am
One of my company's testers wrote up a bug about an area in our application returning more results than expected when searching for the letter n.
On further investigation, we found something quite startling -- using the fulltext CONTAINS function to match "n*" , "d*" and "t*" are matching many (but not all) records where the text contains numeric, date and time data, respectively.
For example,
I'm able to replicate this behavior on SQL 2017, 2016 and 2012. I imagine it happens on 2014 and older versions, but I don't have instances handy.
This behavior seems to be irrespective of the use of stoplists and other fulltext catalog and index settings. Setting a specific language may affect the variety of formats of of numeric values returned. I've pasted a repro script below.
Is anyone aware of this behavior? If there are undocumented wildcards, are there similarly undocumented escape characters to prevent unintentional matches?
This is one of those issues where it's hard to believe that this behavior hasn't been discovered previously, but I'm not finding any references to it here on SQLCentral, or in the other usual places.
Thanks in advance,
Rob.
Update: There is a Connect UserVoice issue logged, https://feedback.azure.com/forums/908035-sql-server/suggestions/34442179-full-text-search-contains-single-character-n-mat
which links back to a StackOverflow post with some further examples.
-- create a database, then create a full text catalog, table, index, add some data, run a couple queries
use master
-- drop database FTTestDB
drop database if exists FTTestDB
go
create database FTTestDB
go
use FTTestDB
go
create fulltext catalog Waldo
with accent_sensitivity = on as default
go
create table dbo.FullTextTest (
PhraseId int identity(1, 1)
,Phrase nvarchar(80)
constraint PK_FullTextTest primary key clustered (PhraseId asc)
)
go
insert FullTextTest (Phrase)
Values
('Wednesday Morning 3 am')
,('3')
,('30-Mar-1930 14:15')
,('3 blind mice')
,('Give me 3 steps')
,('10' )
,('300')
,('300 blind mice')
,('0.775E+5' )
,('11:15:32.25' )
,('north of here')
,('time waits for no one' )
,('-5' )
,('-59' )
,('0x123FED')
,('$9.95')
,('1,234,567')
,('12.3:46.2')
,('Neue Datenbankentwickler sind skeptisch.')
go
-- optionally, specify a language code from sys.syslanguages; the matching behavior varies slightly
create fulltext index on dbo.FullTextTest (Phrase /* language 1031 */ )
key index PK_FullTextTest on Waldo
WITH CHANGE_TRACKING AUTO
,STOPLIST = OFF
go
print 'waiting up to 10 seconds for fulltext population ... ';
declare @ExpireTime datetime2 = DATEADD(ms, 10000, sysutcdatetime());
while (
fulltextcatalogproperty(N'Waldo', 'PopulateStatus') > 0 -- 0 -> Idle
and sysutcdatetime() < @ExpireTime
)
begin
waitfor delay '00:00:00.5' -- check every half second @recheckIncrement;
end
if fulltextcatalogproperty(N'Waldo', 'PopulateStatus') > 0
print 'not populated'
else
print 'population complete'
select PhraseId, Phrase as [LIKE 'n%']
from FullTextTest v
where Phrase like 'n%';
select PhraseId, Phrase as [CONTAINS (Phrase,"n*")]
from FullTextTest v
where contains ( v.Phrase,'"n*"');
select PhraseId, Phrase as [CONTAINS (Phrase,"d*")]
from FullTextTest v
where contains ( v.Phrase,'"d*"');
select PhraseId, Phrase as [CONTAINS (Phrase,"t*")]
from FullTextTest v
where contains ( v.Phrase,'"t*"');
October 1, 2018 at 7:52 am
Entirely curiosity on my part, but has anyone tried testing that code with the Phrase column being varchar instead of nvarchar ? I don't have the time to do it myself at the moment...
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
October 5, 2018 at 11:23 am
rschoedel - Friday, September 28, 2018 11:52 AMOne of my company's testers wrote up a bug about an area in our application returning more results than expected when searching for the letter n.
On further investigation, we found something quite startling -- using the fulltext CONTAINS function to match "n*" , "d*" and "t*" are matching many (but not all) records where the text contains numeric, date and time data, respectively.For example,
I'm able to replicate this behavior on SQL 2017, 2016 and 2012. I imagine it happens on 2014 and older versions, but I don't have instances handy.
This behavior seems to be irrespective of the use of stoplists and other fulltext catalog and index settings. Setting a specific language may affect the variety of formats of of numeric values returned. I've pasted a repro script below.Is anyone aware of this behavior? If there are undocumented wildcards, are there similarly undocumented escape characters to prevent unintentional matches?
This is one of those issues where it's hard to believe that this behavior hasn't been discovered previously, but I'm not finding any references to it here on SQLCentral, or in the other usual places.
Thanks in advance,
Rob.Update: There is a
ConnectUserVoice issue logged, https://feedback.azure.com/forums/908035-sql-server/suggestions/34442179-full-text-search-contains-single-character-n-mat
which links back to a StackOverflow post with some further examples.-- create a database, then create a full text catalog, table, index, add some data, run a couple queries
use master-- drop database FTTestDB
drop database if exists FTTestDB
go
create database FTTestDBgo
use FTTestDB
gocreate fulltext catalog Waldo
with accent_sensitivity = on as default
gocreate table dbo.FullTextTest (
PhraseId int identity(1, 1)
,Phrase nvarchar(80)
constraint PK_FullTextTest primary key clustered (PhraseId asc)
)
goinsert FullTextTest (Phrase)
Values
('Wednesday Morning 3 am')
,('3')
,('30-Mar-1930 14:15')
,('3 blind mice')
,('Give me 3 steps')
,('10' )
,('300')
,('300 blind mice')
,('0.775E+5' )
,('11:15:32.25' )
,('north of here')
,('time waits for no one' )
,('-5' )
,('-59' )
,('0x123FED')
,('$9.95')
,('1,234,567')
,('12.3:46.2')
,('Neue Datenbankentwickler sind skeptisch.')go
-- optionally, specify a language code from sys.syslanguages; the matching behavior varies slightly
create fulltext index on dbo.FullTextTest (Phrase /* language 1031 */ )
key index PK_FullTextTest on Waldo
WITH CHANGE_TRACKING AUTO
,STOPLIST = OFF
goprint 'waiting up to 10 seconds for fulltext population ... ';
declare @ExpireTime datetime2 = DATEADD(ms, 10000, sysutcdatetime());while (
fulltextcatalogproperty(N'Waldo', 'PopulateStatus') > 0 -- 0 -> Idle
and sysutcdatetime() < @ExpireTime
)
begin
waitfor delay '00:00:00.5' -- check every half second @recheckIncrement;
endif fulltextcatalogproperty(N'Waldo', 'PopulateStatus') > 0
print 'not populated'
else
print 'population complete'select PhraseId, Phrase as [LIKE 'n%']
from FullTextTest v
where Phrase like 'n%';select PhraseId, Phrase as [CONTAINS (Phrase,"n*")]
from FullTextTest v
where contains ( v.Phrase,'"n*"');select PhraseId, Phrase as [CONTAINS (Phrase,"d*")]
from FullTextTest v
where contains ( v.Phrase,'"d*"');select PhraseId, Phrase as [CONTAINS (Phrase,"t*")]
from FullTextTest v
where contains ( v.Phrase,'"t*"');
It would be the same with nvarchar and varchar. Using the contains prefix match wildcard isn't like doing a regular wildcard search. Fulltext search and the regular string searches we are used to are different. Fulltext search knows about words and phrases and that's what it was designed for so it's going to be different. Those queries with single letters are going to have unexpected results. If you want to see why the t* and n* return so many rows, take a look at how things are indexed for the searches using sys.dm_fts_index_keywords. For this database and table, you would do something like: USE FTTestDB
GO
SELECT * FROM sys.dm_fts_index_keywords(DB_ID('FTTestDB'), OBJECT_ID('dbo.FullTextTest'))
Fulltext search Indexed numbers end up stored prefixed with NN<number> and times are prefixed with TT<number>
Sue
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply