Undocumented "feature"(?) : full text CONTAINS (column_name, '"n*"') is matching numeric values

  • One of my company's testers wrote up a bug about an area in our application returning more results than expected when searching for the letter n.  
    On further investigation, we found something quite startling -- using the fulltext CONTAINS function to match "n*" , "d*" and "t*" are matching many (but not all) records where the text contains numeric, date and time data, respectively.   

    For example, 

    I'm able to replicate this behavior on SQL 2017, 2016 and 2012.  I imagine it happens on 2014 and older versions, but I don't have instances handy. 
    This behavior seems to be irrespective of the use of stoplists and other fulltext catalog and index settings.  Setting a specific language may affect the variety of formats of of numeric values returned.  I've pasted a repro script below.

    Is anyone aware of this behavior?  If there are undocumented wildcards, are there similarly undocumented escape characters to prevent unintentional matches?
    This is one of those issues where it's hard to believe that this behavior hasn't been discovered previously, but I'm not finding any references to it here on SQLCentral, or in the other usual places.
    Thanks in advance,
    Rob.

    Update:  There is a Connect UserVoice issue logged, https://feedback.azure.com/forums/908035-sql-server/suggestions/34442179-full-text-search-contains-single-character-n-mat
    which links back to a StackOverflow post with some further examples.

    -- create a database, then create a full text catalog, table, index, add some data, run a couple queries
    use master

    -- drop database FTTestDB
    drop database if exists FTTestDB
    go
    create database FTTestDB

    go
    use FTTestDB
    go

    create fulltext catalog Waldo
     with accent_sensitivity = on as default
    go

    create table dbo.FullTextTest (
      PhraseId int identity(1, 1)
     ,Phrase nvarchar(80)
     constraint PK_FullTextTest primary key clustered (PhraseId asc)
     )
    go

    insert FullTextTest (Phrase)
    Values
      ('Wednesday Morning 3 am')
      ,('3')
      ,('30-Mar-1930 14:15')
      ,('3 blind mice')
      ,('Give me 3 steps')
      ,('10' )
      ,('300')
      ,('300 blind mice')
      ,('0.775E+5' )
      ,('11:15:32.25' )
      ,('north of here')
      ,('time waits for no one' )
      ,('-5' )
      ,('-59' )
      ,('0x123FED')
      ,('$9.95')
      ,('1,234,567')
      ,('12.3:46.2')
      ,('Neue Datenbankentwickler sind skeptisch.')

    go

    -- optionally, specify a language code from sys.syslanguages; the matching behavior varies slightly
    create fulltext index on dbo.FullTextTest (Phrase /* language 1031 */ )
      key index PK_FullTextTest on Waldo
      WITH CHANGE_TRACKING AUTO
       ,STOPLIST = OFF
    go

    print 'waiting up to 10 seconds for fulltext population ... ';
    declare @ExpireTime datetime2 = DATEADD(ms, 10000, sysutcdatetime());

    while (
      fulltextcatalogproperty(N'Waldo', 'PopulateStatus') > 0 -- 0 -> Idle
      and sysutcdatetime() < @ExpireTime
      )
    begin
     waitfor delay '00:00:00.5' -- check every half second @recheckIncrement;
    end

    if fulltextcatalogproperty(N'Waldo', 'PopulateStatus') > 0
      print 'not populated'
    else
      print 'population complete'

    select PhraseId, Phrase as [LIKE 'n%']
    from FullTextTest v
    where Phrase like 'n%';

    select PhraseId, Phrase as [CONTAINS (Phrase,"n*")]
    from FullTextTest v
    where contains ( v.Phrase,'"n*"');

    select PhraseId, Phrase as [CONTAINS (Phrase,"d*")]
    from FullTextTest v
    where contains ( v.Phrase,'"d*"');

    select PhraseId, Phrase as [CONTAINS (Phrase,"t*")]
    from FullTextTest v
    where contains ( v.Phrase,'"t*"');

  • Entirely curiosity on my part, but has anyone tried testing that code with the Phrase column being varchar instead of nvarchar ?   I don't have the time to do it myself at the moment...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • rschoedel - Friday, September 28, 2018 11:52 AM

    One of my company's testers wrote up a bug about an area in our application returning more results than expected when searching for the letter n.  
    On further investigation, we found something quite startling -- using the fulltext CONTAINS function to match "n*" , "d*" and "t*" are matching many (but not all) records where the text contains numeric, date and time data, respectively.   

    For example, 

    I'm able to replicate this behavior on SQL 2017, 2016 and 2012.  I imagine it happens on 2014 and older versions, but I don't have instances handy. 
    This behavior seems to be irrespective of the use of stoplists and other fulltext catalog and index settings.  Setting a specific language may affect the variety of formats of of numeric values returned.  I've pasted a repro script below.

    Is anyone aware of this behavior?  If there are undocumented wildcards, are there similarly undocumented escape characters to prevent unintentional matches?
    This is one of those issues where it's hard to believe that this behavior hasn't been discovered previously, but I'm not finding any references to it here on SQLCentral, or in the other usual places.
    Thanks in advance,
    Rob.

    Update:  There is a Connect UserVoice issue logged, https://feedback.azure.com/forums/908035-sql-server/suggestions/34442179-full-text-search-contains-single-character-n-mat
    which links back to a StackOverflow post with some further examples.

    -- create a database, then create a full text catalog, table, index, add some data, run a couple queries
    use master

    -- drop database FTTestDB
    drop database if exists FTTestDB
    go
    create database FTTestDB

    go
    use FTTestDB
    go

    create fulltext catalog Waldo
     with accent_sensitivity = on as default
    go

    create table dbo.FullTextTest (
      PhraseId int identity(1, 1)
     ,Phrase nvarchar(80)
     constraint PK_FullTextTest primary key clustered (PhraseId asc)
     )
    go

    insert FullTextTest (Phrase)
    Values
      ('Wednesday Morning 3 am')
      ,('3')
      ,('30-Mar-1930 14:15')
      ,('3 blind mice')
      ,('Give me 3 steps')
      ,('10' )
      ,('300')
      ,('300 blind mice')
      ,('0.775E+5' )
      ,('11:15:32.25' )
      ,('north of here')
      ,('time waits for no one' )
      ,('-5' )
      ,('-59' )
      ,('0x123FED')
      ,('$9.95')
      ,('1,234,567')
      ,('12.3:46.2')
      ,('Neue Datenbankentwickler sind skeptisch.')

    go

    -- optionally, specify a language code from sys.syslanguages; the matching behavior varies slightly
    create fulltext index on dbo.FullTextTest (Phrase /* language 1031 */ )
      key index PK_FullTextTest on Waldo
      WITH CHANGE_TRACKING AUTO
       ,STOPLIST = OFF
    go

    print 'waiting up to 10 seconds for fulltext population ... ';
    declare @ExpireTime datetime2 = DATEADD(ms, 10000, sysutcdatetime());

    while (
      fulltextcatalogproperty(N'Waldo', 'PopulateStatus') > 0 -- 0 -> Idle
      and sysutcdatetime() < @ExpireTime
      )
    begin
     waitfor delay '00:00:00.5' -- check every half second @recheckIncrement;
    end

    if fulltextcatalogproperty(N'Waldo', 'PopulateStatus') > 0
      print 'not populated'
    else
      print 'population complete'

    select PhraseId, Phrase as [LIKE 'n%']
    from FullTextTest v
    where Phrase like 'n%';

    select PhraseId, Phrase as [CONTAINS (Phrase,"n*")]
    from FullTextTest v
    where contains ( v.Phrase,'"n*"');

    select PhraseId, Phrase as [CONTAINS (Phrase,"d*")]
    from FullTextTest v
    where contains ( v.Phrase,'"d*"');

    select PhraseId, Phrase as [CONTAINS (Phrase,"t*")]
    from FullTextTest v
    where contains ( v.Phrase,'"t*"');

    It would be the same with nvarchar and varchar. Using the contains prefix match wildcard isn't like doing a regular wildcard search. Fulltext search and the regular string searches we are used to are different. Fulltext search knows about words and phrases and that's what it was designed for so it's going to be different. Those queries with single letters are going to have unexpected results. If you want to see why the t* and n* return so many rows, take a look at how things are indexed for the searches using sys.dm_fts_index_keywords. For this database and table, you would do something like:
    USE FTTestDB
    GO
    SELECT * FROM sys.dm_fts_index_keywords(DB_ID('FTTestDB'), OBJECT_ID('dbo.FullTextTest'))

    Fulltext search Indexed numbers end up stored prefixed with NN<number> and times are prefixed with TT<number>

    Sue

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply