SQL Server Semantic Search (Find Text in External Files)

  • Dear community,

    I am trying to search word documents (docx) for a particular drug name and return the file name of a successful match. I am in the testing phase and with the assistance of Daniel Calbimonte article (link below) I have a better understanding on how to achieve this.

    The issue I am faced with is that I cannot get it to work with word documents that have an extension of docx? It does work however with word documents with an extension of doc.

    I can change the extension from docx to doc with visual basic but I am being extra cautious and would prefer not to go down that path. I may have know alternative but to use the visual basic code to achieve my end goal but hoping not.

    Overall their are tens of thousands of word documents spanning several years. I do intend to create a copy of these files within the file table directory which will ensure the source files remain intact.

    SQL Server 2016 SP3 (SemanticLanguageDatabase and FilterPack installed)

    SQL Server Semantic Search to Find Text in External Files (mssqltips.com)

    • This topic was modified 8 months, 1 week ago by  Stephen Butland. Reason: I edited due to the T-SQL file being refused upload. I created with a txt extension and uploaded
    Attachments:
    You must be logged in to view attached files.
  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • Have you downloaded & installed the Microsoft Office 2010 Filter Packs? I'm not clear if that includes SP2, so you might also want Service Pack 2 for Microsoft Office 2010 Filter Pack.

     

  • My opinion - SQL is the wrong tool for the job. I would toss this into a tool that is designed for parsing Office documents such as SharePoint. IF that isn't an option, my GUESS (I haven't done this as I don't have a use case for it) is that the problem will be fixed in newer versions of SQL Server? The link you provided shows that docx is supported (link for those who didn't get it - https://www.mssqltips.com/sqlservertip/2769/sql-server-semantic-search-to-find-text-in-external-files/), in that link, they show that docx is supported when they ran the following SQL:

    USE FileSearchTest
    GO
    SELECT name
    FROM dbo.DocumentSemantics
    WHERE FREETEXT (file_stream, 'After climbing a very high mountain, we discovered that there are many other mountains to climb')

    After running that, they got 2 results, both docx files. My GUESS is that it is related to the SQL version (the page indicates that the feature was enabled in SQL 2012, but doesn't indicate what version they are testing with... it MAY be that they are using a newer version that supports docx), OR it could be that you have some features in the docx files that are not supported. What I mean is if the file is password protected (for example) or the Service doesn't have access to open the file (someone else has it open in exclusive mode for example), then SQL won't be able to do anything with the file.

    I know I am not comfortable opening attachments from random websites as they MAY contain malicious code. The txt fie is PROBABLY safe, but you never know what may happen with the text file interpreter being used (notepad for example). Doc and docx MAY have some code behind them (vb script, macros, etc). So I know I am not downloading those. Plus setting up a test bed for semantic search is a bit outside the scope of support I offer on a free forum. It's a neat tool, but there is no business case for it where I work.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Hi @ratbak,

    Thank you for your response. I have downloaded and installed the Microsoft Office 2010 Filter Pack Sp2. Unfortunately I am experiencing the issue with not recognising docx files with this installed.

     

  • Hi Mr Brian Gale,

    Thank you for your response. Unfortunately SharePoint isn't an option for me.

    You mention a more  recent installation of SQL Server. This is a good question and one that I have not tested. So, I spun up a virtual machine with SQL Server 2022 installed and configured. Installed the Microsoft Office 2010 Filter Pack Sp2. Unfortunately this presented the same result, only showing a positive result for word documents with an extension of doc.

    The text file included this script:-

    USE [master]

    GO

    SELECT SERVERPROPERTY('IsFullTextInstalled') AS [IsFullTextInstalled]

    CREATE DATABASE semanticsdb

    ON ( FILENAME = 'E:\Microsoft SQL Server\Data\semanticsdb.mdf' )

    LOG ON ( FILENAME = 'E:\Microsoft SQL Server\Log\semanticsdb_log.ldf' )

    FOR ATTACH;

    GO

    EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = N'semanticsdb';

    GO

    SELECT * FROM sys.fulltext_semantic_language_statistics_database;

    SELECT * FROM sys.fulltext_semantic_languages;

    USE [master]

    GO

    CREATE DATABASE [FileSearch]

    ON PRIMARY

    (

    NAME = N'FileSearch',

    FILENAME = N'E:\FileTable\FileSearch.mdf'

    ),

    FILEGROUP FilestreamFG CONTAINS FILESTREAM

    (

    NAME = MyFileStreamData,

    FILENAME= 'E:\FileTable\Data'

    )

    LOG ON

    (

    NAME = N'FileSearch_Log',

    FILENAME = N'E:\FileTable\FileSearch_log.ldf'

    )

    WITH FILESTREAM

    (

    NON_TRANSACTED_ACCESS = FULL,

    DIRECTORY_NAME = N'FileTable'

    )

    USE [FileSearch]

    GO

    CREATE TABLE FileSemantics AS FileTable

    WITH

    (

    FILETABLE_DIRECTORY = 'FileSemantics'

    ,FILETABLE_COLLATE_FILENAME = Latin1_General_CI_AI --database_default

    ,FILETABLE_PRIMARY_KEY_CONSTRAINT_NAME = PK_path_locator

    ,FILETABLE_STREAMID_UNIQUE_CONSTRAINT_NAME = UQ_stream_id

    ,FILETABLE_FULLPATH_UNIQUE_CONSTRAINT_NAME = UQ_parent_path_locator_and_name

    );

    GO

    USE [FileSearch]

    GO

    SELECT fs.[stream_id]

    ,fs.[file_stream]

    ,fs.[name]

    ,fs.[path_locator]

    ,fs.[parent_path_locator]

    ,fs.[file_type]

    ,fs.[cached_file_size]

    ,fs.[creation_time]

    ,fs.[last_write_time]

    ,fs.[last_access_time]

    ,fs.[is_directory]

    ,fs.[is_offline]

    ,fs.[is_hidden]

    ,fs.[is_readonly]

    ,fs.[is_archive]

    ,fs.[is_system]

    ,fs.[is_temporary]

    FROM [FileSearch].[dbo].[FileSemantics] AS fs

    ORDER BY fs.[name] ASC

    USE [FileSearch]

    GO

    CREATE FULLTEXT CATALOG [_FileFullTextSemanticSearch] WITH ACCENT_SENSITIVITY = OFF AS DEFAULT AUTHORIZATION [dbo]

    GO

    USE [FileSearch]

    GO

    CREATE FULLTEXT INDEX ON [dbo].[FileSemantics] KEY INDEX [PK_path_locator] ON ([_FileFullTextSemanticSearch]) WITH (CHANGE_TRACKING AUTO)

    GO

    ALTER FULLTEXT INDEX ON [dbo].[FileSemantics] ADD ([file_stream] TYPE COLUMN [file_type] STATISTICAL_SEMANTICS)

    GO

    ALTER FULLTEXT INDEX ON [dbo].[FileSemantics] ADD ([file_type] STATISTICAL_SEMANTICS)

    GO

    ALTER FULLTEXT INDEX ON [dbo].[FileSemantics] ADD ([name] STATISTICAL_SEMANTICS)

    GO

    ALTER FULLTEXT INDEX ON [dbo].[FileSemantics] ENABLE

    GO

    USE [FileSearch]

    GO

    DECLARE @FullTextCatalogName NVARCHAR(250)

    SET @FullTextCatalogName =N'_FileFullTextSemanticSearch'

    SELECT cat.name AS [Name]

    ,cat.fulltext_catalog_id AS [ID]

    ,CAST(FULLTEXTCATALOGPROPERTY(cat.name,'AccentSensitivity') AS bit) AS [IsAccentSensitive]

    ,CAST(cat.is_default AS bit) AS [IsDefault]

    ,dp.name AS [Owner]

    ,FULLTEXTCATALOGPROPERTY(cat.name,'LogSize') AS [ErrorLogSize]

    ,FULLTEXTCATALOGPROPERTY(cat.name,'IndexSize') AS [FullTextIndexSize (MB)]

    ,FULLTEXTCATALOGPROPERTY(cat.name,'ItemCount') AS [ItemCount]

    ,CASE WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (0)

    THEN ('Idle')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (1)

    THEN ('Full Population In Progress')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (2)

    THEN ('Paused')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (3)

    THEN ('Throttled')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (4)

    THEN ('Recovering')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (5)

    THEN ('Shutdown')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (6)

    THEN ('Incremental Population In Progress')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (7)

    THEN ('Building Index')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (8)

    THEN ('Disk Full. Paused')

    WHEN FULLTEXTCATALOGPROPERTY(cat.name,'PopulateStatus') = (9)

    THEN ('Change Tracking')

    END AS PopulateStatus

    FROM sys.fulltext_catalogs AS cat

    LEFT OUTER JOIN sys.filegroups AS fg  ON cat.data_space_id = fg.data_space_id

    LEFT OUTER JOIN sys.database_principals AS dp  ON cat.principal_id = dp.principal_id

    WHERE cat.name = ( @FullTextCatalogName )

    GO

    USE [FileSearch]

    GO

    DECLARE @SearchPhrase NVARCHAR(100)

    SET @SearchPhrase ='venlafaxine'

    SELECT fs.[name]

    FROM [FileSearch].[dbo].[FileSemantics] AS fs

    WHERE FREETEXT(file_stream, @SearchPhrase);

    Each word document had the following entered "find drug name venlafaxine".

    Its probably something really simple that I am missing and can only apologise for my ignorance. Any help would be greatly received.

     

  • Probably a dumb question but did you restart the services after installing the filter pack AND did you install the filter pack on the server (not your local machine)?

    Not trying to question your technical skills, I just have seen others (and myself) make silly mistakes like that before.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Not a dumb question at all. I had restarted the services after installing the filter pack which was installed on the server.

    Still no joy but I think it certainly has something to do with the filter pack just not sure how to resolve. It is as if the server is not recognising the filter pack is installed. I am wondering whether it requires registering, I thought that would be done during the install but maybe I am mistaken.

    Thank you so much for your assistance thus far!

  • Hi Mr Brian Gale, thank you so much for your assistance, I have now got it working.

    I had to execute the following to load the filter pack: -

    EXEC sp_help_fulltext_system_components 'filter';

    EXEC sp_fulltext_service @action='load_os_resources', @value=1;

    EXEC sp_fulltext_service 'update_languages';

    EXEC sp_fulltext_service 'restart_all_fdhosts';

    EXEC sp_help_fulltext_system_components 'filter';

    Again, thank you all for your assistance.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply