Reading data from 0x05

  • Hi

    We currently have a table, with a column of a varbinary(MAX).

    This contains data which precedes with 0x50 – from what I understand means that the data has a ZIP format and I believe it has been compressed.

    We need to be able to read this data.  I know that this is a .docx document.

    Any help is much appreciated.

     

    Many Thanks

    Elle

  • To reduce file size, MS Word documents are compressed using PKZIP format (a form of LZ compression). You can leverage BCP to export contents of a VARBINARY column to a file. In this case, you would be exporting contents from a specific row into a file with .docx extension.

    http://www.sqlusa.com/bestpractices2005/importimage/

     

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • What do you wish to read it with, or how do you wish to process it?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Thanks for your replies.

    We need to read it using either a T-SQL script or SAP Data Services.

    The data will not be extracted in bulk but extracted each hour.

  • Once the varbinary data has been SELECTed, what do you want to happen to it next.

    Using poorly defined terms like 'read' and 'extract' does not make it clear exactly what you are trying to do.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Sorry Phil.

    Thanks, we need to loop/search through the SELECTed data and pick out specific items of data.

    The .docx has been created like a form.

  • So you want to search the text embedded inside a DOCX file that has been stored inside a VARBINARY column, is that right?

    So, notionally, like this?

    SELECT cols 
    FROM tbl
    WHERE VarCol like '%free beer%'

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Yes that's correct.

  • There is no fast way of doing this. As Eric mentioned, DOCX files are zipped XML files.

    Searching through them would involve unzipping them first.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • SQL is probably the wrong tool for this. Not only do you need to unzip the data, but you then need to navigate the internal file structure, extract the relevant XML data and then search it for text data - it might be possible, but it sure won't be pretty.

    You're probably better off using a tool written in VB or C#, Microsoft provide libraries for manipulating Office documents that make the whole process vastly less painful.

  • Thanks for your input.. much appreciated.  I'll take a look into these methods.

  • The fact that the documents are contained inside a database column is actually working against you. If the documents were contained in a folder, then several options would be available for searching: SQL Server Full Text Search, Azure Search, etc. The open source tool dnGrep can search inside .zip, .docx, and .pdf files. It appears to support command line parameters, so if you can extract the contents of the binary object to a file, then perhaps this tool can be leveraged. How practical this would be depends on how many documents you need to scan at a time.

    https://github.com/dnGrep/dnGrep/wiki/Command-Line

    Ideally, whatever application initially inserts the documents into the database should be doing some keyword indexing upfront. You need a table to link each document to instances of your common search terms.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • In theory you could do the following:

    1. Set up a FILESTREAM column to contain the DOCX files. This column could be in the existing table or you could create a new table just to store the DOCX files. Copy all the files into the FILESTREAM column, this stores them in the underlying file system but they are still part of the database and you can process them using TSQL.
    2. Enable Full Text Search (FTS) on your SQL Server.
    3. You can then search the DOCX files using the FTS operators such as CONTAINS, FREETEXT etc. There are many special FTS operators that should cover all needs, including semantic search. The FTS engine should automatically handle the different formats and you shouldn't have to unzip the files.
    4. This should also work with other file formats, such as earlier versions of Word and PDF files.

    Disclaimer - I believe the above is correct but I haven't tried it.

     

    • This reply was modified 5 years, 6 months ago by  William Rayer.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply