Reading data from 0x05

Question

Reading data from 0x05

LS

Old Hand

Points: 331
More actions
May 20, 2019 at 1:54 pm

#3645366

Hi
We currently have a table, with a column of a varbinary(MAX).
This contains data which precedes with 0x50 – from what I understand means that the data has a ZIP format and I believe it has been compressed.
We need to be able to read this data. I know that this is a .docx document.
Any help is much appreciated.

Many Thanks
Elle

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

Eric M Russell SSC Guru Points: 125562 More actions · Answer 1

To reduce file size, MS Word documents are compressed using PKZIP format (a form of LZ compression). You can leverage BCP to export contents of a VARBINARY column to a file. In this case, you would be exporting contents from a specific row into a file with .docx extension.

http://www.sqlusa.com/bestpractices2005/importimage/

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Phil Parkin SSC Guru Points: 246979 More actions · Answer 2

What do you wish to read it with, or how do you wish to process it?

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

LS Old Hand Points: 331 More actions · Answer 3

Thanks for your replies.

We need to read it using either a T-SQL script or SAP Data Services.

The data will not be extracted in bulk but extracted each hour.

Phil Parkin SSC Guru Points: 246979 More actions · Answer 4

Once the varbinary data has been SELECTed, what do you want to happen to it next.

Using poorly defined terms like 'read' and 'extract' does not make it clear exactly what you are trying to do.

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

LS Old Hand Points: 331 More actions · Answer 5

Sorry Phil.

Thanks, we need to loop/search through the SELECTed data and pick out specific items of data.

The .docx has been created like a form.

Phil Parkin SSC Guru Points: 246979 More actions · Answer 6

So you want to search the text embedded inside a DOCX file that has been stored inside a VARBINARY column, is that right?

So, notionally, like this?

SELECT cols 
FROM tbl
WHERE VarCol like '%free beer%'

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

LS Old Hand Points: 331 More actions · Answer 7

LS

Old Hand

Points: 331

May 21, 2019 at 3:13 pm

#3645753

Yes that's correct.

Phil Parkin SSC Guru Points: 246979 More actions · Answer 8

There is no fast way of doing this. As Eric mentioned, DOCX files are zipped XML files.

Searching through them would involve unzipping them first.

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

andycadley SSCertifiable Points: 5296 More actions · Answer 9

SQL is probably the wrong tool for this. Not only do you need to unzip the data, but you then need to navigate the internal file structure, extract the relevant XML data and then search it for text data - it might be possible, but it sure won't be pretty.

You're probably better off using a tool written in VB or C#, Microsoft provide libraries for manipulating Office documents that make the whole process vastly less painful.

LS Old Hand Points: 331 More actions · Answer 10

Thanks for your input.. much appreciated. I'll take a look into these methods.

Eric M Russell SSC Guru Points: 125562 More actions · Answer 11

The fact that the documents are contained inside a database column is actually working against you. If the documents were contained in a folder, then several options would be available for searching: SQL Server Full Text Search, Azure Search, etc. The open source tool dnGrep can search inside .zip, .docx, and .pdf files. It appears to support command line parameters, so if you can extract the contents of the binary object to a file, then perhaps this tool can be leveraged. How practical this would be depends on how many documents you need to scan at a time.

https://github.com/dnGrep/dnGrep/wiki/Command-Line

Ideally, whatever application initially inserts the documents into the database should be doing some keyword indexing upfront. You need a table to link each document to instances of your common search terms.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

William Rayer SSC Eights! Points: 865 More actions · Answer 12

In theory you could do the following:

Set up a FILESTREAM column to contain the DOCX files. This column could be in the existing table or you could create a new table just to store the DOCX files. Copy all the files into the FILESTREAM column, this stores them in the underlying file system but they are still part of the database and you can process them using TSQL.
Enable Full Text Search (FTS) on your SQL Server.
You can then search the DOCX files using the FTS operators such as CONTAINS, FREETEXT etc. There are many special FTS operators that should cover all needs, including semantic search. The FTS engine should automatically handle the different formats and you shouldn't have to unzip the files.
This should also work with other file formats, such as earlier versions of Word and PDF files.

Disclaimer - I believe the above is correct but I haven't tried it.

This reply was modified 5 years, 6 months ago by William Rayer.