May 20, 2019 at 1:54 pm
Hi
We currently have a table, with a column of a varbinary(MAX).
This contains data which precedes with 0x50 – from what I understand means that the data has a ZIP format and I believe it has been compressed.
We need to be able to read this data. I know that this is a .docx document.
Any help is much appreciated.
Many Thanks
Elle
May 20, 2019 at 3:04 pm
To reduce file size, MS Word documents are compressed using PKZIP format (a form of LZ compression). You can leverage BCP to export contents of a VARBINARY column to a file. In this case, you would be exporting contents from a specific row into a file with .docx extension.
http://www.sqlusa.com/bestpractices2005/importimage/
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
May 20, 2019 at 5:37 pm
What do you wish to read it with, or how do you wish to process it?
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
May 21, 2019 at 8:52 am
Thanks for your replies.
We need to read it using either a T-SQL script or SAP Data Services.
The data will not be extracted in bulk but extracted each hour.
May 21, 2019 at 2:53 pm
Once the varbinary data has been SELECTed, what do you want to happen to it next.
Using poorly defined terms like 'read' and 'extract' does not make it clear exactly what you are trying to do.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
May 21, 2019 at 3:00 pm
Sorry Phil.
Thanks, we need to loop/search through the SELECTed data and pick out specific items of data.
The .docx has been created like a form.
May 21, 2019 at 3:06 pm
So you want to search the text embedded inside a DOCX file that has been stored inside a VARBINARY column, is that right?
So, notionally, like this?
SELECT cols
FROM tbl
WHERE VarCol like '%free beer%'
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
May 21, 2019 at 3:13 pm
Yes that's correct.
May 21, 2019 at 4:43 pm
There is no fast way of doing this. As Eric mentioned, DOCX files are zipped XML files.
Searching through them would involve unzipping them first.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
May 21, 2019 at 6:12 pm
SQL is probably the wrong tool for this. Not only do you need to unzip the data, but you then need to navigate the internal file structure, extract the relevant XML data and then search it for text data - it might be possible, but it sure won't be pretty.
You're probably better off using a tool written in VB or C#, Microsoft provide libraries for manipulating Office documents that make the whole process vastly less painful.
May 23, 2019 at 9:39 am
Thanks for your input.. much appreciated. I'll take a look into these methods.
May 23, 2019 at 1:23 pm
The fact that the documents are contained inside a database column is actually working against you. If the documents were contained in a folder, then several options would be available for searching: SQL Server Full Text Search, Azure Search, etc. The open source tool dnGrep can search inside .zip, .docx, and .pdf files. It appears to support command line parameters, so if you can extract the contents of the binary object to a file, then perhaps this tool can be leveraged. How practical this would be depends on how many documents you need to scan at a time.
https://github.com/dnGrep/dnGrep/wiki/Command-Line
Ideally, whatever application initially inserts the documents into the database should be doing some keyword indexing upfront. You need a table to link each document to instances of your common search terms.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
May 29, 2019 at 12:16 pm
In theory you could do the following:
Disclaimer - I believe the above is correct but I haven't tried it.
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply