SQLServerCentral Article

Unzipping Word Documents in T-SQL

,

In the databases I am currently working with, there are some tables containing Word documents stored as binaries and I was wondering if it would be possible to perform some server-side processing on these documents.

Modern Word documents (.docx) are nothing more than zip files and SQL Server 2016 introduced the COMPRESS and DECOMPRESS functions, which understand the GZIP format. It turns out that the various XML files that make a Word document are compressed as GZIP files (while other files, like images, are uncompressed).

We thus need to understand the binary format of a ZIP file, write some T-SQL code that retrieve the individual elements of the file, and use DECOMPRESS to uncompress them. I found easy to understand documentation about the ZIP file at this link. It allowed me to write the following table-value function that returns one record per element of the Word document:

CREATE OR ALTER FUNCTION [dbo].[GetZipContent](@B VARBINARY(MAX))
RETURNS @Result TABLE (Filename VARCHAR(255) PRIMARY KEY, Content VARBINARY(MAX), Status VARCHAR(20)) AS BEGIN
DECLARE 
    @I INT, @II INT,
    @TotalNumber INT,
    @CurrentNumber INT,
    @CentralDirectoryOffset BIGINT, 
    @CompressionMethod INT, --0 (none) or 8 (deflate)
    @CRCOfUncompressedData VARBINARY(4),
    @CompressedSize BIGINT,
    @UnompressedSize VARBINARY(4),
    @FileNameLength INT,
    @ExtraFieldLength INT,
    @FileCommentLength INT,
    @FileHeaderOffset BIGINT,
    @FileName VARCHAR(255)

--Test if valid zip file
IF @B IS NULL OR LEN(@B) < 4 OR SUBSTRING(@B, 1, 1) <> 0x50 OR SUBSTRING(@B, 2, 1) <> 0x4b RETURN

/*
Find End of Central Directory
*/
--Signature(0x06054b50)
SET @I = LEN(@B) - 4
WHILE @I > 0 BEGIN
    IF SUBSTRING(@B, @I, 1) = 0x50 AND SUBSTRING(@B, @I + 1, 1) = 0x4b AND SUBSTRING(@B, @I + 2, 1) = 0x05 AND SUBSTRING(@B, @I + 3, 1) = 0x06 BREAK
    SET @I = @I - 1
END
IF @I = 0 RETURN --No EOCD found 
SET @I = @I + 4
SET @I = @I + 2 --Number of this disk
SET @I = @I + 2 --Disk where central directory starts
SET @I = @I + 2 --Numbers of central directory records on this disk
SET @TotalNumber = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1)
SET @I = @I + 2
SET @I = @I + 4 --Size of central directory in bytes
SET @CentralDirectoryOffset = 16777216 * SUBSTRING(@B, @I + 3, 1) + 65536 * SUBSTRING(@B, @I + 2, 1) + 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1)
SET @I = @I + 4
--SET @I = @I + 2 --Comment length
--SET @I = @I + n --Comment

/*
LOOP THRU Central Directory
*/
SET @I = @CentralDirectoryOffset + 1; --TSQL is 1-based
SET @CurrentNumber = 1
WHILE @CurrentNumber <= @TotalNumber BEGIN
    SET @CurrentNumber = @CurrentNumber + 1
    SET @I = @I + 4 --Signature(0x02014b50)
    SET @I = @I + 2 --Version made by
    SET @I = @I + 2 --Minimum version needed to extract
    SET @I = @I + 2 --Bit flag
    --Compression method
    SET @CompressionMethod = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1)
    SET @I = @I + 2
    SET @I = @I + 2 --File last modification time(MS - DOS format)
    SET @I = @I + 2 --File last modification date(MS - DOS format)
    --CRC - 32 of uncompressed data
    SET @CRCOfUncompressedData = SUBSTRING(@B, @I, 4)
    SET @I = @I + 4
    --Compressed size
    SET @CompressedSize = 16777216 * SUBSTRING(@B, @I + 3, 1) + 65536 * SUBSTRING(@B, @I + 2, 1) + 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1)
    SET @I = @I + 4
    --Uncompressed size
    SET @UnompressedSize = SUBSTRING(@B, @I, 4)
    SET @I = @I + 4
    --File name length(n)
    SET @FileNameLength = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1)
    SET @I = @I + 2;
    --Extra field length(m)
    SET @ExtraFieldLength = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1)
    SET @I = @I + 2;
    --File comment length(k)
    SET @FileCommentLength = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1)
    SET @I = @I + 2;
    SET @I = @I + 2 --Disk number where file starts
    SET @I = @I + 2 --Internal file attributes
    SET @I = @I + 4 --External file attributes
    --Offset of local file header(from start of disk)
    SET @FileHeaderOffset = 16777216 * SUBSTRING(@B, @I + 3, 1) + 65536 * SUBSTRING(@B, @I + 2, 1) + 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1)
    SET @I = @I + 4
    --File name
    SET @FileName = SUBSTRING(@B, @I, @FileNameLength)
    SET @I = @I + @FileNameLength
    SET @I = @I + @ExtraFieldLength --Extra field
    SET @I = @I + @FileCommentLength --File comment

    /*
    Skip file header then access file data
    */
    --@I points at the next central directory item. Do not modify it
    SET @II = @FileHeaderOffset + 1 --TSQL is 1-based
    SET @II = @II + 4 --Signature(0x04034b50)
    SET @II = @II + 2 --Minimum version needed to extract
    SET @II = @II + 2 --Bit flag
    SET @II = @II + 2 --Compression method
    SET @II = @II + 2 --File last modification time(MS - DOS format)
    SET @II = @II + 2 --File last modification date(MS - DOS format)
    SET @II = @II + 4 --CRC - 32 of uncompressed data
    SET @II = @II + 4 --Compressed size
    SET @II = @II + 4 --Uncompressed size
    --File name length(n)
    SET @FileNameLength = 256 * SUBSTRING(@B, @II + 1, 1) + SUBSTRING(@B, @II, 1)
    SET @II = @II + 2;
    --Extra field length(m)
    SET @ExtraFieldLength = 256 * SUBSTRING(@B, @II + 1, 1) + SUBSTRING(@B, @II, 1)
    SET @II = @II + 2;
    SET @II = @II + @FileNameLength --File Name
    SET @II = @II + @ExtraFieldLength --Extra Field

    IF @CompressionMethod = 0
        INSERT INTO @Result SELECT @FileName, SUBSTRING(@B, @II, @CompressedSize), 'Not compressed'
    ELSE IF @CompressedSize = 0
        INSERT INTO @Result SELECT @FileName, NULL, 'Empty'
    ELSE --wrap compress data in GZIP header-footer and decompress
        INSERT INTO @Result SELECT @FileName, DECOMPRESS(0x1f8b0800000000000400 + SUBSTRING(@B, @II, @CompressedSize) + @CRCOfUncompressedData + @UnompressedSize), 'Success'
END
RETURN
END

Here is an example of how to call GetZipContent.

DECLARE @DocumentContent VARBINARY(MAX) = (
    SELECT BulkColumn
    FROM OPENROWSET(BULK N'C:\Temp\allo.docx', SINGLE_BLOB) AS Document
)
SELECT Filename, CONVERT(XML, Content) AS XmlContent, Status FROM GetZipContent(@DocumentContent)

The result is:

Filename                       XmlContent                                              Status
------------------------------ ------------------------------------------------------- -------
[Content_Types].xml            <Types xmlns="http://schemas.openxmlformats.org/...     Success
_rels/.rels                    <Relationships xmlns="http://schemas.openxmlform...     Success
docProps/app.xml               <Properties xmlns="http://schemas.openxmlformats...     Success
docProps/core.xml              <cp:coreProperties xmlns:cp="http://schemas.open...     Success
word/_rels/document.xml.rels   <Relationships xmlns="http://schemas.openxmlform...     Success
word/document.xml              <w:document xmlns:wpc="http://schemas.microsoft....     Success
word/fontTable.xml             <w:fonts xmlns:mc="http://schemas.openxmlformats...     Success
word/settings.xml              <w:settings xmlns:mc="http://schemas.openxmlform...     Success
word/styles.xml                <w:styles xmlns:mc="http://schemas.openxmlformat...     Success
word/theme/theme1.xml          <a:theme xmlns:a="http://schemas.openxmlformats....     Success
word/webSettings.xml           <w:webSettings xmlns:mc="http://schemas.openxmlf...     Success

GetZipContent successfully uncompressed over 10k documents contained in my databases. It can certainly be enhanced to handle generic zip files but no one in its right mind would contemplate the idea of natively unzipping files on a SQL Server!

Rate

5 (4)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (4)

You rated this post out of 5. Change rating