In the databases I am currently working with, there are some tables containing Word documents stored as binaries and I was wondering if it would be possible to perform some server-side processing on these documents.
Modern Word documents (.docx) are nothing more than zip files and SQL Server 2016 introduced the COMPRESS and DECOMPRESS functions, which understand the GZIP format. It turns out that the various XML files that make a Word document are compressed as GZIP files (while other files, like images, are uncompressed).
We thus need to understand the binary format of a ZIP file, write some T-SQL code that retrieve the individual elements of the file, and use DECOMPRESS to uncompress them. I found easy to understand documentation about the ZIP file at this link. It allowed me to write the following table-value function that returns one record per element of the Word document:
CREATE OR ALTER FUNCTION [dbo].[GetZipContent](@B VARBINARY(MAX)) RETURNS @Result TABLE (Filename VARCHAR(255) PRIMARY KEY, Content VARBINARY(MAX), Status VARCHAR(20)) AS BEGIN DECLARE @I INT, @II INT, @TotalNumber INT, @CurrentNumber INT, @CentralDirectoryOffset BIGINT, @CompressionMethod INT, --0 (none) or 8 (deflate) @CRCOfUncompressedData VARBINARY(4), @CompressedSize BIGINT, @UnompressedSize VARBINARY(4), @FileNameLength INT, @ExtraFieldLength INT, @FileCommentLength INT, @FileHeaderOffset BIGINT, @FileName VARCHAR(255) --Test if valid zip file IF @B IS NULL OR LEN(@B) < 4 OR SUBSTRING(@B, 1, 1) <> 0x50 OR SUBSTRING(@B, 2, 1) <> 0x4b RETURN /* Find End of Central Directory */ --Signature(0x06054b50) SET @I = LEN(@B) - 4 WHILE @I > 0 BEGIN IF SUBSTRING(@B, @I, 1) = 0x50 AND SUBSTRING(@B, @I + 1, 1) = 0x4b AND SUBSTRING(@B, @I + 2, 1) = 0x05 AND SUBSTRING(@B, @I + 3, 1) = 0x06 BREAK SET @I = @I - 1 END IF @I = 0 RETURN --No EOCD found SET @I = @I + 4 SET @I = @I + 2 --Number of this disk SET @I = @I + 2 --Disk where central directory starts SET @I = @I + 2 --Numbers of central directory records on this disk SET @TotalNumber = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1) SET @I = @I + 2 SET @I = @I + 4 --Size of central directory in bytes SET @CentralDirectoryOffset = 16777216 * SUBSTRING(@B, @I + 3, 1) + 65536 * SUBSTRING(@B, @I + 2, 1) + 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1) SET @I = @I + 4 --SET @I = @I + 2 --Comment length --SET @I = @I + n --Comment /* LOOP THRU Central Directory */ SET @I = @CentralDirectoryOffset + 1; --TSQL is 1-based SET @CurrentNumber = 1 WHILE @CurrentNumber <= @TotalNumber BEGIN SET @CurrentNumber = @CurrentNumber + 1 SET @I = @I + 4 --Signature(0x02014b50) SET @I = @I + 2 --Version made by SET @I = @I + 2 --Minimum version needed to extract SET @I = @I + 2 --Bit flag --Compression method SET @CompressionMethod = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1) SET @I = @I + 2 SET @I = @I + 2 --File last modification time(MS - DOS format) SET @I = @I + 2 --File last modification date(MS - DOS format) --CRC - 32 of uncompressed data SET @CRCOfUncompressedData = SUBSTRING(@B, @I, 4) SET @I = @I + 4 --Compressed size SET @CompressedSize = 16777216 * SUBSTRING(@B, @I + 3, 1) + 65536 * SUBSTRING(@B, @I + 2, 1) + 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1) SET @I = @I + 4 --Uncompressed size SET @UnompressedSize = SUBSTRING(@B, @I, 4) SET @I = @I + 4 --File name length(n) SET @FileNameLength = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1) SET @I = @I + 2; --Extra field length(m) SET @ExtraFieldLength = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1) SET @I = @I + 2; --File comment length(k) SET @FileCommentLength = 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1) SET @I = @I + 2; SET @I = @I + 2 --Disk number where file starts SET @I = @I + 2 --Internal file attributes SET @I = @I + 4 --External file attributes --Offset of local file header(from start of disk) SET @FileHeaderOffset = 16777216 * SUBSTRING(@B, @I + 3, 1) + 65536 * SUBSTRING(@B, @I + 2, 1) + 256 * SUBSTRING(@B, @I + 1, 1) + SUBSTRING(@B, @I, 1) SET @I = @I + 4 --File name SET @FileName = SUBSTRING(@B, @I, @FileNameLength) SET @I = @I + @FileNameLength SET @I = @I + @ExtraFieldLength --Extra field SET @I = @I + @FileCommentLength --File comment /* Skip file header then access file data */ --@I points at the next central directory item. Do not modify it SET @II = @FileHeaderOffset + 1 --TSQL is 1-based SET @II = @II + 4 --Signature(0x04034b50) SET @II = @II + 2 --Minimum version needed to extract SET @II = @II + 2 --Bit flag SET @II = @II + 2 --Compression method SET @II = @II + 2 --File last modification time(MS - DOS format) SET @II = @II + 2 --File last modification date(MS - DOS format) SET @II = @II + 4 --CRC - 32 of uncompressed data SET @II = @II + 4 --Compressed size SET @II = @II + 4 --Uncompressed size --File name length(n) SET @FileNameLength = 256 * SUBSTRING(@B, @II + 1, 1) + SUBSTRING(@B, @II, 1) SET @II = @II + 2; --Extra field length(m) SET @ExtraFieldLength = 256 * SUBSTRING(@B, @II + 1, 1) + SUBSTRING(@B, @II, 1) SET @II = @II + 2; SET @II = @II + @FileNameLength --File Name SET @II = @II + @ExtraFieldLength --Extra Field IF @CompressionMethod = 0 INSERT INTO @Result SELECT @FileName, SUBSTRING(@B, @II, @CompressedSize), 'Not compressed' ELSE IF @CompressedSize = 0 INSERT INTO @Result SELECT @FileName, NULL, 'Empty' ELSE --wrap compress data in GZIP header-footer and decompress INSERT INTO @Result SELECT @FileName, DECOMPRESS(0x1f8b0800000000000400 + SUBSTRING(@B, @II, @CompressedSize) + @CRCOfUncompressedData + @UnompressedSize), 'Success' END RETURN END
Here is an example of how to call GetZipContent.
DECLARE @DocumentContent VARBINARY(MAX) = ( SELECT BulkColumn FROM OPENROWSET(BULK N'C:\Temp\allo.docx', SINGLE_BLOB) AS Document ) SELECT Filename, CONVERT(XML, Content) AS XmlContent, Status FROM GetZipContent(@DocumentContent)
The result is:
Filename XmlContent Status ------------------------------ ------------------------------------------------------- ------- [Content_Types].xml <Types xmlns="http://schemas.openxmlformats.org/... Success _rels/.rels <Relationships xmlns="http://schemas.openxmlform... Success docProps/app.xml <Properties xmlns="http://schemas.openxmlformats... Success docProps/core.xml <cp:coreProperties xmlns:cp="http://schemas.open... Success word/_rels/document.xml.rels <Relationships xmlns="http://schemas.openxmlform... Success word/document.xml <w:document xmlns:wpc="http://schemas.microsoft.... Success word/fontTable.xml <w:fonts xmlns:mc="http://schemas.openxmlformats... Success word/settings.xml <w:settings xmlns:mc="http://schemas.openxmlform... Success word/styles.xml <w:styles xmlns:mc="http://schemas.openxmlformat... Success word/theme/theme1.xml <a:theme xmlns:a="http://schemas.openxmlformats.... Success word/webSettings.xml <w:webSettings xmlns:mc="http://schemas.openxmlf... Success
GetZipContent successfully uncompressed over 10k documents contained in my databases. It can certainly be enhanced to handle generic zip files but no one in its right mind would contemplate the idea of natively unzipping files on a SQL Server!