Binary objects and unstructured data
In the past, my opinion was that a BLOB, or Binary Large Object, really had no
place in a relational database. My experience with the image and binary data types is that they perform slowly and are a pain to work with in T-SQL. Operations on these BLOBs were usually performed in a separate business layer outside of SQL, which suggested to me that it made more sense to put things like files and images where they were designed to live - in the file system.
BLOBs tax relational database performance, and although there are many ways to
mitigate this, I still believe it is better to store a pointer to the data rather
than the data itself. However, sometimes you cannot get away from using a BLOB, or you inherit one from another architecture. The interesting thing about BLOBs is that they are slowly taking over the database world. The upside is that more tools are being provided to make manipulating this data easier.
This article will describe some of the new features of SQL Server 2005 and SQL Server 2008, and provide an example of pulling a BLOB out of an image column in SQL Server 2005.
Bacronyms and marketing speak
In my last article, Why SQL Server is a 4-letter word, I discussed acronyms. The term BLOB is a bacronym: an acronym constructed after the fact to cover up that it was really named after "The Blob that ate Cincinnati." The person who first applied the name Blob to a database object was Jim Starkey. He tells a hilarious, Dilbert-ish story here about the origins of the term and how marketers like acronyms.
Acronyms are professional and great for marketing buzz, yet they feel a bit impersonal to me. Would you rather read about the rich-media framework from Microsoft called WPF/E, or a cool new technology called Microsoft Silverlight? Which letters make you feel warmer and fuzzier when they are put together, I-B-M or A-p-p-l-e?
That's a lot of rows
A problem I had recently was with a third-party financial system. The vendor provides a proprietary database with a table containing over 300,000 records, 2.8GB of data, and one tiny clustered index on a unique identifier column. Of the 16 columns, an IMAGE column stores most of the data: each value holds a binary representation of 5,000 4-byte float or money values. A new table is created after each execution of the tool, with a unique identifier appended to the table name. The customer would like to have all of the data available for querying at any time after the table is loaded. Since this works out to 1,665,000,000 unfiltered rows per day (over 300,000 rows, each expanding to 5,000 values), there could be a problem with scalability and performance if the data is staged.
Not a lot for a cube
Developing an Analysis Services 2005 cube to aggregate the data would be the
recommended approach in this situation, since it could handle this scale of data and the combinations of queries thrown at it. However, the customer is most comfortable querying subsets of the data in SQL Server, and the timeframe does not allow for an OLAP-based approach. The interim solution is to build a parameterized stored procedure and table function in SQL Server 2005 to de-BLOB the BLOB and return a subset of the rows.
The T-SQL approach
In SQL Server 2000, there is an SDK utility called TextCopy that can manipulate
a BLOB within an image column; however, it runs outside the SQL Server environment. In SQL Server 2005 you can use the Import Column data flow task in Integration Services, but again, this happens outside of the Query Analyzer window. Within SQL Server itself, you can use the new VARBINARY(MAX) data type to convert the IMAGE data type into something workable.
Size up the Data
The first order of business is determining the size of the data column.
Since I know from legacy code that each value is 4 bytes long, I would like
to see the number of records stored in each column.
The function to use here is DATALENGTH.
SELECT CONVERT(VARBINARY(MAX), value) AS ColumnValue,
       DATALENGTH(value) AS Length,
       DATALENGTH(value) / 4 AS RecordCount
FROM [BigBlobTable]
Dividing the number by 4 gives us 5000 records in the column.
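For a row whose blob holds 5,000 of the 4-byte values, the Length and RecordCount columns would come back along these lines (the binary ColumnValue is omitted here; the figures simply illustrate the arithmetic):

Length       RecordCount
20000        5000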
The Deblobber Table Function
I am still looking for a good way to remove the WHILE loop, and the table
function too; however, this does the trick for now. (A rough set-based alternative is sketched after the code below.)
The Deblobber Code
CREATE FUNCTION [dbo].[Deblobber] ( @blob AS VARBINARY(MAX) )
RETURNS @Values TABLE ( Value MONEY )
AS
BEGIN
    DECLARE @length AS INT
    DECLARE @i AS INT
    SET @i = 1
    SET @length = DATALENGTH(@blob)
    -- Walk the blob 4 bytes at a time, casting each chunk to MONEY
    WHILE @i < @length
    BEGIN
        INSERT INTO @Values
        SELECT CAST(SUBSTRING(@blob, @i, 4) AS MONEY)
        SET @i = @i + 4
    END
    RETURN
END
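As a rough idea of how the WHILE loop might eventually be removed, here is a minimal set-based sketch. It assumes a helper numbers table, dbo.Numbers(n), populated with 1 through at least the largest expected record count; both that table and the function name are my own illustration, not part of the solution above.

-- Hypothetical set-based alternative to the WHILE loop.
-- Assumes a helper table dbo.Numbers(n) containing 1, 2, 3, ...
-- up to at least DATALENGTH(@blob) / 4.
CREATE FUNCTION [dbo].[DeblobberSetBased] ( @blob AS VARBINARY(MAX) )
RETURNS TABLE
AS
RETURN
(
    -- Each n maps to one 4-byte slice of the blob
    SELECT CAST(SUBSTRING(@blob, (n.n - 1) * 4 + 1, 4) AS MONEY) AS Value
    FROM dbo.Numbers AS n
    WHERE n.n <= DATALENGTH(@blob) / 4
)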
The Deblobber table function is used in the stored procedure below. Names have been changed to protect the innocent. In no way should these names be used in a professional environment.
The Select Statement Code
CREATE PROCEDURE [dbo].[GetBigBlob]
    (
      @TableID AS VARCHAR(MAX) = '17627',
      @PickListID AS VARCHAR(MAX) = '12021855,12233944'
    )
AS
    -- Create a link to the table BigBlobTable_17627
    IF EXISTS ( SELECT name FROM sys.synonyms WHERE name = N'BigBlob' )
        DROP SYNONYM [BigBlob]
    EXEC ( 'CREATE SYNONYM [BigBlob] FOR [BigBlobTable_' + @TableID + ']' )

    -- Turn off row counts for performance
    SET NOCOUNT ON;

    -- Define a Common Table Expression (CTE)
    WITH BigBlobCTE AS
    (
        -- Define an auto-incrementing row identifier
        SELECT ROW_NUMBER() OVER ( ORDER BY [UniqueID] ASC ) AS CounterID,
               CONVERT(VARBINARY(MAX), value) AS ColumnValue
        FROM BigBlob
        -- Use the COALESCE function to ignore a NULL parameter value
        -- Use a Split function to parse a comma-delimited list of parameters
        WHERE ( CHARINDEX(',', @PickListID) = 0
                AND COALESCE(@PickListID, UniqueID) = UniqueID )
           OR ( CHARINDEX(',', @PickListID) > 0
                AND UniqueID IN ( SELECT value FROM dbo.fnSplit(@PickListID, ',') ) )
    )
    -- Return values from the CTE
    SELECT resultValues.Value AS PnLValue,
           BigBlobCTE.CounterID AS [positionUID]
    FROM BigBlobCTE
    -- Join to the user-defined table function
    CROSS APPLY dbo.Deblobber(BigBlobCTE.ColumnValue) AS resultValues
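The procedure relies on a string-splitting helper, dbo.fnSplit, which is not shown in this article. A minimal sketch of such a function, written here purely for illustration and not the one used in the real system, might look like this:

-- Hypothetical sketch of the dbo.fnSplit helper referenced by GetBigBlob.
-- Splits a delimited VARCHAR list into one row per value.
CREATE FUNCTION [dbo].[fnSplit] ( @list VARCHAR(MAX), @delimiter CHAR(1) )
RETURNS @Items TABLE ( value VARCHAR(100) )
AS
BEGIN
    DECLARE @pos INT
    WHILE LEN(@list) > 0
    BEGIN
        SET @pos = CHARINDEX(@delimiter, @list)
        IF @pos = 0
        BEGIN
            -- No more delimiters; take the remainder as the last value
            INSERT INTO @Items SELECT @list
            SET @list = ''
        END
        ELSE
        BEGIN
            -- Take the next value and trim it off the front of the list
            INSERT INTO @Items SELECT LEFT(@list, @pos - 1)
            SET @list = SUBSTRING(@list, @pos + 1, LEN(@list))
        END
    END
    RETURN
END

Calling the procedure for a specific table and a comma-delimited list of identifiers then looks like this, using the default parameter values from the definition above:

EXEC [dbo].[GetBigBlob] @TableID = '17627', @PickListID = '12021855,12233944'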
The Results
For each blob, the procedure above returns a table of values, unions these tables
together, and returns them as a single result set joined back to the original data row.
The key SQL Server 2005 features demonstrated in this stored procedure are:
- Synonyms – friendly aliases for objects.
- Common Table Expressions (CTEs) – the WITH syntax presents a table expression above the SELECT statement rather than nesting it below in a derived table, which makes the query easier to read and lets the expression be reused.
- User-defined table function joins – the CROSS APPLY syntax allows table-valued functions to be joined to other tables.
- Dynamic row numbering – the ROW_NUMBER() function provides a dynamic row count over a given sort order.
- Large binary data type – VARBINARY(MAX) removes the 8,000-byte limitation of SQL Server 2000 and allows Large Objects (LOBs) to be manipulated within SELECT statements.
Problems
There are a few potential issues with this solution:
- Concurrency issues with the synonym mean the procedure can safely be run only once at a time.
- Pulling the data for a large number of records still poses a performance issue.
Dynamic SQL could solve the concurrency issue; however, it was determined that the
procedure will be run by a single user on an ad-hoc basis, so this is not a priority.
Performance issues should be mitigated because there is no longer a need to pull all
of the data; instead, the data can be joined and filtered.
Looking forward to SQL Server 2008
Microsoft is offering some potentially great enhancements to working with BLOB
data in SQL Server 2008, including the following:
- Integrated full-text search – no more separate service; it is built in.
- Sparse columns – providing the ability to create millions of BLOB columns.
- New index types – increasing the performance of querying.
- Data compression – shrinking the size of the data and providing various storage alternatives.
- And last but not least, a FILESTREAM data type in SQL Server 2008 (a brief sketch follows this list).
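As a small taste of the FILESTREAM attribute, a table definition might look something like the sketch below. This assumes a SQL Server 2008 instance with FILESTREAM enabled and a database that already contains a FILESTREAM filegroup; the table and column names are illustrative only.

-- Illustrative only: requires FILESTREAM to be enabled on the instance
-- and a FILESTREAM filegroup in the database.
CREATE TABLE dbo.Documents
(
    -- FILESTREAM tables require a unique ROWGUIDCOL column
    DocumentID   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    DocumentName NVARCHAR(260) NOT NULL,
    -- The BLOB itself is stored in the file system but queried as a column
    DocumentData VARBINARY(MAX) FILESTREAM NULL
)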
For more information on these new enhancements, take a look at the white paper
released in August, 2007, entitled
Managing Unstructured Data with SQL Server 2008.
Could it be that the SQL
Storage Team borrowed some unstructured data ideas from someone on the
Author's Profile
Andrew Sears is a Senior Solutions Consultant and Microsoft Certified Technology
Specialist (MCTS): Business Intelligence Developer at T4G Limited, with over 10 years
of experience in the Analytics group implementing Reporting, Data Warehousing, and
Business Intelligence solutions.
T4G is a leading full-service, project-based technology services company and a
member of the Microsoft Gold Certified Partner Excellence Program.
Andrew currently maintains various blogs on topics including Silverlight, PerformancePoint, and Business Intelligence Solutions.