There have been quite a few articles about the new CLR (.Net Common Language Runtime) that has been integrated into MS SQL Server 2005. Most of them focus on creating .Net-based user-defined functions such as Regular Expression functions (which, of course, is a great use of the CLR). However, an area that gets far less attention is the ability to create your own Aggregate functions and Types. There are probably three main reasons for this: 1) the complexity of creating and using them, 2) fewer instances of true need, and 3) impractical size limitations.
So, assuming you can find an occasion to use your own Type or Aggregate, you might very well run into a wall regarding the amount of data it can hold. This situation will not rear its ugly little head in simple operations such as an average, which only needs to hold two values--the SUM of the values and the COUNT of the values--but it will make itself quite the nuisance when an operation requires storing a larger set of data, such as an array.
For this article I will focus on creating a User-Defined Aggregate function to determine the Median value in a group. The Median value is defined as the middle value in an odd-numbered group or the average of the two middle values in an even-numbered group. To accomplish this you need the entire group of numbers at your disposal, since the median is a function of a value's position within the sorted set rather than a mathematically derived function. This presents a problem when working with the new CLR User-Defined Aggregates as they need to serialize their information to memory between each row being processed. As stated before, while this is not a problem when serializing an INT (for the COUNT of values) and a FLOAT (for the SUM of values) in order to compute an average, it is a problem when storing an array that will contain one entry for every row in the group. The maximum size for the serialized memory of a User-Defined Type or User-Defined Aggregate is 8000 bytes (sounds familiar, doesn't it?). Let's see what this means for our intended function: Median.
If we start with a simple set of five numbers (1, 5, 72, 1004, 32767) then the median value is 72, as it is simply the middle value. We can add a number to the set to have six numbers (1, 5, 72, 1004, 32767, 56123) and now the median is 538, which is the average of the two middle values: 72 and 1004. As you can see, we need the entire set of numbers in order to first sort them and then find the middle value (or two values). Now that we know what we need to accomplish, why is the 8000-byte limit a problem?
Well, if we need to store each value then we need to know how much space each value takes. A FLOAT datatype takes up 8 bytes. Why use FLOAT instead of INT? Because we don't know for certain which fields our Aggregate will be used on and we want the flexibility to use it on any numeric field. And even if INT were used it would still have the same space limitation; it just would not show up as quickly. So, we have 8000 bytes in which to fit an array of 8-byte values, giving us a maximum of 1000 values. However, a technicality of the binary serialization process requires that we store the count of how many values we are storing, since we cannot natively store an array. Therefore we first need to write an INT (well, Int32 in .Net) containing the total COUNT of values before we can store our 8-byte values. This leaves space for only 999 values. Hence, if we use this Median Aggregate over a set that contains 1000 or more values, the Aggregate function will throw an error. Can anyone guarantee that they will never have a grouping of rows that exceeds 999 values? I highly doubt it. So does that make trying to create this type of Aggregate useless? Maybe not.
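To make the arithmetic concrete, here is a quick sketch of that byte math (the class and constant names are mine, for illustration only):

using System;

class ByteMath
{
    static void Main()
    {
        const int MaxByteSize = 8000;              /* hard ceiling for UserDefined serialization */
        const int CountHeader = sizeof(int);       /* 4 bytes: the Int32 COUNT written first */
        const int BytesPerValue = sizeof(double);  /* 8 bytes per FLOAT (SqlDouble) value */

        int maxValues = (MaxByteSize - CountHeader) / BytesPerValue; /* (8000 - 4) / 8 = 999 */
        Console.WriteLine(maxValues); /* prints 999 */
    }
}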
One of the benefits of working in the .Net CLR world is that you have most of the .Net functionality at your disposal. This gives us the ability to use the compression that is built into the .Net Framework to try to squeeze as much as possible into the serialized binary data. The effectiveness of this method will vary with the data, due to the nature of compression; however, even with a relatively non-compressible dataset, compression will still help break the 999-record limit for most groups of data. Rather than give a piece-by-piece description of the Aggregate function, I will give the entire function at once, with comments in the code explaining what is going on.
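Before the full listing, here is a minimal standalone sketch of the DeflateStream round trip that the Write and Read methods below are built around (the helper class and method names are mine, for illustration only):

using System.IO;
using System.IO.Compression;

static class DeflateHelper
{
    /* compress a byte array; the "true" argument leaves the MemoryStream open so
     * it can be read after the DeflateStream is disposed (disposing flushes the data) */
    public static byte[] Compress(byte[] data)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (DeflateStream deflate =
                new DeflateStream(output, CompressionMode.Compress, true))
            {
                deflate.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    /* decompress by reading the DeflateStream byte by byte until it is exhausted */
    public static byte[] Decompress(byte[] data)
    {
        using (MemoryStream input = new MemoryStream(data))
        using (DeflateStream inflate = new DeflateStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            for (int b = inflate.ReadByte(); b != -1; b = inflate.ReadByte())
            {
                output.WriteByte((byte)b);
            }
            return output.ToArray();
        }
    }
}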
using System;                       /* contains: Serializable */
using System.Collections;           /* contains: ArrayList */
using System.Collections.Generic;   /* contains: definition of List */
using System.Data.SqlTypes;         /* contains: definition of SqlDouble (FLOAT in T-SQL) */
using System.IO;                    /* contains: BinaryReader and BinaryWriter */
using System.IO.Compression;        /* contains: DeflateStream (compress and decompress) */
using Microsoft.SqlServer.Server;   /* contains: SqlUserDefinedAggregate */

[Serializable]
[SqlUserDefinedAggregate(
    Format.UserDefined,              /* Native is used when storing only native datatype variables;
                                      * UserDefined is used when storing more than native datatype
                                      * variables; we are storing an Array so we have to use
                                      * UserDefined and then implement the Read and Write methods
                                      * (below) */
    IsInvariantToNulls = true,       /* whether or not NULL values affect the result;
                                      * in our case they do not so this method is invariant to NULLs */
    IsInvariantToDuplicates = false, /* whether or not duplicate values affect the result;
                                      * in our case they do so this is NOT invariant to duplicates */
    IsInvariantToOrder = true,       /* whether or not the order of the rows affects the result;
                                      * in our case it does not */
    IsNullIfEmpty = true,            /* whether or not an empty result set produces NULL or a value;
                                      * SUM produces NULL with no results but COUNT produces 0;
                                      * in our case Median will be NULL just like SUM and AVG */
    MaxByteSize = 8000,              /* 8000 is the max and we will use all of it */
    Name = "Agg_Median"              /* sets the name when used with the Deploy feature in
                                      * Visual Studio 2005 */
)]
public class Agg_Median : IBinarySerialize
    /* implement the IBinarySerialize interface since a "Format" of "UserDefined" was chosen.
     * This requires implementing the "Read" and "Write" methods. */
{
    private List<Double> ValueList; /* local variable for the entire set of values */

    public void Init()
    {
        this.ValueList = new List<Double>(); /* initialize the variable */
    }

    public void Accumulate(SqlDouble value)
        /* this method adds a new row's value to the existing list */
    {
        if (value.IsNull)
        {
            return; /* if this row's value IS NULL, do NOT add it to the internal list; just exit */
        }

        this.ValueList.Add(value.Value); /* this row's value is not NULL so add it to the
                                          * internal list */
    }

    public void Merge(Agg_Median other) /* combine data from parallel operations */
    {
        this.ValueList.AddRange(other.ValueList);
    }

    public SqlDouble Terminate() /* this method is called at the end to return the result */
    {
        double __Median = 0.0D; /* local variable to hold and calculate the final result */
        int __Count = 0;        /* local variable to hold the number of values in the list */

        this.ValueList.Sort();  /* arrange the list of values in order so we can determine
                                 * the middle value(s) */
        __Count = this.ValueList.Count; /* we need to know how many in order to find the middle */

        /* if there is at least one value we will calculate the Median, else we return a NULL FLOAT */
        if (__Count > 0)
        {
            if (__Count % 2 == 0) /* if even number of values, the middle will be two values */
            {
                /* average the two middle values */
                __Median = (((double)this.ValueList[(__Count / 2) - 1] +
                             (double)this.ValueList[(__Count / 2)]) / 2.0);
            }
            else /* there is an odd number of values so there is only one middle value */
            {
                __Median = (double)this.ValueList[Convert.ToInt32(Math.Floor(__Count / 2.0))];
            }

            /* send the final result back to SQL Server as a FLOAT (same as SqlDouble in .Net) */
            return new SqlDouble(__Median);
        }
        else
        {
            return SqlDouble.Null;
        }
    }

    public void Read(BinaryReader Serialized)
    {
        /* we need to implement the "Read" method since we chose a "Format" of "UserDefined"
         * in the main declaration above. The "Read" method gets passed a BinaryReader object
         * which we use to extract the serialized data out of memory. */

        BinaryReader __FromSerialize; /* variable to hold the decompressed serialized binary data */

        /* first read all of the serialized, compressed, binary data into a MemoryStream since
         * DeflateStream reads from a MemoryStream and not a BinaryReader */
        using (MemoryStream __MemoryStreamIn = new MemoryStream(
            Serialized.ReadBytes(Convert.ToInt32(Serialized.BaseStream.Length))))
        {
            /* create a DeflateStream which will decompress the data now in the MemoryStream */
            using (DeflateStream __DecompressStream =
                new DeflateStream(__MemoryStreamIn, CompressionMode.Decompress))
            {
                /* create another MemoryStream which will hold the decompressed binary data */
                using (MemoryStream __MemoryStreamOut = new MemoryStream())
                {
                    /* byte by byte, read the decompressed binary data into the MemoryStream */
                    for (int __Byte = __DecompressStream.ReadByte();
                         __Byte != -1;
                         __Byte = __DecompressStream.ReadByte())
                    {
                        __MemoryStreamOut.WriteByte((byte)__Byte);
                    }

                    /* create a BinaryReader to extract the data from the MemoryStream */
                    __FromSerialize = new BinaryReader(__MemoryStreamOut);

                    /* set the position from which to start reading to the beginning */
                    __FromSerialize.BaseStream.Position = 0;

                    /* read the number of values in the list, which is stored as the first few
                     * bytes of the binary data. Remember, the binary data does not contain its
                     * own datatypes; it is just a series of bytes that we have to know ahead of
                     * time how to read. Since we created the binary stream in the "Write" method,
                     * we know that the first 4 bytes are a 32-bit Integer (same as the INT
                     * datatype in SQL), so we call the ReadInt32() method to get that value. */
                    int __CountDoubles = __FromSerialize.ReadInt32();

                    /* if the number of values is -1 then there is nothing to do */
                    if (__CountDoubles > -1)
                    {
                        /* create the list of values, setting the initial capacity to the
                         * number of values just extracted */
                        this.ValueList = new List<Double>(__CountDoubles);

                        /* use the number of values that we just extracted to loop through the
                         * series of bytes and retrieve exactly that many values, each into its
                         * own entry in the list */
                        for (int __Index = 0; __Index < __CountDoubles; __Index++)
                        {
                            this.ValueList.Add(__FromSerialize.ReadDouble());
                        }
                    }
                }
            }
        }
    }

    public void Write(BinaryWriter ToBeSerialized)
    {
        /* we need to implement the "Write" method since we chose a "Format" of "UserDefined"
         * in the main declaration above. The "Write" method gets passed a BinaryWriter object
         * which we use to store the serialized data into memory. */

        MemoryStream __ToCompress = new MemoryStream(); /* MemoryStream that will hold the binary
                                                         * data that we will eventually compress */
        BinaryWriter __BinaryWriter = new BinaryWriter(__ToCompress);
            /* the BinaryWriter allows writing data to the MemoryStream just created */

        __BinaryWriter.Write(this.ValueList.Count);
            /* first record the number of entries in our list into the binary data. This allows
             * the "Read" method to retrieve this value first so that it can then read the
             * correct number of records out of the binary data. */

        this.ValueList.Sort();
            /* next, sort the data in the list before writing to the BinaryWriter. This is a huge
             * help when compressing data as it allows the compression algorithm to find patterns
             * more easily. It will not always be possible to include this Sort() since some
             * functions require a particular order of values. For example, if creating a
             * User-Defined Type to emulate an Array, the order of the Array is always important
             * so it cannot be sorted. Fortunately for the Median function the order in which
             * the values were added does not matter (remember the IsInvariantToOrder = true
             * option in the main declaration at the top). */

        /* cycle through the list of values and place each one into the MemoryStream using
         * the BinaryWriter */
        foreach (double __TempDouble in this.ValueList)
        {
            __BinaryWriter.Write(__TempDouble);
        }

        /* create a new MemoryStream (used temporarily) that will hold the compressed version
         * of the binary data from the MemoryStream we used previously */
        using (MemoryStream __ToSerialize = new MemoryStream())
        {
            /* create a temporary DeflateStream that compresses into the MemoryStream we just
             * created; the "true" argument leaves __ToSerialize open after the DeflateStream
             * is disposed so we can still read from it */
            using (DeflateStream __CompressStream =
                new DeflateStream(__ToSerialize, CompressionMode.Compress, true))
            {
                /* convert the initial binary MemoryStream into the format (a byte Array) that
                 * the compression functionality wants to read */
                byte[] __TempArray = __ToCompress.ToArray();

                /* take the converted MemoryStream contents and compress them */
                __CompressStream.Write(__TempArray, 0, __TempArray.Length);
            }

            /* the final result is the compressed MemoryStream that we now write into memory
             * via the object passed into this method */
            ToBeSerialized.Write(__ToSerialize.ToArray());
        }
    }
}
In order to run this code, you first need to compile it into a DLL, then load that DLL into SQL Server 2005 as an Assembly. Finally, you create the User-Defined Aggregate that points to the Assembly you just loaded.
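For reference, compiling from the command line might look something like this (assuming the source file is saved as Agg_Median.cs; the path to csc.exe varies by .Net Framework version):

csc.exe /target:library /out:Agg_Median.dll Agg_Median.cs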
Or, simply run this script, which will create the Assembly and then create the User-Defined Aggregate for you so that you can move on to using it 8^):
CreateAssemblyAndAggregate.sql
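For reference, such a script boils down to something like the following sketch (the assembly name and file path are illustrative and will differ in your environment):

CREATE ASSEMBLY [MedianAssembly]  -- illustrative name for the loaded assembly
FROM 'C:\Agg_Median.dll'          -- hypothetical path to the DLL you compiled
WITH PERMISSION_SET = SAFE;
GO

CREATE AGGREGATE dbo.Agg_Median (@Value FLOAT)
RETURNS FLOAT
EXTERNAL NAME [MedianAssembly].[Agg_Median];
GO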
Now that you have this created, let's test it out. Try running the following, which is based on the two sets of values mentioned towards the beginning:
SELECT dbo.Agg_Median(list.val)
FROM (
SELECT 1 AS 'val'
UNION ALL
SELECT NULL AS 'val'
UNION ALL
SELECT 2 AS 'val'
UNION ALL
SELECT 32767 AS 'val'
UNION ALL
SELECT 72 AS 'val'
UNION ALL
SELECT 1004 AS 'val'
) list
-- 72
SELECT dbo.Agg_Median(list.val)
FROM (
SELECT 1 AS 'val'
UNION ALL
SELECT NULL AS 'val'
UNION ALL
SELECT 2 AS 'val'
UNION ALL
SELECT 32767 AS 'val'
UNION ALL
SELECT 72 AS 'val'
UNION ALL
SELECT 1004 AS 'val'
UNION ALL
SELECT 56123 AS 'val'
) list
-- 538
Now that you have seen the basics of the Median function, let's see if we can get more than 999 values into it:
SELECT CardType,
COUNT(*) AS 'RecordsPerCardType',
COUNT(DISTINCT ExpMonth) AS 'UniqueMonthsPerCardType',
dbo.Agg_Median(ExpMonth) AS 'MedianMonth'
FROM AdventureWorks.Sales.CreditCard
GROUP BY CardType
CardType      | RecordsPerCardType | UniqueMonthsPerCardType | MedianMonth
ColonialVoice | 4782               | 12                      | 7
Distinguish   | 4832               | 12                      | 7
SuperiorCard  | 4839               | 12                      | 7
Vista         | 4665               | 12                      | 6
Obviously we have broken the 999-record limit: in the above results we can see four groups containing between 4665 and 4839 records each. However, while compression has allowed us to break the limit, it has not removed the limit entirely. Compression works best by finding patterns in data; the fewer the patterns, the less the compression. If a group of data contains only a few unique values, then many thousands of records can be processed. However, if most of the records are unique, that will severely limit the effectiveness of the compression. For example, we can slightly modify the query above to work across the entire set of CreditCards in that table:
SELECT COUNT(*) AS 'TotalRecords',
COUNT(DISTINCT ExpMonth) AS 'UniqueMonths',
dbo.Agg_Median(ExpMonth) AS 'MedianMonth'
FROM AdventureWorks.Sales.CreditCard
TotalRecords | UniqueMonths | MedianMonth
19118        | 12           | 7
We just operated over a group of more than 19,000 records using .Net compression. That is well above the native 999-record limit, but we can also see that there are only 12 unique values (to be expected when dealing with months), and not all datasets are as "boring". Let us try another table from the AdventureWorks database that has a lot more rows:
SELECT COUNT(*) AS 'TotalRecords',
COUNT(DISTINCT UnitPrice) AS 'UniquePrices',
AVG(UnitPrice) AS 'AveragePrice'
FROM AdventureWorks.Sales.SalesOrderDetail
TotalRecords | UniquePrices | AveragePrice
121317       | 287          | 465.0934
121,000-plus records is quite a lot, and we can see that the built-in AVG function works just fine. Will our new Median function?
SELECT COUNT(*) AS 'TotalRecords',
COUNT(DISTINCT UnitPrice) AS 'UniquePrices',
dbo.Agg_Median(UnitPrice) AS 'MedianPrice'
FROM AdventureWorks.Sales.SalesOrderDetail
A .NET Framework error occurred during execution of user-defined routine or aggregate "Agg_Median":
System.Data.SqlTypes.SqlTypeException: The buffer is insufficient. Read or write operation failed.
System.Data.SqlTypes.SqlTypeException:
at System.Data.SqlTypes.SqlBytes.Write(Int64 offset, Byte[] buffer, Int32 offsetInBuffer, Int32 count)
at System.Data.SqlTypes.StreamOnSqlBytes.Write(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.BinaryWriter.Write(Byte[] buffer)
at Agg_Median.Write(BinaryWriter ToBeSerialized)
Unfortunately, for a dataset of this size with this many unique values, the Median function cannot work over the entire dataset even with compression. Note in the above stack trace that the error occurred in the Write method (where we were serializing the compressed array of Doubles): there was not enough buffer space. There is not much we can do at this point since the data has already been compressed as much as possible and, no matter what we do, we still have that 8000-byte limit. But we were at least able to stretch it as far as it could go. How many records can we get with this particular dataset?
SELECT COUNT(*) AS 'TotalRecords',
COUNT(DISTINCT UnitPrice) AS 'UniquePrices',
dbo.Agg_Median(UnitPrice) AS 'MedianPrice'
FROM AdventureWorks.Sales.SalesOrderDetail
WHERE SalesOrderId < 57123
TotalRecords | UniquePrices | MedianPrice
62250        | 281          | 200.052
We now have over 62,000 records while the dataset has 281 unique values in it. That is roughly 62 times more data than we could have worked with had we not used compression. Hopefully this stretches the limit enough to work in the situations that call for a CLR User-Defined Aggregate or User-Defined Type. If not, then at least you have a better understanding of what you can really do with the .Net CLR.
You can get the Median function and over 50 others by downloading a library that I created called SQL# (yes, like C# but for T-SQL). The majority of the functions are free and, as time permits, I am always adding more. You can download it from the SQL# website at: http://www.SQLsharp.com/ .