Introduction
Recently, I noticed a lot of questions about how SQL aggregate functions
handle NULL values on a discussion board for an article here at SQL Server
Central Understanding the difference between IS NULL and =NULL discussion board. This article was written to help answer some of those questions by explaining how SQL aggregate functions handle NULLs, and to describe some other "gotcha's" when dealing with SQL aggregate functions. Each section of the article ends with a summary of "gotcha's" to look out for when using these function.
The ANSI SQL-92 Standard defines five aggregate functions (which are also referred to in the SQL-92 standard as set functions.) SQL NULLs receive special internal handling that the database designer/programmer needs to be aware of when using these functions. This article describes how ANSI SQL-92 aggregate functions handle NULLs, per the ANSI SQL-92 standard.
I reference the ANSI SQL-92 Standard throughout the article, and the sample code requires the Microsoft SQL Server Northwind Sample Database to be run. All samples were run on SQL Server 2000.
Aggregate Functions
ANSI SQL-92 defines five aggregate functions. For all set functions that specify a column name, the following process takes place behind the scenes:
- The search expression (the WHERE clause) is applied.
- The value expression (scalar functions on the column like COALESCE())
is applied to each row of the results of Step 1.
- NULL values are eliminated from our results*.
*NOTE: COUNT(*) is a special case, which we describe in detail below.
ANSI SQL-92 specifies that a warning message will be returned when NULLs are eliminated by set functions. You can turn on
this option by using the SET ANSI_WARNINGS ON command. The SET ANSI_WARNINGS ON option returns the following error message
"Warning: Null value is eliminated by an aggregate or other set operation."
The SET ANSI_WARNINGS ON also affects
how SQL Server handles division by zero, arithmetic overflow and character/binary data truncation behavior. Before changing
your SET ANSI_WARNINGS option, be sure to reference the following MSDN article:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp
Gotcha: SET ANSI_WARNINGS ON affects several aspects of SQL Server behavior. Be sure to check http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp before changing this option. |
COUNT()
The COUNT() set function has two forms:
- COUNT(column) – Returns the count of rows in a table, less NULLs
- COUNT(*) – Returns the cardinality (total number of rows) of a table
The following code demonstrates the difference between COUNT(column) and COUNT(*):
SELECT COUNT(*) AS TotalRows, COUNT(region) AS NonNULLRows, COUNT(*) - COUNT(region) AS NULLRows FROM [Northwind].[dbo].[suppliers]
This query returns the following result:
TotalRows | NonNULLRows | NULLRows |
---|---|---|
29 | 9 | 20 |
- TotalRows is COUNT(*), the cardinality (total number of rows) of the suppliers table
- NonNULLRows is COUNT(region), the total number of rows of the suppliers table in which region is not NULL
- NULLRows is COUNT(*) – COUNT(region), the difference between the cardinality of the suppliers table and the number
of rows in which region is not NULL; basically this returns the number of rows
in which region *is* NULL.
The fact that COUNT(*) and COUNT(column) treat NULLs differently can be used to your advantage in
many situations such as calculating the NULLRows column above.
Notice that we can get the NULLRows count from above with the following query as well:
SELECT COUNT(*) AS NULLRows FROM [Northwind].[dbo].[suppliers] WHERE region IS NULL
This will not work if you specify the region column in your count, however.
Here we'll take a look at why it is important to understand the order in which
SQL performs the steps in a SELECT statement that contains an aggregate
function. This query returns zero as the NULLRows count:
SELECT COUNT(region) AS NULLRows FROM [Northwind].[dbo].[suppliers] WHERE region IS NULL
Why does this query return zero? Remember, by definition COUNT(column) applies the
value expression, the search expression and
then eliminates NULLs before giving you a final count! Here's a play-by-play of what happens in our example:
- The WHERE clause eliminates every row in which region is not NULL.
- The value expression in this example is simply a column name - region.
- The COUNT(region) function eliminates every row in which region is NULL.
- Since we've explicitly eliminated all non-NULL columns with the WHERE clause and implicitly eliminated all NULL
columns with the COUNT(region) statement, we end up with a zero rows.
Gotcha: COUNT(column) eliminates NULL values; COUNT(*) returns the cardinality (total number of rows) for an entire table. The order in which SQL performs the steps in aggregate calculations can have implications for your results. |
SUM()
SUM(), like the other set functions, applies the search expression,
then the value expression, eliminates NULLs and finally returns a sum of the
remaining rows. It is important to note that SUM() excludes NULLs, as NULL scalar arithmetic normally returns NULL. For
instance, let's do a scalar addition of all the values in the reportsto column of the employees table:
SELECT 2 + NULL + 2 + 2 + 2 + 5 + 5 + 2 + 5
This SELECT statement will return NULL, since NULL plus anything always returns NULL. If we use the SUM() function on
the table, however, we can rest assured that NULLs are eliminated from our
result:
SELECT SUM(reportsto) AS RowSum FROM [Northwind].[dbo].[employees]
The result of this query:
RowSum |
---|
25 |
Gotcha: NULLs are eliminated by the SUM() function; in scalar arithmetic, however, NULL added to anything equals NULL. |
AVG() NULL-handling
The AVG() function returns the average of the column, excluding NULLs. There are a couple of "gotcha's"
concerning the AVG() function. The first "gotcha" is that NULLs are excluded, as with other set functions,
so your answer may not be what you expect. The second "gotcha" is that, by definition, AVG() returns a number
of the same scale as the column it is performed on. Again, this could throw your answer off a bit.
We'll tackle the NULL issue first. AVG() specifically excludes NULLs when it calculates averages. The following
query divides the sum of the reportsto column by the count of non-NULL rows. In our example, one row is excluded from the
calculation:
SELECT AVG(reportsto) AS ColAvg FROM [Northwind].[dbo].[employees]
Our result is:
ColAvg |
---|
3 |
This is the same result as if we performed this query:
SELECT SUM(reportsto)/COUNT(reportsto) FROM [Northwind].[dbo].[employees]
So how does excluding NULLs affect our average? We can use COALESCE() function to see.
SELECT AVG(COALESCE(reportsto, 0)) AS ColAvg FROM [Northwind].[dbo].[employees]
The COALESCE(reportsto, 0) statement forces AVG() to count NULL rows as zeroes; forcing every single row to be
counted. This gives us the following result:
ColAvg |
---|
2 |
We can get the same result using the SUM() and COUNT() set functions:
SELECT SUM(reportsto)/COUNT(*) AS ColAvg FROM [Northwind].[dbo].[employees]
ColAvg |
---|
2 |
Here we are performing a SUM() on the rows that are not NULL, then dividing by the total number of rows, effectively
treating the NULL rows as zeroes in our average. In this instance we're dealing with smaller numbers and only nine rows. With
larger numbers and more rows the difference is more pronounced.
So which method should you use? That depends on the result you're looking for. Consider an example of calculating the
average payroll for a department. Suppose you have the following table:
Name | Weekly Pay |
---|---|
Joe M. | 400.00 |
Lisa D. | 500.00 |
Our average is $450.00 per week. Now let's say that we hire a new guy, but his weekly pay has not been put into the system
yet. Now we have:
Name | Weekly Pay |
---|---|
Joe M. | 400.00 |
Lisa D. | 500.00 |
Tom J. | NULL |
If we include Tom's missing salary in our average (treat it like 0.00), our average goes down to $300.00 per week. If we
leave Tom out of it for now – the default for AVG() – we get $450.00 per week again.
So which method is correct? Ignore NULLs or treat them as zeroes in your averages? That depends entirely on your
objectives and corporate policies. The main thing is to be consistent in your calculations. If you ignore NULLs in calculating
averages, use that same methodology every time; otherwise your results will not be comparable from
year to year, quarter to quarter or even day to day.
Gotcha: AVG() eliminates NULLs when calculating your average. |
AVG() Scale
The second issue is scale. In our examples, the averages we received were calculated and returned with the same scale
(digits after the decimal point) as the column; in this case the column is an INT data type, so no decimals were returned.
We'll apply the CAST operator to see what would have been returned if we were using a floating-point data type:
SELECT CAST(AVG(reportsto) AS FLOAT) AS ColAvg FROM [Northwind].[dbo].[employees]
ColAvg |
---|
3.0 |
So far, not much has changed. We received 3 previously and 3.0 this time. So let's look at the equivalent average using
SUM() and COUNT():
SELECT CAST(SUM(reportsto) AS FLOAT)/COUNT(reportsto) AS ColAvg FROM [Northwind].[dbo].[employees]
ColAvg |
---|
3.125 |
Notice that we are forcing floating point precision now by casting the
reportsto column as a FLOAT. We can get this same result by moving the CAST inside the AVG() function, like this:
SELECT (AVG(CAST(reportsto AS FLOAT))) AS ColAvg FROM [Northwind].[dbo].[employees]
So how did we end up with a different value the second time? Since AVG() calculates using the same scale as the
column data type, our INT column average is calculated with no decimal places. In the first example, we CAST the result
of the AVG() function to a FLOAT type, but the AVG() was calculated as an INT first.
In the second example, we CAST the SUM() of the reportsto column to a floating point type, and then
divided; so we
didn't lose any digits after the decimal point in the calculation. As shown, we can get the same result by using CAST on the column inside the
AVG() function. This is particularly important to note in financial
and scientific calculations, where the scale is particularly important to our
calculations.
If your scale requirements are greater than the scale of your column, you might need to use the
SUM(column)/COUNT(column) method of calculating your average and CAST it to a floating or fixed-point type,
or CAST inside of the AVG() function.
Gotcha: AVG() also uses the precision (total number of digits) and scale (number of digits after the decimal point) of the column you are performing the AVG() on. |
MIN(), MAX()
The MIN() and MAX() functions return the minimum and maximum values of a column, respectively. Like the other
set functions, these also eliminate NULLs before returning a result.
For numeric data types, MIN() and MAX() simply return the minimum or maximum number in the column. For
character data types, the values are compared using standard SQL form: the shorter of the values is right-padded with
'pad characters' (usually spaces). This means that character columns with trailing spaces are mixed in with character columns
without trailing spaces. As an example:
IF 'hi' = 'hi ' PRINT 'Equal' ELSE PRINT 'Not Equal'
Depending on your collation settings, string comparisons might be case-sensitive or case-insensitive.
Gotcha: NULLs are eliminated by MIN() and MAX() functions. For character data types, shorter values are right-padded with 'pad characters' (often spaces) before comparing them. |
Conclusion
SQL set functions are very useful in calculating basic statistical values. By knowing how SQL set functions handle NULL
values you can ensure that you always get accurate, high-quality results.
Further Reading
- MSDN, "SET ANSI_WARNINGS". http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp
- ANSI SQL-92 Standard (Draft). http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt