Aggregate Queries
The use of aggregate functions in queries is a popular and powerful feature of the SQL
language. It takes some practice to become proficient in their use.
Query writers are frequently confused about the difference between the WHERE
clause and the HAVING clause or have trouble figuring out which columns to include in the
GROUP BY clause. This article will explain how to write queries using aggregates and, hopefully,
clear up the confusion. It is intended to cover the basics of using aggregate expressions and does
not cover the advanced features.
What are aggregate functions?
Aggregate functions are used to summarize data in queries. They usually work on groups of data,
however, in some cases they will work on the entire table. The most commonly used aggregate
functions are AVG, COUNT, MIN, MAX and SUM. Keep the following in mind when working with these aggregate functions.
- The functions AVG and SUM will only operate on numeric columns.
- The functions MIN and MAX will work on numeric, character and date columns.
- The COUNT function can operate on any column except for text, ntext or image columns.
- COUNT can be used with an asterisk (*) to give the count of the rows by group or entire set of results.
- The aggregate functions ignore NULL values.
- The DISTINCT argument, when used within an aggregate expression, will cause it to operate on non-null unique values.
The following query from the Northwind database finds the count of orders and the maximum
freight grouped by EmployeeID. Rows are filtered out of the results if the maximum freight
of a group is less than 800 or the ShippedDate of an order IS NULL.
SELECT COUNT(*), MAX(Freight), EmployeeID FROM Orders WHERE ShippedDate IS NOT NULL GROUP BY EmployeeID HAVING MAX(Freight) >= 800 ORDER BY EmployeeID
The building blocks
Let’s take a look at the parts, known as clauses, of an aggregate query. I have always found
it is easier to write the query if I figure out each clause before moving on to the next one. This
step-by-step approach may work for you, too.
The first part of the statement is the
SELECT clause. It lists the columns and aggregate expressions that are returned to the client
application. Make sure you only include columns that you really need in the results, because
the SELECT clause directly affects which columns will be required in the GROUP BY clause.
Keep these facts in mind when building the SELECT clause.
- You can use columns, literal values, aggregate expressions and other expressions in the SELECT clause.
- Any columns listed that are not part of an aggregate expression will be used to group the results when you get to the GROUP BY clause.
The next part of the statement is the FROM clause. This clause specifies the tables and
or views from which the data is queried. The tables and views are joined using the usual JOIN syntax that is used for any other T-SQL statement. Refer to Books Online if you need to learn more about how to join tables.
The WHERE clause is a very important part of an aggregate query. By using the WHERE clause you
can eliminate some of the rows before the rows are grouped and the aggregate functions do their work. You can specify
any of the available columns from the tables or views in the WHERE clause even if they haven't been
used. Notice that the
sample query above is filtering on ShippedDate which is not used anywhere else in the query.
You are, however, restricted from including any aggregates in the WHERE clause. This makes sense when you consider that the WHERE clause is processed by SQL before the aggregate functions. Remember these rules when building the WHERE clause:
- Any column can be used in the WHERE clause.
- Nonaggregate expressions can be used in the WHERE clause.
The GROUP BY clause is one that query writers struggle with quite often. As its name suggest,
it is used to group the data. The aggregate functions are applied to the groups. If a GROUP BY
is not used in the query, then the aggregate functions are applied to the entire set of rows returned.
Remember, the WHERE clause may be used to filter out some of the rows from the table first.
You must include all columns in the GROUP BY clause that are used in the SELECT clause
unless the column
is part of an aggregate expression. One of my college professors insisted that the
opposite was true,
that you have to include in the SELECT clause any columns listed in the GROUP BY clause.
While it is true that the columns in the two clauses usually match, it is possible to list a column
in the GROUP BY clause without listing it in the SELECT clause. I can’t think of any good reason to
do this, however. I will include an example query demonstrating this technique and a better way to write
the same query later in the article.
If you don’t want to group on a column, don’t list it in the SELECT clause.
In the example query above, you can not list information about individual orders since
you are grouping the orders by EmployeeID. If you included the OrderID column in the SELECT
clause, for example, you would then have to list the column in the GROUP BY clause. Your
summary information would no longer apply to EmployeeID but to the individual orders, which wouldn’t be a summary at all.
There are cases where you will need to list the details about the row that satisfies certain
criteria based on an aggregate function, for example, the last order placed by each customer.
An sample query later in the article will show how to do this using a derived table.
Remember these rules when writing the GROUP BY clause:
- The results will be grouped by the columns listed in the GROUP BY clause.
- All columns found in the SELECT clause, including those used in nonaggregate expressions, must be listed in the GROUP BY clause.
The GROUP BY clause also has some optional arguments: ALL, WITH ROLLUP and WITH CUBE. These will not
be covered in this article.
The HAVING clause is used to filter rows after the grouping has been applied, but before
the results are returned to the client. You can only include aggregate expressions and
columns that are listed in the GROUP BY clause. It is possible to include different aggregate functions than those
in the SELECT clause. In the example query above, you could write a query using the HAVING
clause to include only employees whose orders' average freight was above a certain amount.
You could also include the EmployeeID column in the HAVING clause since it is listed in the GROUP BY clause. Chances are, though, that it would make more sense to filter the EmployeeID column in the WHERE clause instead.
- Any column or expression listed in the GROUP BY clause may be included in the HAVING clause.
- Filter aggregate results using the HAVING clause, not the WHERE clause.
The last clause is the ORDER BY clause. You might think that the ORDER BY clause would not be necessary -- that SQL would order the results according to the GROUP BY columns, but this is not the case. If you want the results to sort according to the groups, the columns must be specified in the ORDER BY clause. The ORDER BY clause has the same rules as the SELECT clause, only aggregates and columns listed in the GROUP BY are allowed.
- Any column or expression listed in the GROUP BY clause may be included in the ORDER BY clause.
- The ORDER BY is required if you want ordered results.
Sample queries using Northwind
This query joins two tables, orders and order details. It returns the OrderID column along with the count of line items in each order if they were shipped via method 3 and if they have at least two line items. Finally, the results are ordered by the OrderID column:
SELECT o.OrderID, COUNT(*) AS Lines FROM orders o JOIN [order details] od ON o.OrderID = od.OrderID WHERE ShipVia = 3 GROUP BY o.OrderID HAVING COUNT(*) > 1 ORDER BY o.OrderID
You can’t update a column using an aggregate function directly, but you can use a derived table to product
the desired results.
First, run this statement to add a column to the Employees table in Northwind:
ALTER TABLE Employees ADD ORDER_COUNT INT NULL
The following query will generate an error:
UPDATE e SET ORDER_COUNT = COUNT(*) FROM Employees e JOIN Orders o ON e.EmployeeID = o.EmployeeID
This update query using a derived table shown in blue works fine:
UPDATE e SET ORDER_COUNT = ORDERCOUNT FROM Employees e JOIN (SELECT COUNT(*) AS ORDERCOUNT, EmployeeID FROM ORDERS GROUP BY EmployeeID) o ON e.EmployeeID = o.EmployeeID
This example uses a couple of expressions. Notice that they do not
need to be included in the GROUP BY clause because they do not use any of the table's columns:
SELECT getDate() as CurrentDate, 1 + 2 as Add1Plus2, EmployeeID, count(*) as ORDERCOUNT FROM Orders GROUP BY EmployeeID
The query below will run without error because the column, OrderDate, used in the
expressions is listed in the GROUP BY clause. It does not, however, give the expected results:
SELECT COUNT(*) AS OrderCount, MONTH(OrderDate) AS OrderMonth, YEAR(OrderDate) AS OrderYear FROM Orders GROUP BY OrderDate ORDER BY OrderDate
To get the correct results, a summary by year and month, the query must be grouped by the actual expressions, not just
the column:
SELECT COUNT(*) AS OrderCount, MONTH(OrderDate) AS OrderMonth, YEAR(OrderDate) AS OrderYear FROM Orders GROUP BY YEAR(OrderDate), MONTH(OrderDate) ORDER BY YEAR(OrderDate), MONTH(OrderDate)
It is possible to list a column in the GROUP BY clause and not in the SELECT clause. Here is a query
that runs but doesn't really return any useful information:
SELECT MAX(OrderID) AS lastOrder FROM Orders GROUP BY CustomerID
There may be times when this technique can be used with a sub-query to solve a particular
problem. In this example, I am listing more information about the latest order for each customer:
SELECT OrderID, CustomerID, OrderDate, EmployeeID, ShipVia, Freight, ShipName FROM Orders WHERE OrderID in ( SELECT MAX(OrderID) FROM Orders GROUP BY CustomerID)
The query could be better written with a derived table:
SELECT OrderID, o.CustomerID, EmployeeID, OrderDate, ShipVia, Freight, ShipName FROM Orders o JOIN (SELECT MAX(OrderID) as MaxOrder, CustomerID FROM orders GROUP BY CustomerID) m ON o.CustomeriD = m.CustomerID WHERE OrderID = MaxOrder
Conclusion
Writing aggregate queries just takes a little practice once you understand the rules.
Remember to first figure out what you need to see in the results (SELECT) and which tables or
views the data will come from (FROM). Filter out any rows you don’t want included at all (WHERE).
Determine how the results will be grouped (GROUP BY) by looking at the SELECT clause.
Decide if you need to filter based on the groups using aggregate functions (HAVING).
Finally, sort the results (ORDER BY). Often a derived table will help solve more complex problems.