Aggregate Queries

The use of aggregate functions in queries is a popular and powerful feature of the SQL

language. It takes some practice to become proficient in their use.

Query writers are frequently confused about the difference between the WHERE

clause and the HAVING clause or have trouble figuring out which columns to include in the

GROUP BY clause. This article will explain how to write queries using aggregates and, hopefully,

clear up the confusion. It is intended to cover the basics of using aggregate expressions and does

not cover the advanced features.

What are aggregate functions?

Aggregate functions are used to summarize data in queries. They usually work on groups of data,

however, in some cases they will work on the entire table. The most commonly used aggregate

functions are AVG, COUNT, MIN, MAX and SUM. Keep the following in mind when working with these aggregate functions.

The functions AVG and SUM will only operate on numeric columns.
The functions MIN and MAX will work on numeric, character and date columns.
The COUNT function can operate on any column except for text, ntext or image columns.
COUNT can be used with an asterisk (*) to give the count of the rows by group or entire set of results.
The aggregate functions ignore NULL values.
The DISTINCT argument, when used within an aggregate expression, will cause it to operate on non-null unique values.

The following query from the Northwind database finds the count of orders and the maximum

freight grouped by EmployeeID. Rows are filtered out of the results if the maximum freight

of a group is less than 800 or the ShippedDate of an order IS NULL.

SELECT COUNT(*), MAX(Freight), EmployeeID
FROM Orders
WHERE ShippedDate IS NOT NULL
GROUP BY  EmployeeID
HAVING MAX(Freight) >= 800
ORDER BY EmployeeID

The building blocks

Let’s take a look at the parts, known as clauses, of an aggregate query. I have always found

it is easier to write the query if I figure out each clause before moving on to the next one. This

step-by-step approach may work for you, too.

The first part of the statement is the

SELECT clause. It lists the columns and aggregate expressions that are returned to the client

application. Make sure you only include columns that you really need in the results, because

the SELECT clause directly affects which columns will be required in the GROUP BY clause.

Keep these facts in mind when building the SELECT clause.

You can use columns, literal values, aggregate expressions and other expressions in the SELECT clause.
Any columns listed that are not part of an aggregate expression will be used to group the results when you get to the GROUP BY clause.

The next part of the statement is the FROM clause. This clause specifies the tables and

or views from which the data is queried. The tables and views are joined using the usual JOIN syntax that is used for any other T-SQL statement. Refer to Books Online if you need to learn more about how to join tables.

The WHERE clause is a very important part of an aggregate query. By using the WHERE clause you

can eliminate some of the rows before the rows are grouped and the aggregate functions do their work. You can specify

any of the available columns from the tables or views in the WHERE clause even if they haven't been

used. Notice that the

sample query above is filtering on ShippedDate which is not used anywhere else in the query.

You are, however, restricted from including any aggregates in the WHERE clause. This makes sense when you consider that the WHERE clause is processed by SQL before the aggregate functions. Remember these rules when building the WHERE clause:

Any column can be used in the WHERE clause.
Nonaggregate expressions can be used in the WHERE clause.

The GROUP BY clause is one that query writers struggle with quite often. As its name suggest,

it is used to group the data. The aggregate functions are applied to the groups. If a GROUP BY

is not used in the query, then the aggregate functions are applied to the entire set of rows returned.

Remember, the WHERE clause may be used to filter out some of the rows from the table first.

You must include all columns in the GROUP BY clause that are used in the SELECT clause

unless the column

is part of an aggregate expression. One of my college professors insisted that the

opposite was true,

that you have to include in the SELECT clause any columns listed in the GROUP BY clause.

While it is true that the columns in the two clauses usually match, it is possible to list a column

in the GROUP BY clause without listing it in the SELECT clause. I can’t think of any good reason to

do this, however. I will include an example query demonstrating this technique and a better way to write

the same query later in the article.

If you don’t want to group on a column, don’t list it in the SELECT clause.

In the example query above, you can not list information about individual orders since

you are grouping the orders by EmployeeID. If you included the OrderID column in the SELECT

clause, for example, you would then have to list the column in the GROUP BY clause. Your

summary information would no longer apply to EmployeeID but to the individual orders, which wouldn’t be a summary at all.

There are cases where you will need to list the details about the row that satisfies certain

criteria based on an aggregate function, for example, the last order placed by each customer.

An sample query later in the article will show how to do this using a derived table.

Remember these rules when writing the GROUP BY clause:

The results will be grouped by the columns listed in the GROUP BY clause.
All columns found in the SELECT clause, including those used in nonaggregate expressions, must be listed in the GROUP BY clause.

The GROUP BY clause also has some optional arguments: ALL, WITH ROLLUP and WITH CUBE. These will not

be covered in this article.

The HAVING clause is used to filter rows after the grouping has been applied, but before

the results are returned to the client. You can only include aggregate expressions and

columns that are listed in the GROUP BY clause. It is possible to include different aggregate functions than those

in the SELECT clause. In the example query above, you could write a query using the HAVING

clause to include only employees whose orders' average freight was above a certain amount.

You could also include the EmployeeID column in the HAVING clause since it is listed in the GROUP BY clause. Chances are, though, that it would make more sense to filter the EmployeeID column in the WHERE clause instead.

Any column or expression listed in the GROUP BY clause may be included in the HAVING clause.
Filter aggregate results using the HAVING clause, not the WHERE clause.

The last clause is the ORDER BY clause. You might think that the ORDER BY clause would not be necessary -- that SQL would order the results according to the GROUP BY columns, but this is not the case. If you want the results to sort according to the groups, the columns must be specified in the ORDER BY clause. The ORDER BY clause has the same rules as the SELECT clause, only aggregates and columns listed in the GROUP BY are allowed.

Any column or expression listed in the GROUP BY clause may be included in the ORDER BY clause.
The ORDER BY is required if you want ordered results.

Sample queries using Northwind

This query joins two tables, orders and order details. It returns the OrderID column along with the count of line items in each order if they were shipped via method 3 and if they have at least two line items. Finally, the results are ordered by the OrderID column:

SELECT o.OrderID, COUNT(*) AS Lines
FROM orders o JOIN [order details] od
ON o.OrderID = od.OrderID
WHERE ShipVia = 3
GROUP BY o.OrderID
HAVING COUNT(*) > 1
ORDER BY o.OrderID

You can’t update a column using an aggregate function directly, but you can use a derived table to product

the desired results.

First, run this statement to add a column to the Employees table in Northwind:

ALTER TABLE Employees ADD ORDER_COUNT INT NULL

The following query will generate an error:


UPDATE e
SET ORDER_COUNT = COUNT(*)
FROM Employees e JOIN Orders o
ON e.EmployeeID = o.EmployeeID

This update query using a derived table shown in blue works fine:

UPDATE e
SET ORDER_COUNT = ORDERCOUNT
FROM Employees e JOIN 
(SELECT COUNT(*) AS ORDERCOUNT, EmployeeID 
FROM ORDERS 
GROUP BY EmployeeID) o
ON e.EmployeeID = o.EmployeeID

This example uses a couple of expressions. Notice that they do not

need to be included in the GROUP BY clause because they do not use any of the table's columns:

SELECT getDate() as CurrentDate, 1 + 2 as Add1Plus2, 
EmployeeID, count(*) as ORDERCOUNT
FROM Orders
GROUP BY EmployeeID

The query below will run without error because the column, OrderDate, used in the

expressions is listed in the GROUP BY clause. It does not, however, give the expected results:


SELECT COUNT(*) AS OrderCount, 
   MONTH(OrderDate) AS OrderMonth,
   YEAR(OrderDate) AS OrderYear
FROM Orders
GROUP BY OrderDate
ORDER BY OrderDate

To get the correct results, a summary by year and month, the query must be grouped by the actual expressions, not just

the column:

SELECT COUNT(*) AS OrderCount, 
   MONTH(OrderDate) AS OrderMonth, 
   YEAR(OrderDate) AS OrderYear
FROM Orders GROUP BY YEAR(OrderDate), MONTH(OrderDate)
ORDER BY YEAR(OrderDate), MONTH(OrderDate)

It is possible to list a column in the GROUP BY clause and not in the SELECT clause. Here is a query

that runs but doesn't really return any useful information:

SELECT MAX(OrderID) AS lastOrder
FROM Orders
GROUP BY CustomerID

There may be times when this technique can be used with a sub-query to solve a particular

problem. In this example, I am listing more information about the latest order for each customer:

SELECT OrderID, CustomerID, OrderDate, EmployeeID, 
   ShipVia, Freight, ShipName
FROM Orders
WHERE OrderID in (
   SELECT MAX(OrderID) 
   FROM Orders
   GROUP BY CustomerID)

The query could be better written with a derived table:

SELECT OrderID, o.CustomerID, EmployeeID, OrderDate, ShipVia, Freight, ShipName
FROM Orders o JOIN 
   (SELECT MAX(OrderID) as MaxOrder, CustomerID FROM orders GROUP BY CustomerID) m
ON o.CustomeriD = m.CustomerID
WHERE OrderID = MaxOrder

Conclusion

Writing aggregate queries just takes a little practice once you understand the rules.

Remember to first figure out what you need to see in the results (SELECT) and which tables or

views the data will come from (FROM). Filter out any rows you don’t want included at all (WHERE).

Determine how the results will be grouped (GROUP BY) by looking at the SELECT clause.

Decide if you need to filter based on the groups using aggregate functions (HAVING).

Finally, sort the results (ORDER BY). Often a derived table will help solve more complex problems.

Aggregate Queries