In Stairway 6 I showed you how to sort your data using the ORDER BY clause. This allowed you to organize your detailed records in sort order based on single, or multiple columns. Detailed data is great if you want to see data in specific records, but sometimes you need to roll up the detailed data into summarized values. Summarizing your data you can done using the GROUP BY clause.
There are two types of GROUP BY clauses. One that is known as the simple GROUP BY clause and another that provides a little more than simple summarization that is called the general GROUP BY clause. The main difference between these two type is the simple GROUP BY contains just the GROUP BY clause, whereas the general GROUP BY clause contains other operators like ROLLUP and CUBE.
In this article I will be covering how to group data using the simple GROUP BY clause. In a follow-up article I will be covering the more complex general GROUP BY clause.
Simple GROUP BY clause
Using the simple GROUP BY clause allows you to aggregate your data based on a single column, multiple columns or expressions. Only one summarized row will be returned for each unique value based on the columns and/or expressions specified in the GROUP BY clause. When SQL Server processes a GROUP BY clause it groups the detailed records by the unique column or expression values, and then summarizes each set based on the aggregation functions included in the select list.
To better grasp how to use the GROUP BY, let’s assume you have a table that contains detailed sales information for different stores and you want to summarize total sales amount by store. You can use the GROUP BY clause to aggregate the total sales amount by each store. In this example the unique column you would group on would be store name, and the column to be aggregated would be the sales amount. Your results would show one row for each unique store name, and the row for each store would contain the sum of the sales amounts for that store.
SQL Server has some limitation on what columns can be included in the SELECT list of a GROUP BY query. Each column specified in the SELECT list of a GROUP BY query needs to fall into one of the following categories:
- A column specified in the GROUP BY clause
- An expression specified in the GROUP BY clause
- The value returned from an aggregate function
If a column doesn’t fall into one of these categories you will get an error when trying to run your GROUP BY query. Note a column or expression contained in the GROUP BY clause is not required to be in the select list.
Let me go through a couple of examples to help demonstrate how to obtain summarized values using the simple GROUP BY clause.
Sample Data for Exploring the Simple GROUP BY Clause
In order to demonstrate how to use a simple GROUP BY clause I need to build some sample data. I am providing a script to create my sample data so you can run the sample code provided in this article. Use the script in Listing 1 to build and populate the sample tables.
USE tempdb; GO SET NOCOUNT ON; -- Create Sales Table CREATE TABLE dbo.SalesTransaction (Id INT IDENTITY PRIMARY KEY ,CustomerName VARCHAR(65) ,TotalSalesAmount money ,SalesTypeDesc VARCHAR(200) ,SalesDateTime DATETIME ,StoreName VARCHAR(100)); -- Add data to Sales Table INSERT INTO dbo.SalesTransaction VALUES ('John Smith', 124.23,'Software','09/22/2011 11:51:12 AM','The Software Outlet'); INSERT INTO dbo.SalesTransaction VALUES ('Jack Thomas', 29.56,'Computer Supplies','09/23/2011 10:21:49 AM','The Software Outlet'); INSERT INTO dbo.SalesTransaction VALUES ('Sue Hunter', 89.45,'Computer Supplies','09/23/2011 2:51:56 AM','The Software Outlet'); INSERT INTO dbo.SalesTransaction VALUES ('Karla Johnson', 759.12,'Software','09/23/2011 2:54:37 PM','The Software Outlet'); INSERT INTO dbo.SalesTransaction VALUES ('Gary Clark', 81.51,'Software','09/22/2011 11:08:52 AM','Discount Software'); INSERT INTO dbo.SalesTransaction VALUES ('Scott Crochet', 12345.78,'Computer Supplies','09/23/2011 3:12:37 PM','Discount Software'); INSERT INTO dbo.SalesTransaction VALUES ('Sheri Holtz', 12.34,'Software','09/23/2011 10:51:42 AM','Discount Software'); INSERT INTO dbo.SalesTransaction VALUES ('Mary Lee', 101.34,'Software','09/23/2011 09:37:19 AM','Discount Software'); INSERT INTO dbo.SalesTransaction VALUES ('Sally Davisson', 871.12,'Software','09/22/2011 05:21:28 PM','Discount Software'); INSERT INTO dbo.SalesTransaction VALUES ('Rod Kaplan', 2345.19,'Computer Supplies','09/23/2011 5:01:11 PM','Discount Software'); INSERT INTO dbo.SalesTransaction VALUES ('Sandy Roberts', 76.38,'Books','09/23/2011 4:51:57 PM','Computer Books and Software'); INSERT INTO dbo.SalesTransaction VALUES ('Marc Trotter', 562.94,'Software','09/23/2011 6:51:43 PM','Computer Books and Software');
Listing 1: Script to create Sample Data
If you look through the script in Listing 1 you will find I created the dbo.SalesTransaction table. I then inserted a number of records into this table. I will use this table to demonstrate how to use a simple GROUP BY clause to aggregate data.
Grouping by a Single Column
Using the sample table created using Listing 1 , this first example will use the GROUP BY clause to summarize data based on a single column. My example in Listing 2 summarizes my sample data based on the StoreName column.
USE tempdb; GO SELECT StoreName ,SUM(TotalSalesAmount) AS StoreSalesAmount FROM dbo.SalesTransaction GROUP BY StoreName;
Listing 2: GROUP BY based on Single columns
When the code in Listing 2 is executed against my sample table the following aggregated rows in Report 1 are returned.
StoreName StoreSalesAmount ------------------------------------------ ---------------- Computer Books and Software 639.32 Discount Software 15757.28 The Software Outlet 1002.36 |
Report 1: Summarizing sample data based on a single column
If you review the output in Report 1 you can see that only one aggregated row is returned for each unique value of StoreName. The StoreSalesAmount on each record is calculated by summing up the TotalSalesAmount column for each store’s Sales records using the SUM function.
Grouping by Multiple Columns
There are times when you need to develop a report in which your data needs to be grouped by multiple columns. In order to accomplish this all you need to do is add additional columns to the GROUP BY clause. When multiple columns are specified in the GROUP BY clause SQL Server aggregates the detailed rows based on each unique combination of values from the columns in the GROUP BY clause. In Listing 3 I have expanded the query in Listing 2 by adding a second column to the GROUP BY clause.
USE tempdb; GO SELECT StoreName, SalesTypeDesc ,SUM(TotalSalesAmount) AS StoreSalesAmount FROM dbo.SalesTransaction GROUP BY StoreName, SalesTypeDesc;
Listing 3: GROUP BY based on Single columns
When I run the code in Listing 3 against my sample data I get the results in Report 2.
StoreName SalesTypeDesc StoreSalesAmount ---------------------------- ------------------ ---------------- Computer Books and Software Books 76.38 Discount Software Computer Supplies 14690.97 The Software Outlet Computer Supplies 119.01 Computer Books and Software Software 562.94 Discount Software Software 1066.31 The Software Outlet Software 883.35 |
Report 2: Output from Running Listing 1
In Report 2 you can see that StoreSalesAmount is now summarized at the StoreName and SalesTypeDesc level. Also note that the aggregated rows returned are not in sorted order based on the columns in the GROUP BY clause. If I wanted the summarized data to appear in StoreName order then I would have needed to include an ORDER BY clause on the SELECT statement. I will leave it up to you to add the ORDER BY to the code in Listing 3 to return the summarized data in StoreName order.
Using an Expression in the GROUP BY Clause
There are times that you may want to group your data by something other than a specific column or set of columns. For example you might want to summarize your data based on the first few characters of some VARCHAR column, or maybe just the date, or month of a DATETIME column. SQL Server allows you to specify expressions in the GROUP BY clause to accomplish this. An expression could be any valid expression that is based on a column in the detailed record set that is being aggregated. To demonstrate how to use an expression in the GROUP BY clause look at the code in Listing 4.
USE tempdb; GO SELECT CONVERT(CHAR(10),SalesDateTime,101) AS SalesDate ,SUM(TotalSalesAmount) AS TotalSalesAmount FROM dbo.SalesTransaction GROUP BY CONVERT(CHAR(10),SalesDateTime,101);
Listing 4: GROUP BY based on Single columns
In listing 4, the SELECT statement is grouping the data based on an expression, in this case a CONVERT function. If you use an expression in the GROUP BY clause the same exact expression must be used in the SELECT list. The CONVERT function is parsing the SalesDateTime column, and returning only the date portion of this column. Using the CONVERT function in the GROUP BY clause allows me to summarize the Sales data based on the actual dates of the different Sales records. By doing this I was able to summarize my sample to get the TotalSalesAmount by date for all stores, as shown in Report 3.
SalesDate TotalSalesAmount ---------- ---------------- 09/22/2011 1076.86 09/23/2011 16322.10 |
Report 3: Output when summarizing data based on expression
Using expressions allows to you to programmatically identify which parts of your detailed data will be used to aggregate your data.
HAVING Clause
If you are aggregating data with the GROUP BY clause you might want to not return all the aggregated values. Instead you may want to only return a subset of the aggregated values. The HAVING clause can be used to selectively identify the aggregated values you want to return from the GROUP BY summarization.
Normally when we SELECT data we use the WHERE clause to restrict the rows that are returned. The only problem with that is the WHERE clause operates on row values, and not aggregated values. Therefore the WHERE clause is unable to use aggregated values created by the GROUP BY clause. However, adding a HAVING clause after your GROUP BY clause provides you a way to specify conditions to identify the specific summarized values that you want returned. To better understand this let me provide you with a couple of examples.
One of the common things the HAVING clause might be used for when looking at store sales data is to determine the stores that are not meeting a specific sales quota. If you wanted to find all the stores that didn’t meet a minimum sales amount you could do that with the code in Listing 5.
USE tempdb; GO SELECT StoreName ,SUM(TotalSalesAmount) AS StoreSalesAmount FROM dbo.SalesTransaction GROUP BY StoreName HAVING SUM(TotalSalesAmount) < 1000.00;
Listing 5: Restricting result set by using HAVING clause
In Listing 5 I restricted the result set to those stores that had an aggregated TotalSalesAmount of less than 1000.00. In my trivial example here, you will find the StoreName of “Computer Books and Software” is the only store that didn’t meet the $1000.00 sales quota amount.
The HAVING clause can be used on columns that are also not aggregated. If you want to restrict the rows returned based on specific value of any one of the columns used in the GROUP BY clause you can also do that, and Listing 6 demonstrates this.
USE tempdb; GO SELECT StoreName ,SUM(TotalSalesAmount) AS StoreSalesAmount FROM dbo.SalesTransaction GROUP BY StoreName HAVING StoreName LIKE '%Outlet%' OR StoreName LIKE '%Books%';
Listing 6: Restricting result set based on GROUP BY column
In Listing 6 I only wanted to see summarized data for stores that have either “Outlet” or “Books” in their store name This example also demonstrates that you can have multiple conditions in the HAVING clause. Another way to think of the difference between WHERE and HAVING is that the WHERE clause filters out data rows before the data is aggregated, and the HAVING clause filters out aggregated rows after the GROUP BY is applied.
Summarizing Data with the Simple GROUP BY Clause
In this article I showed you how to use the simple GROUP BY clause to summarize your data. I discussed how to use a single column, multiple columns, as well as expressions in the GROUP BY clause to summarize detailed data. By using what I have demonstrated, you should now be able to build a simple GROUP BY clause to summarize your data, and optionally filter the summarized data using HAVING.
In my next article I will be expanding my discussion of the GROUP BY clause. In this follow-up article I will show you how to use the CUBE and ROLLUP operators to produce additional summarized data, such as sub-total and grand total values.