This is a reprint with some revisions of a series I originally published on LessThanDot. You can find the links to the original blogs on my Series page.
This is part 3 in my series on SQL window functions. In this post, we will explore using aggregation functions with T-SQL windows. SQL Server supports most of the aggregation functions such as SUM and AVG in this context with the exceptions of GROUPING and GROUPING_ID. However, prior to SQL Server 2012 only the PARTITION BY clause was supported which greatly limited the usability of aggregate window functions. When support for the ORDER BY clause was introduced in SQL Server 2012, more complex business problems such as running totals could be solved without the extensive use of cursors or nested select statement. In my experience, I used to try various ways to get around this limitation including pushing the data to .NET as it could solve this problem more efficiently. However, this was not always possible when working with reporting. Now that we are able to use SQL to solve the problem, more complex and low-performing solutions can be replaced with these window functions.
Once again, the following CTE will be used as the query in all examples throughout the post:
with CTEOrders as
(select cast(1 as int) as OrderID, cast(‘3/1/2012′ as date) as OrderDate, cast(10.00 as money) as OrderAmt, ‘Joe’ as CustomerName
union select 2, ‘3/1/2012′, 11.00, ‘Sam’
union select 3, ‘3/2/2012′, 10.00, ‘Beth’
union select 4, ‘3/2/2012′, 15.00, ‘Joe’
union select 5, ‘3/2/2012′, 17.00, ‘Sam’
union select 6, ‘3/3/2012′, 12.00, ‘Joe’
union select 7, ‘3/4/2012′, 10.00, ‘Beth’
union select 8, ‘3/4/2012′, 18.00, ‘Sam’
union select 9, ‘3/4/2012′, 12.00, ‘Joe’
union select 10, ‘3/4/2012′, 11.00, ‘Beth’
union select 11, ‘3/5/2012′, 14.00, ‘Sam’
union select 12, ‘3/6/2012′, 17.00, ‘Beth’
union select 13, ‘3/6/2012′, 19.00, ‘Joe’
union select 14, ‘3/7/2012′, 13.00, ‘Beth’
union select 15, ‘3/7/2012′, 16.00, ‘Sam’
)
select OrderID
,OrderDate
,OrderAmt
,CustomerName
from CTEOrders;
Using PARTITION BY with Aggregate Functions
SQL Server 2005 and the newer versions supports the usage of the PARTITION BY clause by itself. This allowed for some simple aggregate windows. The following example shows SUM and AVG for different partitions of data. The third function actually creates and average using a SUM and COUNT function.
select CustomerName
,OrderDate
,OrderAmt
,SUM(OrderAmt) OVER (PARTITION BY CustomerName) CustomerTotal
,AVG(OrderAmt) OVER (PARTITION BY OrderDate) AvgDailyAmt
,CAST(COUNT(OrderID) OVER (PARTITION BY OrderDate) as decimal(8,3)) / CAST(COUNT(OrderID) OVER() as decimal(8,3)) PctOfTotalPerDay
from CTEOrders
order by OrderDate;
NOTE: The COUNT aggregate returns an integer value. In order to return the decimal, the values need to be explicitly converted to decimal types. Otherwise, the result was rounding to zero for all results in this sample.
Results
CustomerName | OrderDate | OrderAmt | CustomerTotal | AvgDailyAmt | PctOfTotalPerDay |
Joe | 3/1/2012 | 10 | 68 | 10.5 | 0.133333333 |
Sam | 3/1/2012 | 11 | 76 | 10.5 | 0.133333333 |
Sam | 3/2/2012 | 17 | 76 | 14 | 0.2 |
Joe | 3/2/2012 | 15 | 68 | 14 | 0.2 |
Beth | 3/2/2012 | 10 | 61 | 14 | 0.2 |
Joe | 3/3/2012 | 12 | 68 | 12 | 0.066666667 |
Joe | 3/4/2012 | 12 | 68 | 12.75 | 0.266666667 |
Beth | 3/4/2012 | 10 | 61 | 12.75 | 0.266666667 |
Beth | 3/4/2012 | 11 | 61 | 12.75 | 0.266666667 |
Sam | 3/4/2012 | 18 | 76 | 12.75 | 0.266666667 |
Sam | 3/5/2012 | 14 | 76 | 14 | 0.066666667 |
Beth | 3/6/2012 | 17 | 61 | 18 | 0.133333333 |
Joe | 3/6/2012 | 19 | 68 | 18 | 0.133333333 |
Beth | 3/7/2012 | 13 | 61 | 14.5 | 0.133333333 |
Sam | 3/7/2012 | 16 | 76 | 14.5 | 0.133333333 |
Using Subselects
Subselect statements in SQL Server are supported, but harder to optimize in SQL Server versus Oracle. Until window functions were introduced all of the situations above could be solved by subselects, but performance would degrade as the results needed to work with larger sets of data. With the improved functionality in SQL Server 2012, you should not need to use subselects to return row-based aggregations. Besides the performance implications, maintenance will also be much simpler as the SQL becomes more transparent. For reference, here is the subselect syntax to return the same results as above:
select cte.CustomerName
, cte.OrderDate
, cte.OrderAmt
, (select SUM(OrderAmt) from CTEOrders where CustomerName = cte.CustomerName) CustomerTotal
, (select cast(COUNT(OrderID) as decimal(8,3)) from CTEOrders where OrderDate = cte.OrderDate) / (select cast(COUNT(OrderID) as decimal(8,3)) from CETOrders) AvgDailyAmt
from CETOrders cte
order by cte.OrderDate;
While it is possible to solve the same function using the subselects, the code is already getting messier and with data sets larger than what we have here, you would definitely see performance degradation.
Some Thoughts on GROUP BY
While I am digressing, I wanted to also highlight some details concerning GROUP BY. The one the biggest difficulties working with the GROUP BY clause and aggregates, every column must either be a part of the GROUP BY or have an aggregation associated with it. The window functions help solve this problem as well.
In the following examples, the first query returns the sum of the amount by day. This is pretty standard logic when working with aggregated queries in SQL.
select OrderDate
,sum(OrderAmt) as DailyOrderAmt
from CTEOrders
group by OrderDate;
However, if you wanted to see more details, but not include them in the aggregation, the following will not work.
select OrderDate
,OrderID
,OrderAmt
,sum(OrderAmt) as DailyOrderAmt
from CTEOrders
group by OrderDate
,OrderID
,OrderAmt;
This SQL statement will return each row individually with the sum at the detail level. You can solve this using the subselect above which is not recommended or you can use a window function.
select OrderDate
,OrderID
,OrderAmt
,sum(OrderAmt) OVER (PARTITION BY OrderDate) as DailyOrderAmt
from CTEOrders
As you can see here and in previous examples the OVER clause allows you to manage the grouping based on the context specified in relationship to the current row.
One other twist on the GROUP BY clause. First, I need to give credit to Itzik Ben-Gan for calling this to my attention at one of our Minnesota SQL Server User Group meetings. In his usual fashion he was showing some T-SQL coolness and he showed an interesting error when using the OVER clause with the GROUP BY clause.
The following will return an error because the first expression is an aggregate, but the second expression which is using the OVER clause is not. Also note that in this example the OVER clause is being evaluated for the entire set of data.
select sum(OrderAmt)
, sum(OrderAmt) over() as TotalOrderAmt
from CTEOrders
group by CustomerName
The expression above returns the following error:
Column ‘CTEOrders.OrderAmt’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause
The goal of the statement above was to show the customer’s total order amount with the overall order amount. The following statement resolves this issue because it is aggregating the aggregates. The window is now summing the aggregated amount which are grouped on the customer name.
select sum(OrderAmt)
, sum(sum(OrderAmt)) over() as TotalOrderAmt
from CTEOrders
group by CustomerName
Thanks again to Itzik for bringing this problem and resolution to my attention.
Aggregates with ORDER BY
With the expansion of the OVER clause to include ORDER BY support with aggregates, window functions increased their value substantially. One of the key business problems this allowed us to solve was a running aggregate.
The first example shows how to get a running total by CustomreName based on OrderDate and OrderID.
select OrderID
,OrderDate
,OrderAmt
,CustomerName
,SUM(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderDate, OrderID) as RunningByCustomer
from CTEOrders
ORDER BY CustomerName, OrderDate;
Results
OrderID | OrderDate | OrderAmt | CustomerName | RunningByCustomer |
3 | 3/2/2012 | 10 | Beth | 10 |
7 | 3/4/2012 | 10 | Beth | 20 |
10 | 3/4/2012 | 11 | Beth | 31 |
12 | 3/6/2012 | 17 | Beth | 48 |
14 | 3/7/2012 | 13 | Beth | 61 |
1 | 3/1/2012 | 10 | Joe | 10 |
4 | 3/2/2012 | 15 | Joe | 25 |
6 | 3/3/2012 | 12 | Joe | 37 |
9 | 3/4/2012 | 12 | Joe | 49 |
13 | 3/6/2012 | 19 | Joe | 68 |
2 | 3/1/2012 | 11 | Sam | 11 |
5 | 3/2/2012 | 17 | Sam | 28 |
8 | 3/4/2012 | 18 | Sam | 46 |
11 | 3/5/2012 | 14 | Sam | 60 |
15 | 3/7/2012 | 16 | Sam | 76 |
This next example is more creative. It begins to show how powerful the window functions are. In this statement, we are going to return the annual running total aggregated by day. The differentiator here is that we use a DATEPART function in the OVER clause to achieve the desired results.
select OrderID
,OrderDate
,OrderAmt
,CustomerName
,SUM(OrderAmt) OVER (PARTITION BY datepart(yyyy, OrderDate) ORDER BY OrderDate) as AnnualRunning
from CTEOrders
ORDER BY OrderDate;
Results
OrderID | OrderDate | OrderAmt | CustomerName | AnnualRunning |
1 | 3/1/2012 | 10 | Joe | 21 |
2 | 3/1/2012 | 11 | Sam | 21 |
3 | 3/2/2012 | 10 | Beth | 63 |
4 | 3/2/2012 | 15 | Joe | 63 |
5 | 3/2/2012 | 17 | Sam | 63 |
6 | 3/3/2012 | 12 | Joe | 75 |
7 | 3/4/2012 | 10 | Beth | 126 |
8 | 3/4/2012 | 18 | Sam | 126 |
9 | 3/4/2012 | 12 | Joe | 126 |
10 | 3/4/2012 | 11 | Beth | 126 |
11 | 3/5/2012 | 14 | Sam | 140 |
12 | 3/6/2012 | 17 | Beth | 176 |
13 | 3/6/2012 | 19 | Joe | 176 |
14 | 3/7/2012 | 13 | Beth | 205 |
15 | 3/7/2012 | 16 | Sam | 205 |
The ORDER BY clause creates an expanding group within the partition. In the examples above, the partition was the customer. Within each partition, ordered groups based on OrderDate and OrderID are “created”. At each row, the OrderDate and OrderID groups are aggregated up to the current row’s group thus producing the running total. If more than one row has the same order grouping, all of the rows in the group are aggregated into the total as shown in the second example above with the days and years.
Aggregates with ROWS
The ROWS clause is used to further define the partition by specifying which physical rows to include based on their proximity to the current row. As noted in the first post in the series, ROWS requires the ORDER BY clause as this determines the orientation of the partition.
The following example uses the FOLLOWING keywords to find the next two purchases that the customer made.
select OrderID
,OrderDate
,OrderAmt
,CustomerName
,SUM(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderDate, OrderID ROWS BETWEEN 1 FOLLOWING and 2 FOLLOWING) as NextTwoAmts
from CTEOrders
order by CustomerName, OrderDate, OrderID;
Results
OrderID | OrderDate | OrderAmt | CustomerName | NextTwoAmts |
3 | 3/2/2012 | 10 | Beth | 21 |
7 | 3/4/2012 | 10 | Beth | 28 |
10 | 3/4/2012 | 11 | Beth | 30 |
12 | 3/6/2012 | 17 | Beth | 13 |
14 | 3/7/2012 | 13 | Beth | NULL |
1 | 3/1/2012 | 10 | Joe | 27 |
4 | 3/2/2012 | 15 | Joe | 24 |
6 | 3/3/2012 | 12 | Joe | 31 |
9 | 3/4/2012 | 12 | Joe | 19 |
13 | 3/6/2012 | 19 | Joe | NULL |
2 | 3/1/2012 | 11 | Sam | 35 |
5 | 3/2/2012 | 17 | Sam | 32 |
8 | 3/4/2012 | 18 | Sam | 30 |
11 | 3/5/2012 | 14 | Sam | 16 |
15 | 3/7/2012 | 16 | Sam | NULL |
As we noted in the first blog, the last two rows in the partition only contain partial values. For example, order 12 contains the sum of only one order, 14, and order 14 has now rows following it in the partition and returns NULL as a result. When working with the ROWS clause this must be taken into account.
Aggregates with RANGE
Lastly, adding the RANGE to the OVER clause allows you to create aggregates which go to the beginning or end of the partition. RANGE is commonly used with UNBOUNDED FOLLOWING which goes to the end of the partition and UNBOUNDED PRECEDING which goes to the beginning of the partition. One of the most common use would be to specify the rows from the beginning of the partition to the current row which allows for aggregations such as year to date.
In the example below, we are calculating the average order size over time to the current row. This could be a very effective in a trending report.
select OrderID
,OrderDate
,OrderAmt
,CustomerName
,AVG(OrderAmt) OVER (ORDER BY OrderID RANGE BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as AvgOrderAmt
from CTEOrders
order by OrderDate;
Results
OrderID | OrderDate | OrderAmt | CustomerName | AvgOrderAmt |
1 | 3/1/2012 | 10 | Joe | 10 |
2 | 3/1/2012 | 11 | Sam | 10.5 |
3 | 3/2/2012 | 10 | Beth | 10.333333 |
4 | 3/2/2012 | 15 | Joe | 11.5 |
5 | 3/2/2012 | 17 | Sam | 12.6 |
6 | 3/3/2012 | 12 | Joe | 12.5 |
7 | 3/4/2012 | 10 | Beth | 12.142857 |
8 | 3/4/2012 | 18 | Sam | 12.875 |
9 | 3/4/2012 | 12 | Joe | 12.777777 |
10 | 3/4/2012 | 11 | Beth | 12.6 |
11 | 3/5/2012 | 14 | Sam | 12.727272 |
12 | 3/6/2012 | 17 | Beth | 13.083333 |
13 | 3/6/2012 | 19 | Joe | 13.538461 |
14 | 3/7/2012 | 13 | Beth | 13.5 |
15 | 3/7/2012 | 16 | Sam | 13.666666 |
As you can see, the latest versions of OVER clause supports powerful yet simple aggregations which can help in a multitude of reporting and business solutions. Up next, the last blog in the series – Analytic Functions which are all new in SQL Server 2012.