I looked at row_number() in a previous post. Now I want to build on that and do some counting of rows with COUNT() and the OVER clause. I’ll show how this differs a bit from a normal aggregate.
Another post for me that is simple and hopefully serves as an example for people trying to get blogging as #SQLNewBloggers. This is also part of a series on Window Functions.
The Raw Data
Let’s use AdventureWorks as a sample database. In the Sales.SalesOrderHeader table there are lots of customer orders, each with a Customer ID. I’ll limit my results to these customers, which has a spread of orders:
WHERE soh.CustomerID IN (11011, 11015, 11019, 11012)
We’ll use this WHERE clause to help us decide
Counting Orders
If I wanted to count the number of orders each customer hasplaced, I can use this code:
SELECT
soh.CustomerID
, COUNT (*) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
soh.CustomerID IN ( 11011, 11015, 11019, 11012 )
GROUP BY soh.CustomerID;
I see these results:
We can see that each of these customers has different numbers of orders. If I want a simple summary, this is the best code. However, what if I want to add other columns?
Supposed I want to add not only the customer ID, but also the shipdate. I’m wondering when orders were shipped. If I use the code above, I’d need to add this in the column list, and the group by. If I do this, I get this code:
SELECT
soh.CustomerID
, soh.ShipDate
, COUNT (*) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
soh.CustomerID IN ( 11011, 11015, 11019, 11012 )
GROUP BY
soh.CustomerID
, soh.ShipDate;
And this result:
The results are cut off, but they repeat with a 1 for each row. The front end can sum these, but it’s easy to make a mistake here. Especially if there are filters.
With an OVER() clause, my aggregate changes a bit. I would use this code, where I am paritioning, or essentially grouping, the data by the customer ID. I don’t have a group by, so I don’t need to add other fields in two places. Here is the code:
SELECT
soh.CustomerID
, soh.ShipDate
, COUNT (*) OVER (PARTITION BY soh.CustomerID) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
soh.CustomerID IN ( 11011, 11015, 11019, 11012 );
And the results.
I have the count repeated, but if I am showing this in the front end, I can hide that column in a table, but I can also access the total count from any row.
Differences with Group By
Let’s say I decide to add something else, like the account number and the SalesOrderNumber. If I do this in a GROUP BY, I need to add this to the column list and the group by, as shown here:
SELECT
soh.CustomerID
, soh.AccountNumber
, soh.SalesOrderNumber
, soh.ShipDate
, COUNT (*) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
soh.CustomerID IN ( 11011, 11015, 11019, 11012 )
GROUP BY
soh.CustomerID
, soh.ShipDate
, soh.AccountNumber
, soh.SalesOrderNumber;
GO
The change the Window function is as shown:
SELECT
soh.CustomerID
, soh.AccountNumber
, soh.SalesOrderNumber
, soh.ShipDate
, COUNT (*) OVER (PARTITION BY soh.CustomerID) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
soh.CustomerID IN ( 11011, 11015, 11019, 11012 );
Maintenance is easy, and what’s more, I don’t have to worry about ordering my group by in different ways. If I do care about grouping, then I can alter the partitioning, but often I just want to add other columns without making the code more complex in a second place.
The results for the group by have the fields, but all 1s in the count again.
The window function has the counts for each customer.
Small things, but those points of maintenance can be annoying and they can cause problems. For complex data, the other fields might not group well, or the order of grouping might change what is shown.
There are more things to be concerned about here, but one of the big things might be a running total, where I count the orders over time. I’d like results like this:
My window function code is:
SELECT
soh.CustomerID
, soh.AccountNumber
, soh.SalesOrderNumber
, soh.ShipDate
, COUNT (*) OVER (PARTITION BY soh.CustomerID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS OrderCount
FROM Sales.SalesOrderHeader AS soh
WHERE
soh.CustomerID IN ( 11011, 11015, 11019, 11012 );
The regular code with group by is this:
select soh1.customerid, soh1.AccountNumber, soh1.SalesOrderNumber, soh1.ShipDate
, count(*) 'running_total'
from sales.salesorderheader soh1
inner join sales.salesorderheader soh2 on soh2.salesorderid <= soh1.salesorderid
and soh2.customerid = soh1.customerid
group by soh1.customerid, soh1.AccountNumber, soh1.SalesOrderNumber, soh1.ShipDate
order by soh1.customerid, soh1.ShipDate
Not bad, but the big difference is in resources used. The statistics IO below shows the difference. Can you guess which is which?
Hint, the one using less reads is the OVER() code.
SQL New Blogger
Setting up a good scenario is a little tricky, and that took time. I messed with a few data sets that helped explain this to me, and you, in a way that made sense. I have a few other scenarios, which I will write about because an OVER() isn’t magic and it might not be the right choice.
Those will be other blogs.
This is a good way to showcase that you understand part of how this works. There’s a whole series to be written on different aspects of how the OVER() clause works. Spending 15-30 minutes each time you experiment helps showcase your knowledge and learning to employers.
That’s what employers need: experimenting and learning.