Let’s get this out of the way right now: I am SO going to take advantage of this T-SQL Tuesday on aggregates to talk about OVER() and PARTITION BY. Aggregates and OVER go together like french fries and catsup…sure, you can have them apart, but together they’re just magical.
Note: If you’re not at all familiar with OVER (PARTITION BY) at all, take a quick look at my brief intro, Tip: Over and Partition By.
In my T-SQL Brush-up session (also known as the Forgotten SQL session), I walk my way through fairly random examples set in the AdventureWorks database, and miraculously end up somewhere sensible.
(Tons more good talk and code after the jump…)
Select With Aggregates
Let’s start with a simple SELECT statement, chock full of aggregates:
/*****************************************************************************
------ Aggregates ------
This example gets the count, min/max/avg cost, and category of Products.
****************************************************************************/
SELECT
COUNT(*) ItemCount ,
MIN(P.ListPrice) MinCost ,
MAX(P.ListPrice) MaxCost ,
AVG(P.ListPrice) AvgCost ,
SC.Name [Category]
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
WHERE P.ProductSubCategoryID IS NOT NULL
AND P.Color IS NOT NULL
GROUP BY SC.Name
ORDER BY SC.Name;
That looks easy enough! We’re just getting some item COUNT, MIN/MAX/AVG cost action from our products table.
Rows and Aggregations, Side by Side
But we need a query for a report on individual items, not a summary. We want all the individual data alongside the aggregate data for each group. Enter the OVER (PARTITION BY) clause!
/*******************************************************************************
** OVER allows you to add aggregate columns within a query, without a GROUP BY. **
*******************************************************************************/
SELECT
P.ListPrice ,
COUNT(*) OVER ( PARTITION BY SC.Name ) [Item Count Per Category] , -- How many items IN EACH GROUP (SC.Name is category)
MIN(P.ListPrice) OVER ( PARTITION BY SC.Name ) [Min Cost Per Category] , -- Minimum price IN EACH GROUP
MAX(P.ListPrice) OVER ( PARTITION BY SC.Name) [Max Cost Per Category], -- Maximum price IN EACH GROUP
AVG(P.ListPrice) OVER ( ) [Average Cost Overall], -- Average price over the whole rowset
Count(P.ListPrice) OVER ( ) [Count Overall] , -- Count of items over the whole rowset
SC.Name [Category] ,
P.ProductNumber
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
WHERE P.ProductSubCategoryID IS NOT NULL
AND P.Color IS NOT NULL;
A partial resultset looks like this:
ListPrice | Item Count Per Category | Min Cost Per Category | Max Cost Per Category | Average Cost Overall | Count Overall | Category | ProductNumber |
89.99 | 3 | 89.99 | 89.99 | 885.3284 | 245 | Bib-Shorts | SB-M891-S |
89.99 | 3 | 89.99 | 89.99 | 885.3284 | 245 | Bib-Shorts | SB-M891-M |
89.99 | 3 | 89.99 | 89.99 | 885.3284 | 245 | Bib-Shorts | SB-M891-L |
106.5 | 2 | 106.5 | 106.5 | 885.3284 | 245 | Brakes | RB-9231 |
106.5 | 2 | 106.5 | 106.5 | 885.3284 | 245 | Brakes | FB-9873 |
Notice that this query has no GROUP BY, because we want to see all the individual prodcts. This is, of course, very cool and convenient.
Ranking with Row_Number
Now, let’s melt the cheddar cheesy goodness of Row_Number over the top of this query…we’ll rank each product based on price, within its subcategory. So for example, all the gloves will be ranked from cheapest to most expensive, 1 to however many gloves we sell:
/*****************************************************************************
------ Ranking ------
******************************************************************************/
SELECT
ROW_NUMBER() OVER ( PARTITION BY SC.Name ORDER BY P.ListPrice ) RankNum ,
COUNT(*) OVER ( PARTITION BY SC.Name ) [Count],
P.ProductNumber,
P.ListPrice ,
SC.Name [Category]
FROM Production.Product P
JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
WHERE P.ProductSubCategoryID IS NOT NULL
AND P.Color IS NOT NULL;
This gives us a rowset like this (partial):
RankNum | Count | ProductNumber | ListPrice | Category |
1 | 3 | CS-4759 | 175.49 | Cranksets |
2 | 3 | CS-6583 | 256.49 | Cranksets |
3 | 3 | CS-9183 | 404.99 | Cranksets |
1 | 2 | FD-2342 | 91.49 | Derailleurs |
2 | 2 | RD-2308 | 121.46 | Derailleurs |
1 | 6 | GL-H102-S | 24.49 | Gloves |
2 | 6 | GL-H102-M | 24.49 | Gloves |
Let’s take a quick step back. What we have here is a list of all products, grouped by category and ranked from cheapest to most expensive (within category). What could we do with that kind of information? Well, how about we put together a package deal for the salespeople to use….say, a “Frugal Shopper Special” of the least expensive item from each category we sell!
CTE FTW
There’s no way to ferret out this information with a basic SELECT-FROM-WHERE query. You can’t say “WHERE Row_Number() = 1″ (because windowed functions can only appear in the SELECT or ORDER BY), or “WHERE RankNum = 1″ (because column level aliases can only appear in the ORDER BY clause). We have to make the SQL engine pretend that this rowset is a persisted table. While, yes, you could load this data into a temporary table or table variable, we know (in our particular circumstance) that we won’t need to access that data again. Why not use a CTE?
Note: For an intro to CTEs, see N Things Worth Knowing About CTEs.
We stick the entire query from above into a CTE, and just select all columns WHERE RankNum = 1 (that is, for the cheapest item in each category):
WITH RankedProducts ( RankNum, Name, ProductNumber, Color, ListPrice, Category, StandardCost )
AS ( SELECT ROW_NUMBER() OVER ( PARTITION BY SC.Name ORDER BY P.ListPrice ) RankNum ,
P.ProductNumber ,
P.ListPrice ,
SC.Name [Category],
StandardCost
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
WHERE P.ProductSubCategoryID IS NOT NULL
AND P.Color IS NOT NULL
)
SELECT ProductNumber ,
ListPrice ,
Category ,
SUM(StandardCost) OVER () TotalStandardCost,
SUM(ListPrice) OVER ( ) TotalCost
FROM RankedProducts
WHERE RankNum = 1
ORDER BY Category;
And you’ll notice there at the end, we added the sum of standard cost and of list price for the entire set…those numbers mean “sum of what it costs to make these items” and “sum of what we normally sell these items for”. Full result set:
ProductNumber | ListPrice | Category | TotalStandardCost | TotalCost |
SB-M891-S | 89.99 | Bib-Shorts | 2120.3968 | 3833.625 |
RB-9231 | 106.5 | Brakes | 2120.3968 | 3833.625 |
CA-1098 | 8.99 | Caps | 2120.3968 | 3833.625 |
CH-0234 | 20.24 | Chains | 2120.3968 | 3833.625 |
CS-4759 | 175.49 | Cranksets | 2120.3968 | 3833.625 |
FD-2342 | 91.49 | Derailleurs | 2120.3968 | 3833.625 |
GL-H102-S | 24.49 | Gloves | 2120.3968 | 3833.625 |
HL-U509-R | 34.99 | Helmets | 2120.3968 | 3833.625 |
HY-1023-70 | 54.99 | Hydration Packs | 2120.3968 | 3833.625 |
LJ-0192-S | 49.99 | Jerseys | 2120.3968 | 3833.625 |
BK-M18B-40 | 539.99 | Mountain Bikes | 2120.3968 | 3833.625 |
FR-M21B-40 | 249.79 | Mountain Frames | 2120.3968 | 3833.625 |
PA-T100 | 125 | Panniers | 2120.3968 | 3833.625 |
PD-M282 | 40.49 | Pedals | 2120.3968 | 3833.625 |
BK-R19B-44 | 539.99 | Road Bikes | 2120.3968 | 3833.625 |
FR-R38B-58 | 337.22 | Road Frames | 2120.3968 | 3833.625 |
SH-M897-M | 59.99 | Shorts | 2120.3968 | 3833.625 |
SO-R809-M | 8.99 | Socks | 2120.3968 | 3833.625 |
TG-W091-S | 74.99 | Tights | 2120.3968 | 3833.625 |
BK-T18U-54 | 742.35 | Touring Bikes | 2120.3968 | 3833.625 |
FR-T67U-50 | 333.42 | Touring Frames | 2120.3968 | 3833.625 |
VE-C304-S | 63.5 | Vests | 2120.3968 | 3833.625 |
FW-M423 | 60.745 | Wheels | 2120.3968 | 3833.625 |
Now the sales folk have which items to put in the package, and the current profit margin (so they can play around with discounts). It is, in short, a delicious heap of steaming potato-, catsup-, and cheesy-goodness. Okay, now I want fries…
For more on these particular examples, see the recording of T-SQL Brush-up (requires PASS login), given at 24 Hours of PASS Fall 2010 (and download the demo code here).
Happy Tuesday!
Jen McCown