What does ORDER BY do with a MAX() OVER PARTITION BY window function call?

Question

What does ORDER BY do with a MAX() OVER PARTITION BY window function call?

Frank The Tank

SSC Eights!

Points: 870
More actions
January 26, 2016 at 12:26 pm

#315320

Hi All
SQL Server 2012. Apologies if this is well trodden ground, but a quick google search and forum search here has come up empty so far. I'm trying to figure out why, when using the MAX() OVER window function, an ORDER BY clause is even necessary, let alone why it impacts the results. I have the following example:
CREATE TABLE #test
(
[GroupName] [varchar](100) NULL,
[Value] [int] NULL,
[sortVal] [int] NULL
) ON [PRIMARY]
insert into #test
values ('a',0,5)
,('a',3,10)
,('a',2,15)
,('b',2,0)
,('b',4,20)
,('b',6,5)
,('c',4,2)
,('c',3,4)
select * from #test order by groupname
select
groupname
,value
,sortVal
,max(value) over(partition by groupname order by sortVal) as max_By_sortVal
,max(value) over(partition by groupname) as max_unsorted
from #test
order by groupname, sortVal
drop table #test
With the above code I get the following two result sets:
Resultset 1:
GroupNameValuesortVal
a05
a310
a215
b20
b420
b65
c42
c34
Resultset 2:
groupnamevaluesortValmax_By_sortValmax_unsorted
a0503
a31033
a21533
b2026
b6566
b42066
c4244
c3444
I would expect the max(value) over (partition by groupname...) logic to return the same value in each row within a partition group, whether the order by clause is used or not. The max value for group a, for example, is 3. That's what I get in the max_unsorted column, which has no order by, but in the max_By_sortVal column (max(value) over(partition by groupname order by sortVal)) the value returned is 0 in the first row, 3 in the subsequent two rows.
I didn't see anything in the MSDN documentation that explains this behavior (nevermind that MSDN says ORDER BY is mandatory for the max() over window function, which it clearly is not). I get why ORDER BY is needed for things like LAG or even SUM where you want to build running totals, but why does it impact max() and min() which (to my feeble brain at least) should be agnostic to the ordering of the values in a group? Is this behavior a documented feature of the max() over functionality? Does anyone have a link to something that covers this? I was able to write code for my real-world use case that works using this logic, but we kind of need to know that it's a FeatureNotABug, so to speak, so that we know we can rely on consistent behavior from this logic?
Thanks

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply

Hugo Kornelis SSC Guru Points: 64790 More actions · Answer 1

Aggregate functions that do not use ORDER BY in the OVER clause (so that would be OVER(PARTITION BY ...) or just OVER (), which implies a single partition for the entire resultset) apply on all the values in the partition. From your post and question, you already knew that.

Aggregate functions that add ORDER BY and a "row or range" clause change the window used. For each row, the partition is determined (based on the PARTITION BY), then a subset of that partition is used based on the "row or range" clause. The most common "row or range" clause is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, or the short version of this: ROWS UNBOUDNED PRECEDING; this can be used for running totals (with SUM; with MAX you'd get a running maximum). When specifying a "rows or range" clause, an ORDER BY clause is also required in order to impose an order on the partition - without it the resultset is an unsorded collection and "preceding" and "following" are measningless.

Because an ORDER BY is required for the "rows or range" clause, omitting that clause and still including an ORDER BY clause will invoke a default "rows or range" clause: "RANGE BETWEEEN UNBOUNDED PRECEDING AND CURRENT ROW". Note that the first word here is RANGE. not ROWS. In my entire career, I have not yet encountered a single situation where this was the desired behaviour. In all cases, it was either functionally equivalent to ROWS (but lots slower, because RANGE introduces overhead), or it was different resulting in unexpected errors.

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Frank The Tank SSC Eights! Points: 870 More actions · Answer 2

Thanks Hugo, that definitely helps. I need to try a couple of tweaks with this information in mind to make sure I'm getting consistent results (and maybe better throughput too).