What does ORDER BY do with a MAX() OVER PARTITION BY window function call?

  • Hi All

    SQL Server 2012. Apologies if this is well trodden ground, but a quick google search and forum search here has come up empty so far. I'm trying to figure out why, when using the MAX() OVER window function, an ORDER BY clause is even necessary, let alone why it impacts the results. I have the following example:

    CREATE TABLE #test

    (

    [GroupName] [varchar](100) NULL,

    [Value] [int] NULL,

    [sortVal] [int] NULL

    ) ON [PRIMARY]

    insert into #test

    values ('a',0,5)

    ,('a',3,10)

    ,('a',2,15)

    ,('b',2,0)

    ,('b',4,20)

    ,('b',6,5)

    ,('c',4,2)

    ,('c',3,4)

    select * from #test order by groupname

    select

    groupname

    ,value

    ,sortVal

    ,max(value) over(partition by groupname order by sortVal) as max_By_sortVal

    ,max(value) over(partition by groupname) as max_unsorted

    from #test

    order by groupname, sortVal

    drop table #test

    With the above code I get the following two result sets:

    Resultset 1:

    GroupNameValuesortVal

    a05

    a310

    a215

    b20

    b420

    b65

    c42

    c34

    Resultset 2:

    groupnamevaluesortValmax_By_sortValmax_unsorted

    a0503

    a31033

    a21533

    b2026

    b6566

    b42066

    c4244

    c3444

    I would expect the max(value) over (partition by groupname...) logic to return the same value in each row within a partition group, whether the order by clause is used or not. The max value for group a, for example, is 3. That's what I get in the max_unsorted column, which has no order by, but in the max_By_sortVal column (max(value) over(partition by groupname order by sortVal)) the value returned is 0 in the first row, 3 in the subsequent two rows.

    I didn't see anything in the MSDN documentation that explains this behavior (nevermind that MSDN says ORDER BY is mandatory for the max() over window function, which it clearly is not). I get why ORDER BY is needed for things like LAG or even SUM where you want to build running totals, but why does it impact max() and min() which (to my feeble brain at least) should be agnostic to the ordering of the values in a group? Is this behavior a documented feature of the max() over functionality? Does anyone have a link to something that covers this? I was able to write code for my real-world use case that works using this logic, but we kind of need to know that it's a FeatureNotABug, so to speak, so that we know we can rely on consistent behavior from this logic?

    Thanks

  • Aggregate functions that do not use ORDER BY in the OVER clause (so that would be OVER(PARTITION BY ...) or just OVER (), which implies a single partition for the entire resultset) apply on all the values in the partition. From your post and question, you already knew that.

    Aggregate functions that add ORDER BY and a "row or range" clause change the window used. For each row, the partition is determined (based on the PARTITION BY), then a subset of that partition is used based on the "row or range" clause. The most common "row or range" clause is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, or the short version of this: ROWS UNBOUDNED PRECEDING; this can be used for running totals (with SUM; with MAX you'd get a running maximum). When specifying a "rows or range" clause, an ORDER BY clause is also required in order to impose an order on the partition - without it the resultset is an unsorded collection and "preceding" and "following" are measningless.

    Because an ORDER BY is required for the "rows or range" clause, omitting that clause and still including an ORDER BY clause will invoke a default "rows or range" clause: "RANGE BETWEEEN UNBOUNDED PRECEDING AND CURRENT ROW". Note that the first word here is RANGE. not ROWS. In my entire career, I have not yet encountered a single situation where this was the desired behaviour. In all cases, it was either functionally equivalent to ROWS (but lots slower, because RANGE introduces overhead), or it was different resulting in unexpected errors.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Thanks Hugo, that definitely helps. I need to try a couple of tweaks with this information in mind to make sure I'm getting consistent results (and maybe better throughput too).

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply