October 16, 2019 at 2:28 pm
Hello Everyone
Thought I knew how Window functions worked in SQL Server but that was before I ran below script
This script is part of a CTE which is then fed into other CTEs downstream - but I am having hard time understanding how the MAX function here is deriving maximum date for each row
For example - ids 14 to 18 appears to be in one group with same max_date (2017-05-22 21:05:10.000) , ids 3 - 5 again have the same max date(2016-08-03 15:25:07.000) - not sure how the grouping works and what actually 'order by' is doing here
Script below
DROP TABLE IF EXISTS #test
CREATE TABLE #test(
[id] [int] IDENTITY(1,1) NOT NULL,
[customer_id] [int] NULL,
[start_Date] [datetime] NULL,
[end_date] [datetime] NULL
) ON [PRIMARY]
GO
SET IDENTITY_INSERT #test ON
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (1, 111, CAST(N'2016-05-04T22:34:38.000' AS DateTime), CAST(N'2016-07-03T22:34:38.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (2, 111, CAST(N'2016-05-05T08:51:08.000' AS DateTime), CAST(N'2016-08-03T08:51:08.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (3, 111, CAST(N'2016-05-05T15:25:07.000' AS DateTime), CAST(N'2016-08-03T15:25:07.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (4, 111, CAST(N'2016-05-05T15:25:40.000' AS DateTime), CAST(N'2016-05-19T15:25:40.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (5, 111, CAST(N'2016-05-06T08:50:23.000' AS DateTime), CAST(N'2016-06-05T08:50:23.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (6, 111, CAST(N'2017-05-16T20:22:44.000' AS DateTime), CAST(N'2017-07-15T20:22:44.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (7, 111, CAST(N'2017-07-18T15:15:42.000' AS DateTime), CAST(N'2017-08-01T15:15:42.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (8, 111, CAST(N'2017-07-18T15:16:29.000' AS DateTime), CAST(N'2017-07-25T15:16:29.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (9, 111, CAST(N'2017-07-26T10:16:46.000' AS DateTime), CAST(N'2017-10-24T10:16:46.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (10, 111, CAST(N'2018-09-27T11:19:13.000' AS DateTime), CAST(N'2018-10-04T11:19:13.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (11, 111, CAST(N'2019-01-30T10:16:46.000' AS DateTime), CAST(N'2019-02-06T10:16:46.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (12, 222, CAST(N'2016-12-20T16:32:09.000' AS DateTime), CAST(N'2017-03-20T16:32:09.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (13, 222, CAST(N'2016-12-20T16:33:39.000' AS DateTime), CAST(N'2017-01-19T16:33:39.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (14, 222, CAST(N'2017-02-21T21:05:10.000' AS DateTime), CAST(N'2017-05-22T21:05:10.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (15, 222, CAST(N'2017-02-21T21:09:25.000' AS DateTime), CAST(N'2017-03-07T21:09:25.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (16, 222, CAST(N'2017-02-28T13:47:14.000' AS DateTime), CAST(N'2017-04-29T13:47:14.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (17, 222, CAST(N'2017-04-04T11:10:20.000' AS DateTime), CAST(N'2017-05-04T11:10:20.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (18, 222, CAST(N'2017-04-04T11:10:20.000' AS DateTime), CAST(N'2017-05-04T11:10:20.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (19, 222, CAST(N'2017-10-26T15:25:36.000' AS DateTime), CAST(N'2017-11-25T15:25:36.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (20, 222, CAST(N'2018-04-12T09:29:28.000' AS DateTime), CAST(N'2018-04-19T09:29:28.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (21, 222, CAST(N'2018-04-12T09:43:29.000' AS DateTime), CAST(N'2018-07-11T09:43:29.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (22, 222, CAST(N'2018-08-02T23:10:08.000' AS DateTime), CAST(N'2018-09-01T23:10:08.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (23, 222, CAST(N'2019-08-20T11:15:03.000' AS DateTime), CAST(N'2019-09-19T11:15:03.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (24, 222, CAST(N'2019-10-15T15:34:53.000' AS DateTime), CAST(N'2019-10-22T15:34:53.000' AS DateTime))
INSERT #test ([id], [customer_id], [start_Date], [end_date]) VALUES (25, 222, CAST(N'2019-10-15T15:35:31.000' AS DateTime), CAST(N'2020-01-13T15:35:31.000' AS DateTime))
SET IDENTITY_INSERT #test OFF
SELECT * FROM #test AS t ORDER BY customer_id , t.start_Date , t.end_date
SELECT * , MAX(t.end_date) OVER (PARTITION BY t.customer_id ORDER BY t.start_Date ) AS max_date
FROM #test AS t
ORDER BY customer_id , t.start_Date , t.end_date
October 16, 2019 at 3:54 pm
if you compare the results with the ORDER BY t.start_Date and without, it might be clearer. Just having PARTITION BY t.customer_id will get the MAX(end_date) for each customer_id and return it for every row of that customer_id. The entire set that match the customer_id is the window regardless of which customer_id row you are looking at.
Having an ORDER BY in the OVER clause of MAX(end_date) causes it to evaluate only records before or up-to the current record in the function, so the window is growing for each row within the customer_id.
For your specific questions, since you ORDER BY start_date, which also seem to match order by id, then the following explanations hold true:
for customer_id = 111, id = 1 only has itself to compare end_date to so max_date = 2016-07-03 22:34:38.000
for customer_id = 111, id = 2, it has a later end_date than 1 so max_date = it's own end_date 2016-08-03 08:51:08.000
for customer_id = 111, id = 3, it has a later end_date than 1 or 2 so max_date = it's own end_date 2016-08-03 15:25:07.000
for customer_id = 111, id = 4, it has an earlier end_date than id = 3 so id = 3 is still the MAX(end_date) 2016-08-03 15:25:07.000
for customer_id = 111, id = 5, it has an earlier end_date than id = 3 so id = 3 is still the MAX(end_date) 2016-08-03 15:25:07.000
for customer_id = 111, id = 6, it has a later end_date than 1 or 2 or 3 or 4 so max_date - it's own end_date 2017-07-15 20:22:44.000
Same situation for id 14,15,16,17,18.
15,16,17,18 have later start_date than 14 and also have earlier end_date so the end_date of id = 14 is the MAX(end_date) for each of those rows.
Here's an article that describes in detail how they work:
https://www.red-gate.com/simple-talk/sql/t-sql-programming/introduction-to-t-sql-window-functions/
October 16, 2019 at 4:49 pm
Certain windowed functions require a frame when there is an ORDER BY
clause. MAX()
is one of those functions. So MAX(t.end_date) OVER (PARTITION BY t.customer_id ORDER BY t.start_Date ) AS max_date
is short for MAX(t.end_date) OVER (PARTITION BY t.customer_id ORDER BY t.start_Date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS max_date
.
Some things to note here:
RANGE
is deterministic, but ROWS
is not necessarily. This is why RANGE
is the default.RANGE
will often write to disk, but ROWS
will never write to disk. This is why ROWS
is generally preferred.RANGE
will consider all tied records, ROWS
will "arbitrarily" break the tie. (See 1.)Chris' statement below is inaccurate in that it ignores ties.
Having an ORDER BY in the OVER clause of MAX(end_date) causes it to evaluate only records before or up-to the current record in the function, so the window is growing for each row within the customer_id.
The differences between RANGE
and ROWS
all come down to the difference in how they process ties. Since ROWS
arbitrarily breaks ties, it only ever needs to consider two values: the value from the current row and the value of the aggregate from the previous row. Since it only ever needs to consider two values, it never writes to disk. Since RANGE
considers all records within the tied group, the number of records processed to calculate the current aggregate value can be arbitrarily large, because you can have an arbitrary number of tied records. RANGE
will write to disk when the number of values being considered reaches a certain threshold. (I believe it's 10,000 values).
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 17, 2019 at 8:13 am
Thanks Chris and Drew for your detailed explanation - it does make sense.
I guess I was just so hung up with the idea of some strange grouping - I did not think of it being something like running max
Thanks
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply