A Window Function question...

Question

A Window Function question...

kevaburg

SSCoach

Points: 18131
More actions
September 28, 2017 at 8:38 am

#402540

Hi,
I am working with Window Functions and am trying to understand their behaviour a bit better. The database I am using is the TSQL2012 database from Itzik Ben-Gans book "Inside Microsoft SQL Server 2008: T-SQL Querying". I am running SQL Server 2017.
By the way......the book is awesome! 🙂
Sorry the code isn't copyable. My browser keeps crashing when I try to copy and paste code from SSMS.
This is my table:
create table sales.orders ( orderid bigint, custid int )
The table is filled with data whereby each customer has several orderids against their unique custid.
I want to find the highest order number for each custid. This query produces the required results:
The result set:
So far so good. Now I want to rewrite the query to take advantage of a Window Function:
To me the query looks good but I get the following result set:
I can't understand why I am getting a different result set from this query.
Can someone explain what I am doing wrong?
Regards,
Kev

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

John Mitchell-245523 SSC Guru Points: 148809 More actions · Answer 1

Kev

The second query has no WHERE clause, so you'll get one row in your result set for every row in the table. The first query filters out rows so that you are only left with the highest orderid per customer.

John

Luis Cazares SSC Guru Points: 183706 More actions · Answer 2

The first query is simulating an aggregate function. It's not returning all rows, just one for each customer. The second query will return one row per order.
You don't need group by when using window functions. But you would need to correct the group by and remove the OVER clause to get the correct result.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Thom A SSC Guru Points: 99130 More actions · Answer 3

I'm now sure why you want to use a Window Function for this task. Would a GROUP BY not achieve what you're looking for instead?
SELECT MAX(o.orderid) AS orderid, o.custid FROM Sales.Orders o GROUP BY o.custid ORDER BY o.custid ASC;
Using a Window Function, as you saw, will bring back lots of rows for a single customer because they have many orders; the Window function does not eliminate/group these, it simply let's you know the MAX order number for that customer on every row.

Edit:"Windows" Function, heh. :hehe:

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Eirikur Eiriksson SSC Guru Points: 182912 More actions · Answer 4

Quick thought, in the case of the aggregated results, there is no benefit of using Window functions.
😎

kevaburg SSCoach Points: 18131 More actions · Answer 5

Thom A - Thursday, September 28, 2017 9:07 AM
I'm now sure why you want to use a Window Function for this task. Would a GROUP BY not achieve what you're looking for instead?
SELECT MAX(o.orderid) AS orderid, o.custid FROM Sales.Orders o GROUP BY o.custid ORDER BY o.custid ASC;
Using a Window Function, as you saw, will bring back lots of rows for a single customer because they have many orders; the Window function does not eliminate/group these, it simply let's you know the MAX order number for that customer on every row.
Edit:"Windows" Function, heh. :hehe:

sgmunson SSC Guru Points: 110640 More actions · Answer 6

Much simpler query:
SELECT MAX(o.orderid) OVER(PARTITION BY o.custid ORDER BY o.orderid) AS LastOrderId, o.custid FROM Sales.Orders AS o

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

kevaburg SSCoach Points: 18131 More actions · Answer 7

So, this is how I solved it first of all....

selectdistinct(max(o.orderid) over (partition by o.custid order by o.custid)) as LastOrderId ,o.custid

from Sales.Orders o

order by o.custid asc

;

It works but when I look at the execution plan.....

I know that there are many of you out there that cringed when they saw the distinct but I wanted to see the difference between the Window Function and a normal GROUP BY:

SELECT

MAX(o.orderid) AS orderid,

o.custid

FROM Sales.Orders o

GROUP BY o.custid

ORDER BY o.custid ASC;

Although it is still a scan and not a seek, the improvement is clear and this was what I was looking for.

@Eirikur: The majority of examples I have seen have used aggregate functions and that is something I wanted to understand a bit more. I know that there are some pretty elegant Window Functions available (LEAD, LAG and so on) but if it can be used then I don't make the assumption that it should be used or that there isn't a better solution. That is the reason why I wanted to get a better look at this particular problem. But now my question: If there is no benefit to generating aggregates using Window Functions then where do the real benefits lie?

@sgmunson: This was my first solution (although I put a GROUP BY in it) but it didn't produce the required results; the highest orderid that a single custid has (see my very first post). The post resulted from my confusion about why it didn't produce the required results.

Thom A SSC Guru Points: 99130 More actions · Answer 8

kevaburg - Friday, September 29, 2017 4:22 AM

Although it is still a scan and not a seek, the improvement is clear and this was what I was looking for.

That's to be expected. There's no filtering on your query, so SQL Server has to scan the entire index to get all the results. On the other hand, if you add a WHERE then it changes to a seek. For example:
USE Sandbox; GO

CREATE TABLE orders (orderid int PRIMARY KEY CLUSTERED, custid int); GO CREATE INDEX idx_custid ON orders (custid ASC);

GO INSERT INTO orders VALUES (12,324), (17,324), (123,324), (435,324), (546,123), (153,123), (654,123), (14,22), (234,22), (54,65), (6573,6), (489,6), (694,6), (21,567), (23,567); GO SELECT * FROM orders;

SELECT * FROM orders WHERE orderid = 23;

SELECT MAX(orderid) AS orderid, custid FROM orders GROUP BY custid;

SELECT MAX(orderid) AS orderid, custid FROM orders WHERE custid > 100 GROUP BY custid; GO DROP TABLE orders; GO
(BE CAREFUL, I have a drop clause in there).

Notice, the first query (just the Select *) uses a Index Scan on the Clustered Index. The second, a Seek on the Clustered Index. The latter two use a Scan and Seek respectively as well on the NonClustered Index. This is due to the 2 queries with a Seek having a predicate on one of the columns. Without the predicate in both queries SQL Server has to get the result for every row; hence the Scan.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

kevaburg SSCoach Points: 18131 More actions · Answer 9

Thom A - Friday, September 29, 2017 4:42 AM

kevaburg - Friday, September 29, 2017 4:22 AM
Although it is still a scan and not a seek, the improvement is clear and this was what I was looking for.

That's to be expected. There's no filtering on your query, so SQL Server has to scan the entire index to get all the results. On the other hand, if you add a WHERE then it changes to a seek. For example:
USE Sandbox; GO

CREATE TABLE orders (orderid int PRIMARY KEY CLUSTERED, custid int); GO CREATE INDEX idx_custid ON orders (custid ASC);

GO INSERT INTO orders VALUES (12,324), (17,324), (123,324), (435,324), (546,123), (153,123), (654,123), (14,22), (234,22), (54,65), (6573,6), (489,6), (694,6), (21,567), (23,567); GO SELECT * FROM orders;

SELECT * FROM orders WHERE orderid = 23;

SELECT MAX(orderid) AS orderid, custid FROM orders GROUP BY custid;

SELECT MAX(orderid) AS orderid, custid FROM orders WHERE custid > 100 GROUP BY custid; GO DROP TABLE orders; GO
(BE CAREFUL, I have a drop clause in there).

Notice, the first query (just the Select *) uses a Index Scan on the Clustered Index. The second, a Seek on the Clustered Index. The latter two use a Scan and Seek respectively as well on the NonClustered Index. This is due to the 2 queries with a Seek having a predicate on one of the columns. Without the predicate in both queries SQL Server has to get the result for every row; hence the Scan.

OK, I understand that. With the production tables I have worked with it was rarely the case that we retrieved the entire table in a query so in producing a subset of data with a sensible index on it I would see a seek rather than a scan. Certainly something to think about when considering how much data is actually required.

whenriksen SSCarpal Tunnel Points: 4752 More actions · Answer 10

kevaburg - Thursday, September 28, 2017 8:38 AM
I can't understand why I am getting a different result set from this query.
Can someone explain what I am doing wrong?
Regards,
Kev

It may help you to see what you are doing wrong by including o.orderid in your columns selected. The query you tried essentialy says "for every unique combination of custid & orderid, get me the most recent orderid". Each unique combination is looking into a "window" for the most recent orderid for the customer.

Wes
(A solid design is always preferable to a creative workaround)