May 18, 2010 at 2:21 pm
That is exactly my question as well. More fields usually means more time to execute, as far as I knew. But it's the opposite in this case!
May 18, 2010 at 4:53 pm
Just curious... how many rows do the tables actually have in them? The fact that DISTINCT is being used to overcome some obviously duplicated output leads me to believe there may be what I refer to as an "accidental cross" join better known as a "many-to-many" join... and they can absolutely crush performance.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 19, 2010 at 5:16 am
I freely admit I'm not smart enough to always figure out or understand why the optimizer did something. All I can say is what I see. The place where the optimizer does the aggregation is the main difference between these two operations. One of them deals with fewer rows for less time than the other and that looks like where the performance bottleneck is occurring. Personally, I'd concentrate on eliminating the DISTINCT clause. I'm with Jeff. I think that's the problem.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
May 19, 2010 at 6:33 am
Grant Fritchey (5/18/2010)
...while the other does a completely useless sort...
That sort on SchID is an optimisation for sequential I/O.
Look at the loop join after the sort operator - the clustered index seek on Events.
Presenting the driving rows in SchID order to the clustered index seek results in sequential I/O rather than the random I/O that would result if the rows were presented in Booking.LDate order (as they would be without the sort).
There's a lot more to say about this query, but I wanted to point the above out separately.
Paul
May 19, 2010 at 6:49 am
Paul White NZ (5/19/2010)
Grant Fritchey (5/18/2010)
...while the other does a completely useless sort...That sort on SchID is an optimisation for sequential I/O.
Look at the loop join after the sort operator - the clustered index seek on Events.
Presenting the driving rows in SchID order to the clustered index seek results in sequential I/O rather than the random I/O that would result if the rows were presented in Booking.LDate order (as they would be without the sort).
There's a lot more to say about this query, but I wanted to point the above out separately.
Paul
Thanks Paul. I live for your regular corrections of my statements.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
May 19, 2010 at 8:57 am
Wow, this is good stuff. Thanks to everyone for looking at this and offering suggestions! I knew SSC was the best place to go with my SQL questions...
May 19, 2010 at 9:45 am
tacy.highland (5/18/2010)
Same exact queries, only Query1 runs 1 second and has 2 columns which happen to be primary keys on the Events table, Query2 runs 20 seconds and does not include those 2 columns. Everything else is the same. Is this just a strange occurrence or does it make any kind of sense?
It makes sense 🙂
If you include the primary key column(s) for a table in the DISTINCT list, the query optimiser knows that it is pointless to do the DISTINCT on any columns that originate from that table; every row is already guaranteed to be unique! The optimiser can use uniqueness information like this wherever it comes from: a primary key, unique index, or unique constraint.
When you include EvId and SchId, the optimiser knows that every row from the Events table will already be DISTINCT. The same goes for the Clients table: you are SELECTing ClientID, which is presumably identifies each client row uniquely too.
So, the final DISTINCT only logically applies to columns coming from the Booking table. The optimiser can see that only two columns are needed from that table: LDate and SchID. LDate doesn't need DISTINCTing either since the query specifies LDate = '20100517' on both sides of the OR condition in the WHERE clause. That just leaves SchId.
The result of all this is that the complex-looking DISTINCT boils down to a simple GROUP BY on the SchId values that originate from the Booking table. The optimiser 'pushes' the GROUP BY on SchId as far down the plan as it can, to maximise the benefits of reducing the intermediate row count from that point forward. The rows from the index seek and key lookup on the Booking table do not come in SchId order, so the optimiser chooses a Hash Aggregate to do the GROUP BY rather than a sort plus Stream Aggregate.
Notice that although the sort would benefit the following seek on Events as I mentioned before, the optimiser estimates that the GROUP BY operation will reduce the row count enough to make that an inefficient optimisation: the sort would cost more time than it would save by converting random I/O to sequential I/O.
Paul
May 19, 2010 at 10:06 am
On the question of statistics:
These seem pretty good to me, in general, with one exception: the optimiser does not have good information about the correlation of SchId with LDate. It makes an educated guess at cardinality based on the separate single-column statistics available on the Booking table, but single-column statistics cannot show the tight correlation between the two columns. This results in a poor estimate of the number of rows that will be produced by the Hash Aggregate, which affects everything after that point in the plan.
One solution would be to create multi-column statistics on (LDate, SchId) on the Booking table, but I think you should go further and replace the idx_Booking_Date index with one on LDate and SchId: While you could simply add SchId as an INCLUDE column, that would not produce the multi-column statistics we need, and a seek on a value for LDate would not produce rows in SchId order (which can be useful as I mentioned earlier). For those two reasons, I would key that index on both LDate and SchId.
Paul
May 19, 2010 at 10:10 am
Thank you Paul! Very interesting and very well explained. It seems my suspicions were justified. I guess it does make some kind of sense to have the PKs in the select, so I'll have to keep this in mind when writing queries in the future. Typically, I had just been taught not to include fields you don't use but it's not quite that simple.
A follow up then: Would this apply in cases without a DISTINCT? Or is the gain only found when using the PKs with the DISTINCT?
May 19, 2010 at 10:48 am
tacy.highland (5/19/2010)
Thank you Paul! Very interesting and very well explained. It seems my suspicions were justified. I guess it does make some kind of sense to have the PKs in the select, so I'll have to keep this in mind when writing queries in the future. Typically, I had just been taught not to include fields you don't use but it's not quite that simple.
It's important to get the correct message from all this. I explained how the optimiser was able to rewrite your query to improve its performance - and that is quite different from saying that you should write queries with that aim in mind 😉
The optimiser is able to do some very clever stuff - but you should often take that as a hint that there is a better way to write the query - you must not rely on optimiser rewrites!
Notice that when the execution plan matched the written form of the query quite closely (with the DISTINCT right at the end) performance was very poor. The optimised query that did the grouping operation earlier performed much better. DISTINCT is often the sign of a sub-optimal query, as Jeff mentioned.
Remember that a DISTINCT in the SELECT clause is exactly the same as adding a GROUP BY on every (non-aggregated) column in the SELECT list. Your DISTINCT is the same as:
GROUP BY
dbo.Clients.ClientId,
dbo.Clients.ClientCode,
dbo.Clients.LastName,
dbo.Clients.FirstName,
dbo.Clients.MiddleName,
dbo.Events.SchId,
dbo.booking.ldate,
dbo.Events.EvId,
dbo.Events.EvChildOrder,
dbo.Events.EvParent,
dbo.Events.EvClass,
dbo.Events.SchTime,
dbo.Events.SchEarly,
dbo.Events.SchLate,
dbo.Events.BookingId,
dbo.Events.AddrName,
dbo.Events.ActualArriveTime,
dbo.Events.ActualDepartTime,
dbo.Events.DirectDistance,
dbo.Clients.Alert,
dbo.Events.AddressSummary,
dbo.Events.ReqLate
It's really crucial that you understand that is exactly the same as adding the DISTINCT. If your query had that clause, you might suspect there was a problem with the query rather than SQL Server. That is the reason many people dislike the DISTINCT keyword.
What you should do here is to rewrite the query in a form closer to the execution plan that ran quickly - doing the GROUP BY yourself on the Booking table, and referencing the result as a derived table or common table expression. If you are not sure what I mean, please ask and I will include some code to make it clearer.
What you have been taught about not including columns you don't need is correct. 😎
Paul
May 19, 2010 at 11:04 am
Demonstrating the derived table:
SELECT C.ClientId,
C.ClientCode,
C.LastName
C.FirstName,
RIGHT('0' + LTRIM(STR(E.SchTime/3600)),2)+RIGHT('0' + LTRIM(STR(E.SchTime%3600)/60),2) AS SchTime,
E.SchLate,
E.AddrName,
C.Alert,
RIGHT('0' + LTRIM(STR(E.ReqLate/3600)),2)+RIGHT('0' + LTRIM(STR(E.ReqLate%3600)/60),2) AS Reqlate
(E.SchLate - E.ActualArriveTime) / 60 AS LatePick,
(E.ReqLate - E.ActualArriveTime) / 60 AS LateAppt
FROM (
-- Derived table: grouping early!
SELECT LDate, SchId
FROM dbo.Booking
WHERE LDate = '20100517'
GROUP BY
LDate, SchId
-- End dervied table
) B
JOIN dbo.Events E ON E.SchId = B.SchId
JOIN dbo.Clients C ON C.ClientId = E.ClientId
WHERE C.Alert = 1
AND (
(
E.SchLate <> - 1
AND E.SchLate - E.ActualArriveTime < 0
)
OR
(
E.ReqLate <> - 1
AND E.ReqLate - E.ActualArriveTime < 0
)
)
ORDER BY
C.LastName ASC;
May 19, 2010 at 11:14 am
Ok, I definitely better understand the use of DISTINCT and how it really is GROUP BY. I guess I never really thought of it that way before, so thank you for that explanation. Thank you for the code as well, that definitely helped me understand what you were referencing a little better. So, tell me if I'm getting this right: I'd use the Booking/ldate as a derived table due to the fact that the WHERE clause has to parse out the information by date again in the OR statement, right? It's only pulling up the booking records for that specific date once, via that table, rather than having to reference the records in that table by date multiple times. (did i get that right?)
(This is good stuff! Love learning to do this better!)
May 19, 2010 at 11:38 am
tacy.highland (5/19/2010)
So, tell me if I'm getting this right: I'd use the Booking/ldate as a derived table due to the fact that the WHERE clause has to parse out the information by date again in the OR statement, right? It's only pulling up the booking records for that specific date once, via that table, rather than having to reference the records in that table by date multiple times.
Almost. The point is to remove the duplicate values for SchId before the join to the Events table.
If you run the code in the derived table on its own:
SELECT LDate, SchId
FROM dbo.Booking
WHERE LDate = '20100517'
GROUP BY
LDate, SchId;
You should find it produces two rows. These two rows then go on to join with the Events table.
If you run:
SELECT LDate, SchId
FROM dbo.Booking
WHERE LDate = '20100517';
It should produce 4,835 rows.
The performance benefits come from cutting the duplicates out as soon as possible, rather than doing all the joins and relying on a DISTINCT (or equivalent GROUP BY) at the end. Does that make sense?
I often encourage people to write queries step-by-step like this, adding one table at a time, to get a fuller appreciation for what is 'really' happening.
Paul
May 19, 2010 at 11:52 am
Excellent! Thank you! It's making more and more sense. Yet, there's so much more for me to learn...Since I'm picking up so many great things from you, I wonder, do you have any really good SQL books you could recommend?
I'm definitely going to be taking a closer look at this.
Thanks again for all your time!
May 19, 2010 at 12:00 pm
tacy.highland (5/19/2010)
Excellent! Thank you! It's making more and more sense. Yet, there's so much more for me to learn...Since I'm picking up so many great things from you, I wonder, do you have any really good SQL books you could recommend?
It does depend on your current level and the style of book you enjoy, but a couple of my favourites are:
http://www.solidq.com/insidetsql/books/tsqlfund2008/ (Beginner -> Intermediate)
http://www.solidq.com/insidetsql/books/insidetsql2008/ (Intermediate -> Advanced)
I'm definitely going to be taking a closer look at this.
Thanks again for all your time!
You're welcome.
Viewing 15 posts - 16 through 29 (of 29 total)
You must be logged in to reply to this topic. Login to reply