Performance of GROUP BY multiple columns

Question

Performance of GROUP BY multiple columns

Thomas Franz

Hall of Fame

Points: 3972
More actions
March 27, 2018 at 12:04 am

#336833

Comments posted to this topic are about the item Performance of GROUP BY multiple columns
God is real, unless declared integer.

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Carlo Romagnano SSC-Insane Points: 22854 More actions · Answer 1

In this only specific example, you are right, but I think that the correct answer is "Depend" on the order of the query because of the data cache.
The first and second query have the same execution plan, so the second query always runs better.
Try to change the order of execution of the queries and you'll see the difference.
The third query is missing over(PARTITION BY t.order_no)" for the COUNT. In any case it is the slowest one.

Thomas Franz Hall of Fame Points: 3972 More actions · Answer 2

You are right about the missing OVER() in the third query - sorry, my fault,

On the other side the execution plans for query one and two are not the same on my system, because it has to sort the second query by all columns in the GROUP BY, while it does not need this sort for the first query, regardless if I group by for the clustered index column(s) or some other random columns.

Data cache should not really matter, as I asked for the CPU costs (and executed the queries several times to have a hot cache)

God is real, unless declared integer.

Stewart "Arturius" Campbell SSC Guru Points: 72850 More actions · Answer 3

Nice one, despite the missing OVER(.....) in Method 3.
There are, however, instances where using windowing functions appear to be the best option to solve a specific issue, but the load is high, with heavy parallelism..

____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”

Aaron N. Cutshall SSCrazy Eights Points: 9094 More actions · Answer 4

Ignoring the missing OVER() clause, this is still a good example. I have seen way too many situations where DISTINCT was thrown into a query without regard to the consequences. What really kills me is a DISTINCT placed on an outer query when it was already used on an inner query -- totally unnecessary ignoring the fact that it shouldn't have been used in the first place. Whenever I see a query beginning with SELECT DISTINCT it tells me that the query is written poorly and DISTINCT is used to clean up the mess.

LinkedIn: https://www.linkedin.com/in/sqlrv
Website: https://www.sqlrv.com

Stewart "Arturius" Campbell SSC Guru Points: 72850 More actions · Answer 5

Aaron N. Cutshall - Tuesday, March 27, 2018 6:05 AM
Ignoring the missing OVER() clause, this is still a good example. I have seen way too many situations where DISTINCT was thrown into a query without regard to the consequences. What really kills me is a DISTINCT placed on an outer query when it was already used on an inner query -- totally unnecessary ignoring the fact that it shouldn't have been used in the first place. Whenever I see a query beginning with SELECT DISTINCT it tells me that the query is written poorly and DISTINCT is used to clean up the mess.

Concur in 99.999% of all cases...
The only case where i have found it necessary is where the data provided (usually by an external source, e.g. a web service) contains duplicates.
Even in those situations, i would rather use group by...

____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”

david.gugg SSCertifiable Points: 6214 More actions · Answer 6

I like this question, thanks for submitting it.

Be still, and know that I am God - Psalm 46:10

Aaron N. Cutshall SSCrazy Eights Points: 9094 More actions · Answer 7

Stewart "Arturius" Campbell - Tuesday, March 27, 2018 6:36 AM
Concur in 99.999% of all cases...
The only case where i have found it necessary is where the data provided (usually by an external source, e.g. a web service) contains duplicates.
Even in those situations, i would rather use group by...

True, but even then in those cases there are far more efficient methods of ignoring duplicates. In most cases duplicates can be identified on key fields. It's not very often that you have to compare each and every field like the DISTINCT operator does.

LinkedIn: https://www.linkedin.com/in/sqlrv
Website: https://www.sqlrv.com

Stewart "Arturius" Campbell SSC Guru Points: 72850 More actions · Answer 8

Aaron N. Cutshall - Tuesday, March 27, 2018 9:32 AM
True, but even then in those cases there are far more efficient methods of ignoring duplicates. In most cases duplicates can be identified on key fields. It's not very often that you have to compare each and every field like the DISTINCT operator does.

agreed...
recently had to work on a very badly performing proc. it was running a distinct in a subquery and the main query, both returning several million rows.
removing the distincts and adding group by's for the necessary columns drastically improved performance.

____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”

Jacob Wilkins One Orange Chip Points: 27976 More actions · Answer 9

A nice thought-provoking question!

The answer really does depend, though. Under some conditions the first two queries got the same query plan on my system (and performed equivalently), and under some circumstances they got different plans (sometimes with the performance difference intended by the question, sometimes not).

Still, it's interesting to think about and test. At least that the third option is substantially more costly seems clear; perhaps the question should have been "Which of these query forms should be avoided?" 🙂

Cheers!