Group By with no aggregate: Why???

Question

Group By with no aggregate: Why???

Pam Brisjar

SSChampion

Points: 12094
More actions
June 9, 2006 at 12:43 pm

#117940

I'm slogging through some really, really bad legacy SQL, trying to get it polished and set up for 2005 and I'm about to go insane.
Can anyone think of any legitimate reason why a Group By clause would be included when there is no aggregate in the select?
Here's the full query (names changed to protect the innocent):
select

the_id
from

the_lot
group
by
the_id

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

mimorr SSCarpal Tunnel Points: 4850 More actions · Answer 1

They were probably thinking there were dupes and wanted only a distinct list. (I have a bunch of Access users that are trying to write T-SQL and they do this all of the time)

This would give the same thing:

select distinct the_id from the_lot

HTH,

Michelle

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 2

Using without an aggregate causes the resultset to behave like DISTINCT, and is faster.

N 56°04'39.16"
E 12°55'05.25"

Pam Brisjar SSChampion Points: 12094 More actions · Answer 3

Yanno, I thought it might be something like that. So, I took a look at the query execution plan for each and they are exactly the same.

Odd.

PW-201837 SSC-Insane Points: 20805 More actions · Answer 4

>>to behave like DISTINCT, and is faster.

Not in my experience, and not according to several postings by MVPs in the SQL Server newsgroup.

If I run a query plan on a query with SELECT DISTINCT, I see this in the step that eliminates dupes:

"Hash Match/Aggregate

Insert each input row into a hash table, grouping on the GROUP BY columns and evaluating aggregate expressions"

This is the execution plan, even though the query has no GROUP BY and no aggregates. This correlates to what I've read elswhere, DISTINCT and GROUP BY are evaluated the same way "under the hood" and there is no performance difference.

So it comes down to personal preference or company standards. I like code to be self-documenting so I know wth I was doing when I have to maintain it a year from now. So if the intention is to remove duplicate rows, I use DISTINCT to make it clear what the intention of the code is, rather than scratching my head wondering if someone forgot to add an aggregate expression.

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 5

In most DBMSs, a GROUP BY is a more efficient way to eliminate duplicate rows compared with the DISTINCT keyword. The reason for this is that a GROUP BY invokes the sort required to find the duplicates earlier in the processing of the query, while a DISTINCT applies the sort as the very last step (applied to the final result set). The sooner the duplicate rows are eliminated, the more efficiently the remainder of the processing on that result set can be performed.

N 56°04'39.16"
E 12°55'05.25"

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 6

Not really... as previously pointed out, the execution plans are identical... and so are their execution times on a quiet non-networked server... here's my test results on a million row test...

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

(676 row(s) affected)

1.813 Seconds duration for "DISTINCT

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

(676 row(s) affected)

1.813 Seconds duration for "GROUP BY

Here's the table I create to do such tests...

--===============================================================================================================
--      Setup
--===============================================================================================================

--===== Create and populate a million row test table to test all sorts of data with.
     -- (This is my standard test table and it can be added to if necessary)
 SELECT TOP 1000000
        IDENTITY(INT,1,1) AS RowNum,
        CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT) AS UserID,
        CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
      + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)) AS SomeValue, --Two letters randomly selected
        'A column for kicks' AS Kicks,
        'Still another column just for proofing' AS StillAnother,
        CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY) AS SomeNumber,
        CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME) AS ADate --(>=01/01/2000  <01/01/2010)
   INTO dbo.BigTest
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2
GO
--===== A table is not properly formed unless a Primary Key has been assigned
     -- (Other indexes may be added for testing purposes)
  ALTER TABLE dbo.BigTest
        ADD PRIMARY KEY NONCLUSTERED (RowNum)
GO

And, of course, here's the test code...

--===== Declare some duration variables
DECLARE @StartTime DATETIME
DECLARE @EndTime   DATETIME

--===============================================================================================================
--      Run the test on DISTINCT to find unique values
--===============================================================================================================
--===== Clear anything that might be cached
   DBCC DROPCLEANBUFFERS
   DBCC FREEPROCCACHE

--===== Start the duration timer
    SET @StartTime = GETDATE()

--===== Run the test (make sure you are in the "GRID MODE"
 SELECT DISTINCT SomeValue
   FROM dbo.BigTest WITH (NOLOCK)

--===== Stop the duration timer
    SET @EndTime = GETDATE()

--===== Display the duration in decimal seconds
  PRINT STR(DATEDIFF(ms,@StartTime,@EndTime)/1000.0,10,3) + ' Seconds duration for "DISTINCT'

--===============================================================================================================
--      Run the test on GROUP BY to find unique values
--===============================================================================================================
--===== Clear anything that might be cached
   DBCC DROPCLEANBUFFERS
   DBCC FREEPROCCACHE

--===== Start the duration timer
    SET @StartTime = GETDATE()

--===== Run the test (make sure you are in the "GRID MODE"
 SELECT SomeValue
   FROM dbo.BigTest WITH (NOLOCK)
  GROUP BY SomeValue

--===== Stop the duration timer
    SET @EndTime = GETDATE()

--===== Display the duration in decimal seconds
  PRINT STR(DATEDIFF(ms,@StartTime,@EndTime)/1000.0,10,3) + ' Seconds duration for "GROUP BY'

Maybe it's just me but I'm thinking DISTINCT and GROUP BY are identical in performance and function when used to find unique items...

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 7

I like code to be self-documenting so I know wth I was doing when I have to maintain it a year from now. So if the intention is to remove duplicate rows, I use DISTINCT to make it clear what the intention of the code is, rather than scratching my head wondering if someone forgot to add an aggregate expression.

ABSOLUTELY CONCUR WITH THAT! Well done!

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 8

Great script code for testing!

Is there a difference when having index on the table? Clustered/non-clustered? With/without NOLOCK hint?

Running your test-script without any modification, I got following times (50 executions)

2.910-2.976 seconds duration for "DISTINCT (min/max times)

2.833-2.914 seconds duration for "GROUP BY (min/max times)

Adding clustered index on SomeValue (50 executions)

1.093-1.155 seconds duration for "DISTINCT (min/max times)

0.906-1.013 seconds duration for "GROUP BY (min/max times)

Adding non-clustered index on SomeValue (50 executions)

0.826-0.976 seconds duration for "DISTINCT (min/max times)

0.906-0.991 seconds duration for "GROUP BY (min/max times)

It seems to me that without any index or with a clustered index on SomeValue, GROUP BY is faster than DISTINCT.

ALL EXECUTIONS HAD EXACTLY THE SAME IDENTICAL EXECUTION PLAN.

BUT

having a non-clustered index on SomeValue made DISTINCT faster than GROUP BY

Adding/removing WITH NOLOCK made minimal impact. Removing WITH NOLOCK just added 3-7 ms.

Running MS SQL Server 2000 SP4 Developer Edition version 8.00.2039

Fujitsu Siemens E-series Lifebook

2.0 GHz P4 processor, 1 GB memory.

Windows XP SP2.

N 56°04'39.16"
E 12°55'05.25"

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 9

Wow! Nice job of testing! That's some awesome feedback, Peter.

It also goes to show that the machine can make a bit of a difference, as well. My testing was done on the following (just for comparison purposes, folks)...

MS SQL Server 2000 SP3a Developer's Edition on Windows XP SP2+

1.8 GHz P5, 2 GB memory

IDE drives

Log and Data files on same drive (it's a home machine )

Thanks again for the feedback, Peter.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 10

You're welcome, Jeff. Too bad I don't have my old computer with me that had SQL2KSP3 installed to compare times with.

N 56°04'39.16"
E 12°55'05.25"

Pam Brisjar SSChampion Points: 12094 More actions · Answer 11

Awesome information, guys!

Many, many thanks.

Just so you know, given the query and what it was doing, I changed it to DISTINCT. If nothing else, it will help the developer that comes after me.

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 12

If nothing else, it will help the developer that comes after me.

I wish more folks programmed with that thought in mind. Thank you for the feedback, Pam.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 13

Jeff, any idea why DISTINCT behaved significantly different in times using either clustered/non-clustered index and why GROUP BY didn't change times much?

My gut tells my that a clustered index should be faster, but it isn't.

I have reproduced the executions on a SQL2005 server today, with similar results.

N 56°04'39.16"
E 12°55'05.25"

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 14

Nope, sorry... I'm afraid only the boys in Redmond know the answer to that one (Group By/Distinct time changes) although it does make one question the myth that clustered indexes are always faster ...

So far as clustered indexes go, I've always been leary of them even in a low transaction scenario... seems to me that the query has to sift through the data pages to get at the index especially on wide tables... there are exceptions, of course, but I've found that non-clustered indexes will, many times, outperform clustered indexes because non-clustered indexes contain only index information... more row info per 8192 byte page/ fewer reads, I suppose. Lot's of times, I'll force the primary key to be a non-clustered index for the sake of speed depending on the majority of big queries that have to hit the table.

A great exception to my "use non-clustered primary keys" rule is a "Tally" or "Numbers" table which only has one column in it.

The bottom line is testing means everything...

OK, Peter, duck... here they come

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)