June 9, 2006 at 12:43 pm
I'm slogging through some really, really bad legacy SQL, trying to get it polished and ready for SQL Server 2005, and I'm about to go insane.
Can anyone think of any legitimate reason why a Group By clause would be included when there is no aggregate in the select?
Here's the full query (names changed to protect the innocent):
select the_id
from the_lot
group by the_id
June 9, 2006 at 12:58 pm
They were probably thinking there were dupes and wanted only a distinct list. (I have a bunch of Access users who are trying to write T-SQL, and they do this all the time.)
This would give the same thing:
select distinct the_id from the_lot
HTH,
Michelle
June 9, 2006 at 12:58 pm
Using GROUP BY without an aggregate causes the result set to behave like DISTINCT, and is faster.
N 56°04'39.16"
E 12°55'05.25"
June 9, 2006 at 1:04 pm
Yanno, I thought it might be something like that. So, I took a look at the query execution plan for each and they are exactly the same.
Odd.
June 9, 2006 at 1:32 pm
>>to behave like DISTINCT, and is faster.
Not in my experience, and not according to several postings by MVPs in the SQL Server newsgroup.
If I run a query plan on a query with SELECT DISTINCT, I see this in the step that eliminates dupes:
"Hash Match/Aggregate
Insert each input row into a hash table, grouping on the GROUP BY columns and evaluating aggregate expressions"
This is the execution plan even though the query has no GROUP BY and no aggregates. It matches what I've read elsewhere: DISTINCT and GROUP BY are evaluated the same way "under the hood", and there is no performance difference.
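For anyone who wants to check this on their own tables, here's a minimal sketch (using the table and column names from the original post) that captures the estimated plan text for both forms so they can be compared side by side:

-- SHOWPLAN_TEXT returns the estimated plan instead of executing the query;
-- it must be the only statement in its batch, hence the GO separators.
SET SHOWPLAN_TEXT ON
GO
SELECT DISTINCT the_id FROM the_lot
GO
SELECT the_id FROM the_lot GROUP BY the_id
GO
SET SHOWPLAN_TEXT OFF
GO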
So it comes down to personal preference or company standards. I like code to be self-documenting so I know wth I was doing when I have to maintain it a year from now. So if the intention is to remove duplicate rows, I use DISTINCT to make it clear what the intention of the code is, rather than scratching my head wondering if someone forgot to add an aggregate expression.
June 9, 2006 at 2:34 pm
In most DBMSs, a GROUP BY is a more efficient way to eliminate duplicate rows compared with the DISTINCT keyword. The reason for this is that a GROUP BY invokes the sort required to find the duplicates earlier in the processing of the query, while a DISTINCT applies the sort as the very last step (applied to the final result set). The sooner the duplicate rows are eliminated, the more efficiently the remainder of the processing on that result set can be performed.
N 56°04'39.16"
E 12°55'05.25"
June 9, 2006 at 8:11 pm
Not really... as previously pointed out, the execution plans are identical... and so are their execution times on a quiet non-networked server... here are my test results on a million-row test...
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
(676 row(s) affected)
1.813 Seconds duration for "DISTINCT"
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
(676 row(s) affected)
1.813 Seconds duration for "GROUP BY"
Here's the table I create to do such tests...
--===============================================================================================================
-- Setup
--===============================================================================================================
--===== Create and populate a million row test table to test all sorts of data with.
     -- (This is my standard test table and it can be added to if necessary)
 SELECT TOP 1000000
        IDENTITY(INT,1,1) AS RowNum,
        CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT) AS UserID,
        CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
      + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)) AS SomeValue, --Two letters randomly selected
        'A column for kicks' AS Kicks,
        'Still another column just for proofing' AS StillAnother,
        CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY) AS SomeNumber,
        CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME) AS ADate --(>=01/01/2000 <01/01/2010)
   INTO dbo.BigTest
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2
GO
--===== A table is not properly formed unless a Primary Key has been assigned
     -- (Other indexes may be added for testing purposes)
  ALTER TABLE dbo.BigTest
    ADD PRIMARY KEY NONCLUSTERED (RowNum)
GO
And, of course, here's the test code...
--===== Declare some duration variables
DECLARE @StartTime DATETIME
DECLARE @EndTime   DATETIME

--===============================================================================================================
-- Run the test on DISTINCT to find unique values
--===============================================================================================================
--===== Clear anything that might be cached
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

--===== Start the duration timer
    SET @StartTime = GETDATE()

--===== Run the test (make sure you are in "Grid" mode)
 SELECT DISTINCT SomeValue
   FROM dbo.BigTest WITH (NOLOCK)

--===== Stop the duration timer
    SET @EndTime = GETDATE()

--===== Display the duration in decimal seconds
  PRINT STR(DATEDIFF(ms,@StartTime,@EndTime)/1000.0,10,3) + ' Seconds duration for "DISTINCT"'

--===============================================================================================================
-- Run the test on GROUP BY to find unique values
--===============================================================================================================
--===== Clear anything that might be cached
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

--===== Start the duration timer
    SET @StartTime = GETDATE()

--===== Run the test (make sure you are in "Grid" mode)
 SELECT SomeValue
   FROM dbo.BigTest WITH (NOLOCK)
  GROUP BY SomeValue

--===== Stop the duration timer
    SET @EndTime = GETDATE()

--===== Display the duration in decimal seconds
  PRINT STR(DATEDIFF(ms,@StartTime,@EndTime)/1000.0,10,3) + ' Seconds duration for "GROUP BY"'
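As a footnote to the harness above: if you'd rather not roll your own timer, here's a sketch of the same comparison using the built-in SET STATISTICS TIME option (nothing assumed beyond the test table already created):

--===== Let the server report CPU and elapsed milliseconds per statement
     -- (the numbers show up on the Messages tab rather than via PRINT)
SET STATISTICS TIME ON

SELECT DISTINCT SomeValue FROM dbo.BigTest WITH (NOLOCK)

SELECT SomeValue FROM dbo.BigTest WITH (NOLOCK) GROUP BY SomeValue

SET STATISTICS TIME OFF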
Maybe it's just me but I'm thinking DISTINCT and GROUP BY are identical in performance and function when used to find unique items...
--Jeff Moden
Change is inevitable... Change for the better is not.
June 9, 2006 at 8:22 pm
I like code to be self-documenting so I know wth I was doing when I have to maintain it a year from now. So if the intention is to remove duplicate rows, I use DISTINCT to make it clear what the intention of the code is, rather than scratching my head wondering if someone forgot to add an aggregate expression.
ABSOLUTELY CONCUR WITH THAT! Well done!
--Jeff Moden
Change is inevitable... Change for the better is not.
June 10, 2006 at 4:15 am
Great script code for testing!
Is there a difference when there is an index on the table? Clustered vs. non-clustered? With or without the NOLOCK hint?
Running your test script without any modification, I got the following times (50 executions):
2.910-2.976 seconds duration for "DISTINCT" (min/max times)
2.833-2.914 seconds duration for "GROUP BY" (min/max times)
Adding a clustered index on SomeValue (50 executions):
1.093-1.155 seconds duration for "DISTINCT" (min/max times)
0.906-1.013 seconds duration for "GROUP BY" (min/max times)
Adding a non-clustered index on SomeValue (50 executions):
0.826-0.976 seconds duration for "DISTINCT" (min/max times)
0.906-0.991 seconds duration for "GROUP BY" (min/max times)
It seems to me that without any index or with a clustered index on SomeValue, GROUP BY is faster than DISTINCT.
ALL EXECUTIONS HAD EXACTLY THE SAME EXECUTION PLAN.
BUT
having a non-clustered index on SomeValue made DISTINCT faster than GROUP BY.
Adding/removing WITH (NOLOCK) had minimal impact; removing it just added 3-7 ms.
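For anyone reproducing the index variations, the DDL would look something like this (the index name is my placeholder, not from the original script; drop one variation before creating the other):

--===== Variation 1: clustered index on the test column
 CREATE CLUSTERED INDEX IX_BigTest_SomeValue
     ON dbo.BigTest (SomeValue)

--===== Variation 2: non-clustered index on the test column
     -- (drop the clustered one first: DROP INDEX BigTest.IX_BigTest_SomeValue)
 CREATE NONCLUSTERED INDEX IX_BigTest_SomeValue
     ON dbo.BigTest (SomeValue)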
Running MS SQL Server 2000 SP4 Developer Edition version 8.00.2039
Fujitsu Siemens E-series Lifebook
2.0 GHz P4 processor, 1 GB memory.
Windows XP SP2.
N 56°04'39.16"
E 12°55'05.25"
June 10, 2006 at 8:40 am
Wow! Nice job of testing! That's some awesome feedback, Peter.
It also goes to show that the machine can make a bit of a difference. My testing was done on the following (just for comparison purposes, folks)...
MS SQL Server 2000 SP3a Developer's Edition on Windows XP SP2+
1.8 GHz P5, 2 GB memory
IDE drives
Log and data files on the same drive (it's a home machine)
Thanks again for the feedback, Peter.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 10, 2006 at 9:30 am
You're welcome, Jeff. Too bad I don't have my old computer with me that had SQL2KSP3 installed to compare times with.
N 56°04'39.16"
E 12°55'05.25"
June 10, 2006 at 5:53 pm
Awesome information, guys!
Many, many thanks.
Just so you know, given the query and what it was doing, I changed it to DISTINCT. If nothing else, it will help the developer that comes after me.
June 10, 2006 at 10:58 pm
If nothing else, it will help the developer that comes after me.
I wish more folks programmed with that thought in mind. Thank you for the feedback, Pam.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 11, 2006 at 5:37 am
Jeff, any idea why DISTINCT behaved significantly differently in timing between the clustered and non-clustered index, while GROUP BY didn't change much?
My gut tells me that a clustered index should be faster, but it isn't.
I have reproduced the executions on a SQL2005 server today, with similar results.
N 56°04'39.16"
E 12°55'05.25"
June 11, 2006 at 3:38 pm
Nope, sorry... I'm afraid only the boys in Redmond know the answer to that one (the GROUP BY/DISTINCT time changes), although it does make one question the myth that clustered indexes are always faster...
So far as clustered indexes go, I've always been leery of them even in a low-transaction scenario... it seems to me that the query has to sift through the data pages to get at the index, especially on wide tables. There are exceptions, of course, but I've found that non-clustered indexes will, many times, outperform clustered indexes because non-clustered indexes contain only index information... more row info per 8192-byte page and fewer reads, I suppose. Lots of times, I'll force the primary key to be a non-clustered index for the sake of speed, depending on the majority of big queries that have to hit the table.
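One way to actually see that page-density difference on SQL 2000 is DBCC SHOWCONTIG, which reports pages scanned and average page density per index... a minimal sketch against the test table above, assuming the placeholder index name from the earlier post:

--===== Compare "Pages Scanned" and "Avg. Page Density" between the
     -- clustered and non-clustered builds of the same index
DBCC SHOWCONTIG ('BigTest', 'IX_BigTest_SomeValue')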
A great exception to my "use non-clustered primary keys" rule is a "Tally" or "Numbers" table which only has one column in it.
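For anyone who hasn't run across one, here's a minimal sketch of such a Tally table, built the same way as the test table above (the name, row count, and constraint name are just my conventions):

--===== Build a Tally table of sequential numbers 1 through 11000
 SELECT TOP 11000 IDENTITY(INT,1,1) AS N
   INTO dbo.Tally
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2

--===== The exception to my "non-clustered PK" rule... one narrow column,
     -- so a clustered primary key is the clear winner
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)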
The bottom line is testing means everything...
OK, Peter, duck... here they come
--Jeff Moden
Change is inevitable... Change for the better is not.