Group By with no aggregate: Why???

  • I'm slogging through some really, really bad legacy SQL, trying to get it polished and set up for 2005 and I'm about to go insane.

    Can anyone think of any legitimate reason why a Group By clause would be included when there is no aggregate in the select?

    Here's the full query (names changed to protect the innocent):

    select

    the_id

    from

    the_lot

    group

    by

    the_id

  • They were probably thinking there were dupes and wanted only a distinct list. (I have a bunch of Access users that are trying to write T-SQL and they do this all of the time)

    This would give the same thing:

    select distinct the_id from the_lot

    HTH,



    Michelle

  • Using without an aggregate causes the resultset to behave like DISTINCT, and is faster.


    N 56°04'39.16"
    E 12°55'05.25"

  • Yanno, I thought it might be something like that.  So, I took a look at the query execution plan for each and they are exactly the same.

    Odd.

  • >>to behave like DISTINCT, and is faster.

    Not in my experience, and not according to several postings by MVPs in the SQL Server newsgroup.

    If I run a query plan on a query with SELECT DISTINCT, I see this in the step that eliminates dupes:

    "Hash Match/Aggregate

    Insert each input row into a hash table, grouping on the GROUP BY columns and evaluating aggregate expressions"

    This is the execution plan, even though the query has no GROUP BY and no aggregates. This correlates to what I've read elswhere, DISTINCT and GROUP BY are evaluated the same way "under the hood" and there is no performance difference.

    So it comes down to personal preference or company standards. I like code to be self-documenting so I know wth I was doing when I have to maintain it a year from now. So if the intention is to remove duplicate rows, I use DISTINCT to make it clear what the intention of the code is, rather than scratching my head wondering if someone forgot to add an aggregate expression.

     

     

  • In most DBMSs, a GROUP BY is a more efficient way to eliminate duplicate rows compared with the DISTINCT keyword. The reason for this is that a GROUP BY invokes the sort required to find the duplicates earlier in the processing of the query, while a DISTINCT applies the sort as the very last step (applied to the final result set). The sooner the duplicate rows are eliminated, the more efficiently the remainder of the processing on that result set can be performed.


    N 56°04'39.16"
    E 12°55'05.25"

  • Not really... as previously pointed out, the execution plans are identical... and so are their execution times on a quiet non-networked server... here's my test results on a million row test...

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    (676 row(s) affected)

         1.813 Seconds duration for "DISTINCT

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    (676 row(s) affected)

         1.813 Seconds duration for "GROUP BY

    Here's the table I create to do such tests...

    --===============================================================================================================
    --      Setup
    --===============================================================================================================
    --===== Create and populate a million row test table to test all sorts of data with.
         -- (This is my standard test table and it can be added to if necessary)
     SELECT TOP 1000000
            IDENTITY(INT,1,1) AS RowNum,
            CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT) AS UserID,
            CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
          + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)) AS SomeValue, --Two letters randomly selected
            'A column for kicks' AS Kicks,
            'Still another column just for proofing' AS StillAnother,
            CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY) AS SomeNumber,
            CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME) AS ADate --(>=01/01/2000  <01/01/2010)
       INTO dbo.BigTest
       FROM Master.dbo.SysColumns sc1,
            Master.dbo.SysColumns sc2
    GO
    --===== A table is not properly formed unless a Primary Key has been assigned
         -- (Other indexes may be added for testing purposes)
      ALTER TABLE dbo.BigTest
            ADD PRIMARY KEY NONCLUSTERED (RowNum)
    GO

    And, of course, here's the test code...

    --===== Declare some duration variables
    DECLARE @StartTime DATETIME
    DECLARE @EndTime   DATETIME
    --===============================================================================================================
    --      Run the test on DISTINCT to find unique values
    --===============================================================================================================
    --===== Clear anything that might be cached
       DBCC DROPCLEANBUFFERS
       DBCC FREEPROCCACHE
    --===== Start the duration timer
        SET @StartTime = GETDATE()
    --===== Run the test (make sure you are in the "GRID MODE"
     SELECT DISTINCT SomeValue
       FROM dbo.BigTest WITH (NOLOCK)
    --===== Stop the duration timer
        SET @EndTime = GETDATE()
    --===== Display the duration in decimal seconds
      PRINT STR(DATEDIFF(ms,@StartTime,@EndTime)/1000.0,10,3) + ' Seconds duration for "DISTINCT'
    --===============================================================================================================
    --      Run the test on GROUP BY to find unique values
    --===============================================================================================================
    --===== Clear anything that might be cached
       DBCC DROPCLEANBUFFERS
       DBCC FREEPROCCACHE
    --===== Start the duration timer
        SET @StartTime = GETDATE()
    --===== Run the test (make sure you are in the "GRID MODE"
     SELECT SomeValue
       FROM dbo.BigTest WITH (NOLOCK)
      GROUP BY SomeValue
    --===== Stop the duration timer
        SET @EndTime = GETDATE()
    --===== Display the duration in decimal seconds
      PRINT STR(DATEDIFF(ms,@StartTime,@EndTime)/1000.0,10,3) + ' Seconds duration for "GROUP BY'

    Maybe it's just me but I'm thinking DISTINCT and GROUP BY are identical in performance and function when used to find unique items...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • quoteI like code to be self-documenting so I know wth I was doing when I have to maintain it a year from now. So if the intention is to remove duplicate rows, I use DISTINCT to make it clear what the intention of the code is, rather than scratching my head wondering if someone forgot to add an aggregate expression.

    ABSOLUTELY CONCUR WITH THAT!  Well done!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Great script code for testing!

    Is there a difference when having index on the table? Clustered/non-clustered? With/without NOLOCK hint?

    Running your test-script without any modification, I got following times (50 executions)

         2.910-2.976 seconds duration for "DISTINCT (min/max times)

         2.833-2.914 seconds duration for "GROUP BY (min/max times)

    Adding clustered index on SomeValue (50 executions)

         1.093-1.155 seconds duration for "DISTINCT (min/max times)

         0.906-1.013 seconds duration for "GROUP BY (min/max times)

    Adding non-clustered index on SomeValue (50 executions)

         0.826-0.976 seconds duration for "DISTINCT (min/max times)

         0.906-0.991 seconds duration for "GROUP BY (min/max times)

    It seems to me that without any index or with a clustered index on SomeValue, GROUP BY is faster than DISTINCT.

    ALL EXECUTIONS HAD EXACTLY THE SAME IDENTICAL EXECUTION PLAN. 

    BUT

    having a non-clustered index on SomeValue made DISTINCT faster than GROUP BY

    Adding/removing WITH NOLOCK made minimal impact. Removing WITH NOLOCK just added 3-7 ms.

    Running MS SQL Server 2000 SP4 Developer Edition version 8.00.2039

    Fujitsu Siemens E-series Lifebook

    2.0 GHz P4 processor, 1 GB memory.

    Windows XP SP2.


    N 56°04'39.16"
    E 12°55'05.25"

  • Wow!  Nice job of testing!  That's some awesome feedback, Peter.

    It also goes to show that the machine can make a bit of a difference, as well.  My testing was done on the following (just for comparison purposes, folks)...

    MS SQL Server 2000 SP3a Developer's Edition on Windows XP SP2+

    1.8 GHz P5, 2 GB memory

    IDE drives

    Log and Data files on same drive (it's a home machine )

    Thanks again for the feedback, Peter.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You're welcome, Jeff. Too bad I don't have my old computer with me that had SQL2KSP3 installed to compare times with. 


    N 56°04'39.16"
    E 12°55'05.25"

  • Awesome information, guys!

    Many, many thanks.

    Just so you know, given the query and what it was doing, I changed it to DISTINCT.  If nothing else, it will help the developer that comes after me.

  • quoteIf nothing else, it will help the developer that comes after me.

    I wish more folks programmed with that thought in mind.  Thank you for the feedback, Pam.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, any idea why DISTINCT behaved significantly different in times using either clustered/non-clustered index and why GROUP BY didn't change times much?

    My gut tells my that a clustered index should be faster, but it isn't.

    I have reproduced the executions on a SQL2005 server today, with similar results.


    N 56°04'39.16"
    E 12°55'05.25"

  • Nope, sorry... I'm afraid only the boys in Redmond know the answer to that one (Group By/Distinct time changes) although it does make one question the myth that clustered indexes are always faster ...

    So far as clustered indexes go, I've always been leary of them even in a low transaction scenario... seems to me that the query has to sift through the data pages to get at the index especially on wide tables... there are exceptions, of course, but I've found that non-clustered indexes will, many times, outperform clustered indexes because non-clustered indexes contain only index information... more row info per 8192 byte page/ fewer reads, I suppose.  Lot's of times, I'll force the primary key to be a non-clustered index for the sake of speed depending on the majority of big queries that have to hit the table.

    A great exception to my "use non-clustered primary keys" rule is a "Tally" or "Numbers" table which only has one column in it.

    The bottom line is testing means everything...

    OK, Peter, duck... here they come

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply