Distinct going slower than *

  • Grant Fritchey (2/20/2009)


    But that example compares one thing you shouldn't do, SELECT *, with another thing that has serious enough costs that it almost needs to be in the shouldn't do category, SELECT DISTINCT. Hardly a good comparison. Check out mine, where the number of rows are the same and a good index is available for either query to use.

    Grant, Cynical here. Is the above directed to me or the OP? I'm beginning to question everything I do for some reason.

    😀

  • forgive me for being blunt by why would anyone do a distinct that is going to return the same number of rows as the full SELECT?

    The whole point (at least to us in the application biz) of a distinct is to get an exact list of distinct values from one or two columns of the query. I never use them to "remove junk data" .... ala .... I am returning 6 identical rows and only want 1 of them .... if that was the case then I look at my query and see why I am getting duplicated rows. If it's in a table where I'm expecting duplicates then I'm probably going to be doing some form of summarization on certain elements of the query so it's going to get grouped by anyway.

  • Lynn Pettis (2/20/2009)


    Grant Fritchey (2/20/2009)


    But that example compares one thing you shouldn't do, SELECT *, with another thing that has serious enough costs that it almost needs to be in the shouldn't do category, SELECT DISTINCT. Hardly a good comparison. Check out mine, where the number of rows are the same and a good index is available for either query to use.

    Grant, Cynical here. Is the above directed to me or the OP? I'm beginning to question everything I do for some reason.

    😀

    The OP. Sorry. Your post snuck in there.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Following what Lynn is saying ...... I believe you all are looking at this solely from the role of a DBA who only cares about the processing time at a server level.

    That is definitely not the case. Only DBAs care about processing without delivery 😛

    If it takes half a second to process and 20 minutes to get to me then that, for all application purposes, is a 20 minute query.

    I need to know why a Distinct query is taking 74 more seconds to come back with results to me than a SELECT *

    Note: I know all the pitfalls of using * and I never use it except in hypotheticals like walking through problems like this on a forum or among people when it's easier to say/type than the whole list of columns.

  • Then you can't compare SELECT * FROM yourTable against SELECT DISTINCT yourColumn FROM yourTable. It's comparing apples to oranges; yea they are both fruits but they aren't the same.

  • Putts (2/20/2009)


    forgive me for being blunt by why would anyone do a distinct that is going to return the same number of rows as the full SELECT?

    The whole point (at least to us in the application biz) of a distinct is to get an exact list of distinct values from one or two columns of the query. I never use them to "remove junk data" .... ala .... I am returning 6 identical rows and only want 1 of them .... if that was the case then I look at my query and see why I am getting duplicated rows. If it's in a table where I'm expecting duplicates then I'm probably going to be doing some form of summarization on certain elements of the query so it's going to get grouped by anyway.

    Then, for give me for being blunt, why do you use DISTINCT at all. You should be able to pull data from the database (assuming good structure and good data) without resorting to DISTINCT. It is, almost without fail, used as a crutch because the structure is off, the data is off, or the people writing the query don't have the requisite TSQL knowledge to avoid duplicate data. Cleaning the structure, the data or the code removes the need for the crutch.

    You've seen what it does to your own query. Check the query plan yourself. We're not making stuff up here.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • A better comparision, then, would be SELECT myColumn FROM myTable to SELECT DISTINCT myColumn FROM myTable.

    Be sure to look at the actual execution plans of both of them.

  • Lynn Pettis (2/20/2009)


    A better comparision, then, would be SELECT myColumn FROM myTable to SELECT DISTINCT myColumn FROM myTable.

    Be sure to look at the actual execution plans of both of them.

    But even in that case, to his point, you're going to return X number of rows vs. Y number of rows.

    I'm still of the opinion that if you need to get a DISTINCT list out of the system, seperate from the normal storage, without the benefits of JOINS, WHERE clauses, CROSS APPLY, whatever, then your data is off or your structure is off.

    There are exceptions to that, absolutely, but from what you're saying, it's the rule and, as your own query shows you, that just ain't so.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Putts (2/20/2009)


    SELECT * completes in about a minute and returns 2.9 million rows

    SELECT DISTINCT completes in 0 seconds and returns 36 rows.

    Run that with STATISTICS TIME on and see how much of that is processing time and how much is network or display time.

    2.9 million rows is going to take much longer to traverse the network and be displayed than 36 rows. The 36 rows would have taken longer for SQL to process.

    If it takes half a second to process and 20 minutes to get to me then that, for all application purposes, is a 20 minute query.

    Why would you ever send so much data over the network that it would require more than a couple seconds?

    We ignore delivery time because, except in very rare circumstances, you should never be sending so much data back to the app that it takes a measurable amount of time.

    I need to know why a Distinct query is taking 74 more seconds to come back with results to me than a SELECT *

    Because, as Grant and I have both said more than once, doing a distinct requires SQL to do more work. It has to sort or hash the entire result set and eliminate duplicates. That takes processing time.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hey, I'm just trying to give him apples to compare to apples. I actually agree with you for the most part. I'm sure I can find some reason for doing the distinct if I look hard enough, but most of the examples I can thing of would really use group by because I'd be doing some kind of aggregate on the values (count, sum, max, something).

  • Lynn Pettis (2/20/2009)


    Hey, I'm just trying to give him apples to compare to apples. I actually agree with you for the most part. I'm sure I can find some reason for doing the distinct if I look hard enough, but most of the examples I can thing of would really use group by because I'd be doing some kind of aggregate on the values (count, sum, max, something).

    I'm typing at cross purposes again. Ignore me. The first line was for you. The remaining were for the OP. Sorry for the confusion.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Modified * to column list and performance of results does not increase. Not really sure why you demanded that for what we are discussing.

    My contention is simply that Distinct has almost always returned results back to me quicker than a full Select except in this one case and I'm trying to determine why.

    Attached new plan with column list

  • Putts (2/20/2009)


    My contention is simply that Distinct has almost always returned results back to me quicker than a full Select except in this one case and I'm trying to determine why.

    Because in this case the difference in data set sizes (and resultant transmission and display speeds) is not enough to hide the increased processing time from the distinct?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Did you compare that against the SELECT DISTINCT on a single column, or did you compare it to the SELECT DISTINCT of the column list?

    I looked at the execution plans, I have no idea. if you encapsulate the two queries in between SET STATISTICS IO ON; SET STATISTICS TIME ON and SET STATISTICS IO OFF; SET STATISTICS TIME OFF, what do you get?

  • If you have a million row table and you send select /Columns/ where /criteria/

    to the app it will send alot more data across the network but will process (depending on indexing) on the database quickly.

    If you then try to fill your dropdown with select distinct /Columns/ where /Criteria/ from the million row table the database will take a lot longer to process the query, but it will traverse the network to your app alot quicker since it is a smaller result set.

    On small datasets you probably do not notice the performance hit of the distinct.

    On a large dataset you will for all the reasons mentioned above.

    Your design should probably have some table with the distinct values you are trying to display and that is linked to your million row transaction table.

    Then you query against the normalized structure.

    if your execution plan is applying your where criteria after performing the distinct (table / index scan ) then it will really take a long time.

    You could wrap your query in a derived table and then distinct it for a short term solution that may work, but it is a flawed design.

    SELECT DISTINCT X. /Columns/ FROM

    ( SELECT /Columns/ WHERE /Criteria/ ) X

    I doubt this will change the execution plan however.

Viewing 15 posts - 16 through 30 (of 57 total)

You must be logged in to reply to this topic. Login to reply