DISTINCT and IN

  • can DISTINCT keyword be replaced with " IN" for better performance in some scenarios?

  • your question is a bit vague. can you please provide a full explanation?

    if you question is this:

    field in (select ... from ...)

    is functionally identical to

    field in (select distinct ... from ...)

    i'm a "less is better" person, so i'd omit the distinct since it causes a bit more work for the query engine to handle.

  • Yeah, they are from totally different parts of a query. DISTINCT is part of the select clause and IN part of the where clause.

    Some performance points though could be...

    If you can use your where clause or joins so that duplicate results are not returned in the first place then not needing to use distinct will be a performance boost.

    Also EXISTS is generally much better performance wise than IN

    E.G.

    SELECT a.col1,a.col2

    FROM tableA a

    WHERE a.col1 IN (SELECT col3 FROM tableB)

    Would be more efficient as

    SELECT a.col1,a.col2

    FROM tableA a

    WHERE EXISTS (SELECT 1 FROM tableB where col3 = a.col1)

    Basically in the first query all rows are queried from tableB then compared to tableA. In the second only matching columns are returned from tableB. Also note using SELECT 1 in this instance instead of select *. Since there is no need to reference the values returned there is a small saving in not having to retrieve a value from the table (especially not retrieving every single column)

    [font="Arial"]_______________________________________________________
    Change is inevitable... Except from a vending machine.[/font]

  • can i write your second query like this...

    SELECT a.col1,a.col2

    FROM tableA a

    WHERE a.col1 IN (SELECT 1 FROM tableB b inner join tableA a

    ON a.col1=b.col1

    and a.col1=b.col1)

  • [font="Verdana"]Actually, if you replace DISTINCT with IN, because they are completely different parts of the query, the query will fail. However, it will fail really quickly. Hence, performance is improved. :P[/font]

  • I think you're referring back to another set of posts where someone suggested that using the IN clause in the query eliminated the need for the DISTINCT clause. Am I right?

    What they meant was that the use of the IN clause filtered your records such that you had a distinct list, based on the key values included in the IN clause. That made trying to filter the records through the use of the DISTINCT clause unnecessary.

    DISTINCT is, more often than not, a crutch to cover up a bad query, bad structure, or bad data. In the vast majority of cases, you should be able to appropriately filter your data through the proper use of JOINS and information in the WHERE clause, such as the IN statement, to get a unique set of rows without resorting to the use of DISTINCT. DISTINCT generally uses some method of aggregation to eliminate duplicate rows and frequently is unable to do this using existing indexes on the tables, so it's very slow.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Mike Levan (2/15/2009)


    can i write your second query like this...

    SELECT a.col1,a.col2

    FROM tableA a

    WHERE a.col1 IN (SELECT 1 FROM tableB b inner join tableA a

    ON a.col1=b.col1

    and a.col1=b.col1)

    Note that aside from Grant's comments, which seem to put a different spin on your question, rewriting the query above is not achieving the result you want for several reasons.

    1 - the "second query" you are referring to was using EXISTS, and therefore used a JOIN to filter the results and used a generic 'SELECT 1', since it was just checking for existence. IN and EXISTS are not equivalent, but you can achieve the same results in the overall query by using EXISTS.

    2 - Because IN and EXISTS are not equivalent, the 'SELECT 1' in the subquery will return '1' if it exists, and therefore using a.col1 IN (SELECT 1 . . .blah blah blah) will only work if a.col1 = 1

    Hope that helps to explain the difference a little more.

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply