DISTINCT and IN

Question

Post reply

DISTINCT and IN

Mike Levan

SSChampion

Points: 14201
More actions
February 14, 2009 at 12:24 pm

#201198

can DISTINCT keyword be replaced with " IN" for better performance in some scenarios?

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 1

your question is a bit vague. can you please provide a full explanation?

if you question is this:

field in (select ... from ...)

is functionally identical to

field in (select distinct ... from ...)

i'm a "less is better" person, so i'd omit the distinct since it causes a bit more work for the query engine to handle.

Rob Fisk SSCommitted Points: 1861 More actions · Answer 2

Yeah, they are from totally different parts of a query. DISTINCT is part of the select clause and IN part of the where clause.

Some performance points though could be...

If you can use your where clause or joins so that duplicate results are not returned in the first place then not needing to use distinct will be a performance boost.

Also EXISTS is generally much better performance wise than IN

E.G.

SELECT a.col1,a.col2

FROM tableA a

WHERE a.col1 IN (SELECT col3 FROM tableB)

Would be more efficient as

SELECT a.col1,a.col2

FROM tableA a

WHERE EXISTS (SELECT 1 FROM tableB where col3 = a.col1)

Basically in the first query all rows are queried from tableB then compared to tableA. In the second only matching columns are returned from tableB. Also note using SELECT 1 in this instance instead of select *. Since there is no need to reference the values returned there is a small saving in not having to retrieve a value from the table (especially not retrieving every single column)

[font="Arial"]_______________________________________________________
Change is inevitable... Except from a vending machine.[/font]

Mike Levan SSChampion Points: 14201 More actions · Answer 3

can i write your second query like this...

SELECT a.col1,a.col2
FROM tableA a
WHERE a.col1 IN (SELECT 1 FROM tableB b inner join tableA a
ON a.col1=b.col1
and a.col1=b.col1)

Bruce W Cassidy SSChampion Points: 12668 More actions · Answer 4

[font="Verdana"]Actually, if you replace DISTINCT with IN, because they are completely different parts of the query, the query will fail. However, it will fail really quickly. Hence, performance is improved. :P[/font]

Grant Fritchey SSC Guru Points: 398886 More actions · Answer 5

I think you're referring back to another set of posts where someone suggested that using the IN clause in the query eliminated the need for the DISTINCT clause. Am I right?

What they meant was that the use of the IN clause filtered your records such that you had a distinct list, based on the key values included in the IN clause. That made trying to filter the records through the use of the DISTINCT clause unnecessary.

DISTINCT is, more often than not, a crutch to cover up a bad query, bad structure, or bad data. In the vast majority of cases, you should be able to appropriately filter your data through the proper use of JOINS and information in the WHERE clause, such as the IN statement, to get a unique set of rows without resorting to the use of DISTINCT. DISTINCT generally uses some method of aggregation to eliminate duplicate rows and frequently is unable to do this using existing indexes on the tables, so it's very slow.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

jcrawf02 SSC-Insane Points: 24198 More actions · Answer 6

Mike Levan (2/15/2009)
can i write your second query like this...
SELECT a.col1,a.col2
FROM tableA a
WHERE a.col1 IN (SELECT 1 FROM tableB b inner join tableA a
ON a.col1=b.col1
and a.col1=b.col1)

Note that aside from Grant's comments, which seem to put a different spin on your question, rewriting the query above is not achieving the result you want for several reasons.

1 - the "second query" you are referring to was using EXISTS, and therefore used a JOIN to filter the results and used a generic 'SELECT 1', since it was just checking for existence. IN and EXISTS are not equivalent, but you can achieve the same results in the overall query by using EXISTS.

2 - Because IN and EXISTS are not equivalent, the 'SELECT 1' in the subquery will return '1' if it exists, and therefore using a.col1 IN (SELECT 1 . . .blah blah blah) will only work if a.col1 = 1

Hope that helps to explain the difference a little more.

---------------------------------------------------------
How best to post your question[/url]
How to post performance problems[/url]
Tally Table:What it is and how it replaces a loop[/url]

"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."