This is going to be a quick one…
I keep seeing forum code (and production code) that includes the DISTINCT in IN or EXISTS subqueries. The rationale is given either as a performance enhancement or as necessary for correct results.
Is it necessary or useful? Only one way to find out.
Let’s check for correct results first, because that can be done with nice small tables.
CREATE TABLE DistinctOuter (
ID INT
);
CREATE TABLE DistinctInner (
ID INT
);
INSERT INTO DistinctOuter
VALUES (1), (2), (3), (4), (5), (6), (7), (8)
INSERT INTO DistinctInner
VALUES (1), (2), (2), (2), (2), (4), (6), (7)
No difference there, results are the same. I’m not going to run the test for EXISTS because, if anyone remembers how EXISTS works (or remembers a blog post I wrote a while back), EXISTS doesn’t depend on what’s in the SELECT clause at all, it just looks for existence of rows, and DISTINCT cannot remove unique rows, just duplicates.
A look at the execution plan shows why there are no duplicate values returned in the first query (the one without DISTINCT).
That’s a semi-join there, not a complete join. A semi-join is a join that just checks for matches but doesn’t return rows from the second table. Since it’s just a check for existence, duplicate rows in the inner table are not going to make any difference to the results.
So that answers the correctness aspect, distinct is not necessary to get correct results. But does it improve performance by having it there? Or does it perhaps reduce the performance? Time for larger tables.
Stolen from my last look at EXISTS and IN:
CREATE TABLE PrimaryTable_Large (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
);
CREATE TABLE SecondaryTable_Large (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
);
GO
INSERT INTO PrimaryTable_Large (SomeColumn)
SELECT top 1000000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from msdb.sys.columns a cross join msdb.sys.columns b;
INSERT INTO SecondaryTable_Large (LookupColumn)
SELECT SomeColumn
FROM PrimaryTable_Large TABLESAMPLE (25 PERCENT);
Some row counts first.
- Total rows in PrimaryTable_Large: 1000000
- Total rows in SecondaryTable_Large: 256335
- Total distinct values in LookupColumn in SecondaryTable_Large: 10827
First test is without indexes on the lookup columns:
SELECT ID, SomeColumn FROM PrimaryTable_Large
WHERE SomeColumn IN (SELECT LookupColumn FROM SecondaryTable_Large)
SELECT ID, SomeColumn FROM PrimaryTable_Large
WHERE SomeColumn IN (SELECT DISTINCT LookupColumn FROM SecondaryTable_Large)
The reads are identical, which shouldn’t be a surprise as there’s no way with the current tables to run those queries without doing a full table scan.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘PrimaryTable_Large’. Scan count 1, logical reads 14548, physical reads 0.
Table ‘SecondaryTable_Large’. Scan count 1, logical reads 798, physical reads 0.
For durations and CPU, I’m going to run each 10 times and aggregate the results from the profiler T-SQL:BatchCompleted event. The results are just about identical.
- IN without DISTINCT: CPU 1.21 seconds, duration 12.2 seconds
- IN with DISTINCT: CPU 1.25 seconds, duration 11.9 seconds
Furthermore, the execution plans are identical. Something interesting to notice in this case is that the join is not a semi-join, it’s a complete join and to ensure that the complete join doesn’t return duplicate rows (which would be incorrect), there’s a hash match (aggregate) right before the join that’s removing duplicate rows from the inner resultset, and that’s present in both execution plans, when the distinct is specified and when it’s not.
One last question to answer – does the presence of indexes change anything?
CREATE INDEX idx_Primary
ON dbo.PrimaryTable_Large (SomeColumn)
CREATE INDEX idx_Secondary
ON dbo.SecondaryTable_Large (LookupColumn)
The execution plan has changed, in operators if not in general form. The hash join is replaced by a merge join (still a complete join, not a semi-join), the hash match (aggregate) has been replaced by a stream aggregate and the clustered index scans are now (nonclustered) index scans
The reads are still identical between the two, which should be no surprise at all. As for the durations:
- IN without DISTINCT: CPU 0.82 seconds, duration 10.1 seconds
- IN with DISTINCT: CPU 0.79 seconds, duration 10.5 seconds
Again so close that the small difference should be ignored.
So in conclusion, is there any need or use for DISTINCT in the subquery for an IN predicate? By all appearances, none whatsoever. The SQL query optimiser is smart enough to ignore the specified DISTINCT if it’s not necessary (as we saw in the first example) and to add an operator to remove duplicates if it is necessary (as we saw in the 2nd and 3rd examples), regardless of whether or not there’s a DISTINCT specified in the query.