SQL Decides to NOT Use My DISTINCT Clause. Why?

Question

Post reply

SQL Decides to NOT Use My DISTINCT Clause. Why?

souLTower

SSCarpal Tunnel

Points: 4026
More actions
July 23, 2010 at 12:12 pm

#225373

-- This demonstrates how SQL server's decision to not use your DISTINCT
-- clause can affect your performance greatly.
set nocount on
create table #test (partID int, partDescription varchar(100), otherID int)
-- Populate with a bunch of stuff for otherID = 100
insert into #test(partID, partDescription, otherID) values(0, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(1, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(2, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(3, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(4, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(5, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(6, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(7, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(8, 'My Part', 100)
insert into #test(partID, partDescription, otherID) values(9, 'My Part', 100)
-- That's 10.
insert into #test(partID, partDescription, otherID)
select A.partID, A.partDescription, A.otherID from #test A, #test B
insert into #test(partID, partDescription, otherID)
select A.partID, A.partDescription, A.otherID from #test A, #test B
-- select count(*) from #test 12210
-- Now some more data
-- Populate with a bunch of stuff for otherID = 300
-- Skip partID 5
insert into #test(partID, partDescription, otherID)
select partID, partDescription, 300D from #test where partID <> 5
-- Check the difference in execution plans
-- takes 5 seconds to run
-- Ignores inner distincts effectively joining 12K rows to 12K rows
select distinct A.partID from
(
select distinct partID from #test where otherID = 100
) A left join
(
select distinct partID from #test where otherID = 300
) B on A.partID = B.partID
WHERE B.partID is null
-- Respects inner distincts effectively joining 10 rows
-- takes 0ms to run
select A.partID from
(
select distinct partID from #test where otherID = 100
) A left join
(
select distinct partID from #test where otherID = 300
) B on A.partID = B.partID
WHERE B.partID is null
Is this a documented feature? Could this creep up if I used views instead of derived queries?
Thanks
God Bless
ST

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply

Sergiy SSC Guru Points: 110208 More actions · Answer 1

DISTINCT is not a silver bullet.

It's quite opposite.

I'd suggest not to use it in production code at all. Ever.

DISTINCT/GROUP BY is applied after all joins and WHERE clause.

If you have 2 or more DISTINCTS the topmost only will be left by optimiser because all others are useless.

Don't use construction LEFT JOIN + WHERE Key IS NULL.

Use WHERE NOT EXISTS instead.

Compare performance of this one:

select distinct A.partID from

(

select distinct partID from #test where otherID = 100

) A

WHERE NOT EXISTS ( select partID from #test B where otherID = 300 and A.partID = B.partID )

_____________
Code for TallyGenerator

souLTower SSCarpal Tunnel Points: 4026 More actions · Answer 2

I have become aware that my DISTINCT is ignored. My question is why? Is this by design? My real world data contains 100K+ records in each side of the search (represented by tables A and B in this example). The DISTINCT on each side was intended to reduce the amount of data the server had to search through to find the answer. Effectively it reduced it to roughly 100 records per side.

In production I tried the following:

WHERE NOT EXISTS (sub query)

Multiple distincts (as demonstrated in the example)

No distincts at all

Inserting into 2 temp tables the results from the left and right queries, then performing a left join on the temp tables.

I found that using distinct on the left and right queries AND the temp table strategy performed at nearly the same performance. The advantage of the temp tables was that I could use an additional distinct on the final query. This is necessary in my production environment.

Why I found this example troubling is that it demonstrates a condition that could happen in the real-world; where 2 different developers may write views using distinct (with good reason). A third developer could come in and use those views with an additional distinct clause and the performance would go through the floor. Similarly I can imagine this could also happen when combining multiple views with aggregate functions, such as SUM and AVG, and group by statements.

I saw nothing in books online stating that multiple distincts are ignored.

Thanks

ST