October 21, 2015 at 1:26 pm
Hello. All examples of the EXIST operator I know of use correlated subqueries that could be also written as a join with no performance gain or loss. Can anyone give me an example of a subquery that can only be written using EXISTS?
Thank you!
October 21, 2015 at 1:44 pm
The simple example is if the subquery returns more than one row for the joining condition.
If you use a join, the results contain extra rows.
EXISTS will not create the extra rows.
You can certainly eliminate the extra rows with a GROUP BY or DISTINCT, but that almost certainly will cause a performance hit.
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
October 21, 2015 at 2:06 pm
Thanks for the reply. Sounds perfectly reasonable, except that when I compare the 2 queries, one written with the EXISTS operator and a correlated subquery and the other as a join, the execution plans and costs are identical.
October 21, 2015 at 2:15 pm
http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/
http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 21, 2015 at 2:18 pm
I think that it's always possible to rewrite an EXISTS query with an INNER JOIN or a CROSS APPLY.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 21, 2015 at 2:22 pm
Can you post some sample data and your queries?
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
October 21, 2015 at 2:39 pm
Thanks, Gail and Drew. Gail, the links don't directly discuss EXISTS vs join, but it looks like one can conclude what Drew is saying that there is really no difference in performance, is that what you would say, too? Unless, perhaps the data set I'm using is not big enough to produce a noticeable difference in performance?
October 21, 2015 at 2:43 pm
Michael, I'm using the AdventureWorks database and the following:
USE AdventureWorks
SELECT Name
FROM Production.Product p
WHERE EXISTS
(SELECT *
FROM Production.ProductSubcategory
WHERE ProductSubcategoryID = p.ProductSubcategoryID
AND Name = 'Wheels')
SELECT p.Name
FROM Production.Product p
JOIN Production.ProductSubcategory psc
ON psc.ProductSubcategoryID = p.ProductSubcategoryID
WHERE psc.Name = 'Wheels'
October 21, 2015 at 2:50 pm
peter 82125 (10/21/2015)
Thanks, Gail and Drew. Gail, the links don't directly discuss EXISTS vs join, but it looks like one can conclude what Drew is saying that there is really no difference in performance, is that what you would say, too? Unless, perhaps the data set I'm using is not big enough to produce a noticeable difference in performance?
Note, I did not say that there would be no difference in performance. I just said that they could be rewritten without using EXISTS. I've made assumptions about relative performance in the past and have learned to always test if I think that there might be a better performing alternative. I make no claims about the performance of tenuous queries.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
October 21, 2015 at 5:35 pm
Duly noted.
October 21, 2015 at 6:21 pm
It may be rather academic, but there is at least one case where EXISTS is nicer than a JOIN.
The potential (emphasis on potential) performance advantage that EXISTS has is the use of a semi join, which allows the subquery to effectively short-circuit when a match is found (on that note, the query plans you're seeing are likely not actually identical; while the physical join type might well be the same in both cases, the logical join type for the EXISTS query probably shows as a left semi join instead of an inner join).
If there are very few matches, then you may see very little difference in performance, since there won't be much use of short-circuiting when a match is found. If there are many, many matches, then you'll start to see a bigger difference.
Also, if the criterion for matching rows is an equality condition, then once you have enough rows, both the JOIN and the EXISTS will likely end up using hash joins and just scan each table once, again minimizing the difference.
If the criterion for matching rows is not an equality condition and there are many matches, then you can start to see a big difference between EXISTS and the JOIN. Going back to what I said about this being somewhat academic, it's true that matching rows on a condition other than equality is rather uncommon compared to matching rows on equality. Still, this is a case where EXISTS can significantly outperform a JOIN.
Below is some code to illustrate that (takes about a minute to run on my machine; if you have a beastly machine, it might finish much more quickly:-)):
--Of course don't run the FREEPROCCACHE on your production server.
--I'm just running it so my query stats come back more cleanly
--on my lab server
DBCC FREEPROCCACHE
CREATE TABLE #BlackList (Pattern varchar(30))
CREATE TABLE #Phrases (Phrase char(36))
--Populate a bunch of "phrases"
INSERT INTO #Phrases
SELECT TOP 5000 CAST(NEWID() AS char(36))
--Construct a bunch of rows
FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t1(n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t2(n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t3(n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t4(n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t5(n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t6(n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t7(n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t8(n)
CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t9(n)
--Populate a bunch of blacklisted patterns based on the phrases
--we just created
INSERT INTO #BlackList
SELECT TOP 1000 '%'+SUBSTRING(Phrase,4,5)+'%'
FROM #Phrases
INSERT INTO #BlackList
SELECT TOP 1000 '%'+SUBSTRING(Phrase,15,6)+'%'
FROM #Phrases
INSERT INTO #BlackList
SELECT TOP 1000 '%'+SUBSTRING(Phrase,23,3)+'%'
FROM #Phrases
--I like to use EXEC just to make sure the query text doesn't change
--just because I include a little extra white space or some such
--nonsense. That makes sure I keep the stats under the same entry
--in sys.dm_exec_query_stats.
--I also like to prefix with "--QueryTest:" so I can find the query
--stats easily and don't have to run SET STATISTICS or other
--observer-effect-laden things
--I'm also using MAXDOP 1 just to avoid any quirks from parallelism
--Ok, let's find out how many of our phrases run afoul of the blacklist
EXEC('--QueryTest: INNER JOIN
SELECT COUNT(Phrase)
FROM #Phrases Phr
INNER JOIN #Blacklist Pat
ON Phr.Phrase LIKE Pat.Pattern
OPTION (MAXDOP 1)')
EXEC('--QueryTest: EXISTS
SELECT COUNT(*) FROM #Phrases Phr
WHERE EXISTS (SELECT NULL FROM #Blacklist Pat WHERE Phr.Phrase LIKE Pat.Pattern)
OPTION (MAXDOP 1)')
--The INNER JOIN above gives us the wrong counts because some phrases
--match multiple patterns, so we have to add a DISTINCT to remedy that
EXEC('--QueryTest: DISTINCT/INNER JOIN
SELECT COUNT(DISTINCT Phrase)
FROM #Phrases Phr
INNER JOIN #Blacklist Pat
ON Phr.Phrase LIKE Pat.Pattern
OPTION (MAXDOP 1)')
--CROSS APPLY with TOP 1 is another option that can perform reasonably
--under the right circumstances. Here it performs similarly to EXISTS.
EXEC('--QueryTest: CROSS APPLY
SELECT COUNT(Phrase)
FROM #Phrases Phr
CROSS APPLY (SELECT TOP 1 Pattern FROM #Blacklist Pat WHERE Phr.Phrase LIKE Pat.Pattern) X
OPTION (MAXDOP 1)')
SELECT
qs.last_logical_reads,
qs.last_worker_time/1000 AS CPU_ms,
qs.last_elapsed_time/1000 AS Duration_ms,
qs.last_execution_time, qp.query_plan, qt.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(qs.plan_handle) qt
WHERE qt.text LIKE '--QueryTest%'
ORDER BY last_execution_time desc
DROP TABLE #Blacklist, #Phrases
You can play around with the ratio between Phrases and matching Blacklist patterns to see how that affects the performance difference.
Hopefully this helps! (Also, full disclosure, it's getting late and I'm severely under-caffeinated, so if something I've said seems ridiculous, it probably is, and I apologize in advance for that :-))
Cheers!
October 22, 2015 at 2:25 am
Here is a related blog by Aaron Bertrand: sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join
October 22, 2015 at 2:38 am
peter 82125 (10/21/2015)
Thanks, Gail and Drew. Gail, the links don't directly discuss EXISTS vs join, but it looks like one can conclude what Drew is saying that there is really no difference in performance, is that what you would say, too? Unless, perhaps the data set I'm using is not big enough to produce a noticeable difference in performance?
There's a roundup that I didn't post the link for which compares all three. Still, EXISTS compared to IN and IN to join, you can extrapolate the result easily enough
There is a performance difference, but it's minimal. It's not enough that I'd convert a join to EXISTS for performance reasons, but I will use EXISTS when I'm looking for whether there are rows or not and joins when I need to actually join.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 22, 2015 at 8:58 am
EXISTS is also safer than IN when you're dealing with a nullable column.
October 22, 2015 at 9:24 am
Thank you all so much for your informative replies! Jacob, you rock, caffeine or none! Really appreciate it 🙂
Viewing 15 posts - 1 through 15 (of 18 total)
You must be logged in to reply to this topic. Login to reply