During a recent webcast, someone asked if, in an EXISTS query, it is better to use “SELECT *”, “SELECT ‘1’”, or some other variation. I promised a followup blog post to explore this and try to come up with a definitive answer. This is that post.
All of the examples you see here are designed to work with the AdventureWorks database. If you don’t already have AdventureWorks, you can get it for free from here: http://msftdbprodsamples.codeplex.com/releases/view/55330
Let’s start by looking at four variations of the same EXISTS statement. All of these do the same thing – they each check to see if there are any rows in the SalesOrderHeader table with a NULL CurrencyRateID. If there are, the word “Exists” is printed.
IF EXISTS(SELECT * FROM Sales.SalesOrderHeader WHERE CurrencyRateID IS NULL) PRINT 'Exists'; IF EXISTS(SELECT 1 FROM Sales.SalesOrderHeader WHERE CurrencyRateID IS NULL) PRINT 'Exists'; IF EXISTS(SELECT 'x' FROM Sales.SalesOrderHeader WHERE CurrencyRateID IS NULL) PRINT 'Exists'; IF EXISTS(SELECT NULL FROM Sales.SalesOrderHeader WHERE CurrencyRateID IS NULL) PRINT 'Exists';
The question that we’re trying to answer is “which of these is the best way to do this?”. In this case, “best” means the cheapest, most efficient way. We have four different SQL statements that we want to compare in terms of relative expense. A good place to start is with the query plans. Let’s put these four statements together in Management Studio and look at the query plans:
All four of these statements produce the same query plan. In terms of relative expense, each of them is equal, weighing in at 25% of the overall batch expense. So far, it seems that there is no difference between these four statements – there is no “best” variation. Let’s dig deeper…
Let’s compare the IO generated by each of these four statements. By running the four statements, preceded by “SET STATISTICS IO ON”, we get the following output:
All four statements produced the same amount of IO – exactly the same. Again, there is no “best” variation. Let’s dig deeper…
Let’s compare the execution times of these four statements. By running all four statements, preceded by “SET STATISTICS TIME ON” and “SET STATISTICS IO OFF”, we get the following output:
After waiting 5ms for the batch to compile, all four statements complete in 0ms. For all intents and purposes, they are again identical in terms of performance. However, we know that there was some execution time involved. That measure of 0ms is not entirely accurate, the statements are just completing too quickly to be measured. Let’s try something different…
Let’s put each of these four statements inside a loop, and run each one a million times, recording the total elapsed time for each statement:
DECLARE @Loop INT; DECLARE @StartDateTime DATETIME; SET @StartDateTime = GETDATE(); SET @Loop = 0; WHILE @Loop < 1000000 SET @Loop = @Loop + CASE WHEN EXISTS(SELECT * FROM Sales.SalesOrderHeader WHERE CurrencyRateID IS NULL) THEN 1 ELSE 1 END; PRINT 'Elapsed time: ' + CAST(DATEDIFF(MS, @StartDateTime, GETDATE()) AS VARCHAR(25)) + ' ms'; GO DECLARE @Loop INT; DECLARE @StartDateTime DATETIME; SET @StartDateTime = GETDATE(); SET @Loop = 0; WHILE @Loop < 1000000 SET @Loop = @Loop + CASE WHEN EXISTS(SELECT 1 FROM Sales.SalesOrderHeader WHERE CurrencyRateID IS NULL) THEN 1 ELSE 1 END; PRINT 'Elapsed time: ' + CAST(DATEDIFF(MS, @StartDateTime, GETDATE()) AS VARCHAR(25)) + ' ms'; GO DECLARE @Loop INT; DECLARE @StartDateTime DATETIME; SET @StartDateTime = GETDATE(); SET @Loop = 0; WHILE @Loop < 1000000 SET @Loop = @Loop + CASE WHEN EXISTS(SELECT 'x' FROM Sales.SalesOrderHeader WHERE CurrencyRateID IS NULL) THEN 1 ELSE 1 END; PRINT 'Elapsed time: ' + CAST(DATEDIFF(MS, @StartDateTime, GETDATE()) AS VARCHAR(25)) + ' ms'; GO DECLARE @Loop INT; DECLARE @StartDateTime DATETIME; SET @StartDateTime = GETDATE(); SET @Loop = 0; WHILE @Loop < 1000000 SET @Loop = @Loop + CASE WHEN EXISTS(SELECT NULL FROM Sales.SalesOrderHeader WHERE CurrencyRateID IS NULL) THEN 1 ELSE 1 END; PRINT 'Elapsed time: ' + CAST(DATEDIFF(MS, @StartDateTime, GETDATE()) AS VARCHAR(25)) + ' ms'; GO
When I run this, it takes about 8 seconds to complete on my test machine, and produces the following output:
Once again, practically identical results. All four statements run 1 million times in the same elapsed time. Let’s repeat the exercise:
And once more, just to be sure:
Virtually identical results each time. That’s enough for me…
Identical query plans…
Identical levels of IO…
Identical execution times…
I’m comfortable saying that when used inside of an EXISTS statement, it makes no difference what you use as the column list for SELECT. You’re safe to use “SELECT *”, “SELECT 1”, “SELECT ‘x’”, “SELECT NULL”, or whatever other variation you choose. It makes no difference in terms of performance. There are bigger things to worry about…