NOT IN vs. Not Equal (<>)
A coworker recently asked me which was "more efficient - a bunch of <> or a NOT IN clause?" The answer, I assumed, like almost all things relating to databases, is "it depends." The nice thing about that answer is that it not only camouflages my vast stores of ignorance, it is also quite often true.
This time, however, I thought I'd do some investigation to see if, in fact, we should prefer one method over the other in this case. The answer this time was a bit unexpected. It turns out that they are actually the same query, and it should make absolutely no difference. This is because SQL is a declarative language, meaning: you tell the computer what you want, not how to get it [1].
The query engine takes both of these queries and performs them with the exact same sequence of events. In actuality, I did find that one of the queries typically outperformed the other. I also discovered several other methods (written by people much smarter than myself) that were significantly quicker than either of the two methods I set out to test (the quickest method I tested was a full 35% faster than the quicker of the original two).
With all that in mind, let's set off to find out if NOT IN is quicker than a "bunch of <>" and look at a few methods that improve on their performance. The original question was accompanied by a snippet of a WHERE clause evaluating the inequality of a column and four integer values:
s.Status_SV <> 214 and s.Status_SV <> 215 and s.Status_SV <> 216 and s.Status_SV <> 217
Now that we know we're dealing with integer data, we'll set up a table with an integer column, then we'll set up a timing and iteration framework for tracking our query times over multiple executions. Just to make our queries lengthy enough to be nicely measurable, I arbitrarily chose to insert 1,000,000 rows. The test results were less conclusive at 10,000 rows (with one notable exception), but the trends we see at 1,000,000 rows are clearly discernible by the time we have 100,000 rows.
Let's set up our table:
PRINT 'Creating sample table...'

CREATE TABLE [dbo].[tbl_IN_VS_AND] (
    filterCriterion_sv int identity(1,1) PRIMARY KEY CLUSTERED NOT NULL
)

DECLARE @i int
SET @i = 0

PRINT 'Populating sample table (1,000,000 rows. This could take several minutes)...'

WHILE @i < 1000000
BEGIN
    INSERT dbo.tbl_IN_VS_AND DEFAULT VALUES
    SET @i = @i + 1
END
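As an aside, if the row-by-row loop above proves too slow on your hardware, a set-based population is usually much faster. The following is a sketch of my own and was not part of the original test run; it assumes SQL Server 2005 or later, cross-joins a catalog view with itself just to generate enough rows, and needs IDENTITY_INSERT because the table's only column is an identity:

```sql
-- Sketch: set-based population (assumption, not part of the original tests).
-- The identity column must be supplied explicitly, hence IDENTITY_INSERT.
SET IDENTITY_INSERT dbo.tbl_IN_VS_AND ON

INSERT dbo.tbl_IN_VS_AND (filterCriterion_sv)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns AS a     -- catalog views are used here only as a
CROSS JOIN sys.all_columns AS b  -- convenient source of many rows

SET IDENTITY_INSERT dbo.tbl_IN_VS_AND OFF
```

Any sufficiently large row source would do in place of the catalog views; the point is simply to replace a million single-row inserts with one set-based statement.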
And our execution framework:
SET NOCOUNT ON

DECLARE @startTime datetime, @endTime datetime
DECLARE @results int, @iterations int
DECLARE @i int
SET @iterations = 100

---- setup is finished. let's run some tests

PRINT 'Beginning test run...'
SET @i = 0
SET @startTime = getDate()

WHILE @i < @iterations
BEGIN
    /* test queries go within this loop */
    SET @i = @i + 1
END

SET @endTime = getDate()
PRINT 'Elapsed Time: ' + cast(datediff(ms, @startTime, @endTime) AS varchar) + ' ms'
Now we'll write two queries to exercise our original question (NOT IN vs. <>). We'll start with the NOT IN query, then follow up with the AND <> query:
SELECT @results = count(filterCriterion_sv)
FROM tbl_IN_VS_AND
WHERE filterCriterion_sv NOT IN (214, 215, 216, 217)

SELECT @results = count(filterCriterion_sv)
FROM tbl_IN_VS_AND
WHERE filterCriterion_sv <> 214
    AND filterCriterion_sv <> 215
    AND filterCriterion_sv <> 216
    AND filterCriterion_sv <> 217
The NOT IN() certainly wins for conciseness. Before running a lot of iterations, let's look at snippets of the query plan text so we know SQL is doing the same thing under the covers for each query.
NOT IN():

|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1005],0)))
     |--Stream Aggregate(DEFINE:([globalagg1005]=SUM([partialagg1004])))
          |--Parallelism(Gather Streams)
               |--Stream Aggregate(DEFINE:([partialagg1004]=Count(*)))
                    |--Clustered Index Scan(OBJECT:([master].[dbo].[tbl_IN_VS_AND].[PK__tbl_IN_VS_AND__3B4BBA2E]), WHERE:([master].[dbo].[tbl_IN_VS_AND].[filterCriterion_sv]<>(214) AND [master].[dbo].[tbl_IN_VS_AND].[filterCriterion_sv]<>(215) AND [
AND <>:
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1005],0)))
     |--Stream Aggregate(DEFINE:([globalagg1005]=SUM([partialagg1004])))
          |--Parallelism(Gather Streams)
               |--Stream Aggregate(DEFINE:([partialagg1004]=Count(*)))
                    |--Clustered Index Scan(OBJECT:([master].[dbo].[tbl_IN_VS_AND].[PK__tbl_IN_VS_AND__3B4BBA2E]), WHERE:([master].[dbo].[tbl_IN_VS_AND].[filterCriterion_sv]<>(214) AND [master].[dbo].[tbl_IN_VS_AND].[filterCriterion_sv]<>(215) AND [
It turns out that SQL likes the AND <> so much, it converts the NOT IN() clause into a series of AND <> clauses. After 100 executions of each query, it seems that execution times for the AND <> query tend to be lower than those for the NOT IN query (conversion overhead, maybe?):
Beginning first test run...
"NOT IN" ET: 46170 ms

Beginning second test run...
"AND <>" ET: 42326 ms
I ran the same series of tests on another occasion and the NOT IN query consistently outperformed the AND <> query. The results regularly go back and forth, much like the heads and tails of a coin toss. So, I have managed to convince myself that, despite the two execution times listed above, these two queries are, indeed, the same query as far as SQL Server is concerned -- at least on this day, on this server, for these queries (I still gravitate toward the comforting ambiguity of "it depends").
I mentioned earlier that there were several queries that outperform our basic AND <> and NOT IN queries (on this server on this day). Let's take a look at some of those queries and their execution results. The first alternative technique doesn't use a WHERE clause to filter out our integer values. It places the integer values into a UNION query, does a LEFT OUTER JOIN against that, and keeps only the rows with no match. Here is what that query looks like:
SELECT @results = count(filterCriterion_sv)
FROM tbl_IN_VS_AND
LEFT OUTER JOIN (
    SELECT 214 AS filterValue_val
    UNION SELECT 215
    UNION SELECT 216
    UNION SELECT 217
) AS tbl ON tbl_IN_VS_AND.filterCriterion_sv = tbl.filterValue_val
WHERE tbl.filterValue_val IS NULL
It definitely feels odd, placing things you would normally put in a WHERE clause into a derived table and then looking for absent values, but the performance benefit gives us a compelling reason to consider doing this. On this test run of 100 executions, this odd query consistently outperformed the quicker of our original two queries by about 19%:
Beginning fourth test run...
"derived UNION table LEFT OUTER JOIN" ET: 34360 ms
Our last query happened to be our best performing (on this server, on this day). Like the previous query, this one uses a derived table. However, it takes it one step further and nests that inside a NOT EXISTS(). Let's take a look at it:
SELECT @results = count(filterCriterion_sv)
FROM tbl_IN_VS_AND
WHERE NOT EXISTS (
    SELECT *
    FROM (
        SELECT 214 AS filterValue_val
        UNION ALL SELECT 215
        UNION ALL SELECT 216
        UNION ALL SELECT 217
    ) AS tbl
    WHERE tbl.filterValue_val = tbl_IN_VS_AND.filterCriterion_sv
)
And here is the time it took for 100 executions:
Beginning seventh test run...
"NOT EXISTS from derived UNION table" ET: 27920 ms
On this day, on this server, this query runs a full 35% faster than the quicker of our two original queries. Are there even faster ways to run this query? I would say the odds are pretty good that there are. However, I must admit that we have exhausted the current depth of my knowledge and experience, and that was only thanks to the derived table and NOT EXISTS techniques I discovered in some old newsgroup postings [2] [3]. I hope you'll be able to successfully adapt these techniques to your environment and always remember to do a little experimenting and draw performance conclusions based on testing with your servers and your data.
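If you are on SQL Server 2008 or later, one more variant may be worth adding to your own tests: the UNION ALL derived table can be written as a table value constructor instead. This is a sketch I have not benchmarked, and the VALUES syntax is my substitution, not part of the original experiment:

```sql
-- Sketch: same NOT EXISTS technique, but the filter values come from a
-- VALUES table value constructor (assumes SQL Server 2008+; not benchmarked).
SELECT @results = count(filterCriterion_sv)
FROM tbl_IN_VS_AND
WHERE NOT EXISTS (
    SELECT 1
    FROM (VALUES (214), (215), (216), (217)) AS tbl(filterValue_val)
    WHERE tbl.filterValue_val = tbl_IN_VS_AND.filterCriterion_sv
)
```

Logically it is the same query as the UNION ALL version, so I would expect a similar plan, but as always, measure on your own servers before drawing conclusions.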
Footnotes
[1] - http://en.wikipedia.org/wiki/Declarative_programming
[2] - http://groups.google.com/group/microsoft.public.sqlserver.programming/browse_thread/thread/34101ff50413c7d6/45a02da0dc7b3ced
[3] - http://groups.google.com/group/microsoft.public.sqlserver.server/browse_thread/thread/dfb1d7e766d89069/ee59b6e9d94ca222