This post comes off the back of my last one, When You’ve Managed to Confuse SQL With a Crazy IN Clause, where I looked at the issues caused by explicitly declaring a large number of values in an IN clause: the query processor ran out of internal resources and couldn’t produce a query plan.
My suggestion was to put the values into a temp table and carry out a semi-join.
That got me thinking: which of the two methods would perform better?
I’m going to run both methods over different numbers of values and see how they get on in terms of reads and total time.
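If you want to follow along and capture reads and timings yourself, something along these lines will do the job (just a sketch of the approach, using STATISTICS IO and TIME rather than anything fancy; it's not necessarily the exact harness behind the numbers below):

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- query under test goes here

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

The logical reads and elapsed times then show up in the messages output.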
1 Value
Let’s start with just a single value in the IN clause. I’ll use the following query
SELECT ID
FROM person
WHERE ID IN (266238)
The execution plan was as below. The value in the IN clause has been moved into the predicate of the seek, which makes total sense.
How did it do?
Reads: 3
Time: 1ms
Temp Table
Now let’s look at creating a temp table
CREATE TABLE #id (ID INT)
INSERT INTO #id (ID)
VALUES(266238)
SELECT ID
FROM person
WHERE ID IN (SELECT ID FROM #id)
Straight away we can see that there’s more going on when we use the temp table. We’ve got an extra scan of the temp table and, of course, a join. But how much has that extra work affected performance?
Reads: 5
Time: 13ms
So that’s interesting: with a single value, it’s more efficient to state it explicitly in the IN clause than it is to force SQL to do a bunch of extra work with a temp table.
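As an aside, the same semi-join can also be written with EXISTS, and SQL will typically give you an identical plan for either form (shown here purely as a sketch of the alternative syntax):

SELECT p.ID
FROM person AS p
WHERE EXISTS (SELECT 1
              FROM #id AS i
              WHERE i.ID = p.ID)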
But What About Multiple Values?
Let’s do the same thing but this time with 10 values
10 Values
SELECT ID
FROM person
WHERE ID IN (841455,
435621,
701019,
676699,
638892,
553042,
87037,
930664,
351649,
789353)
The execution plan is exactly the same as with a single value. The IN clause has been pushed down into the predicate of the seek. Let’s take a quick look at that predicate…
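It works out to something like the following rewrite (an illustration of the logic rather than the literal plan contents):

SELECT ID
FROM person
WHERE ID = 841455
   OR ID = 435621
   OR ID = 701019
   OR ID = 676699
   OR ID = 638892
   OR ID = 553042
   OR ID = 87037
   OR ID = 930664
   OR ID = 351649
   OR ID = 789353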
So we can see that SQL’s just translated it into multiple OR arguments. Nice, but how did it perform?
Reads: 30
Time: 7ms
Now for the same thing but this time we’ll pop the values into a temp table…
CREATE TABLE #id (ID INT)
INSERT INTO #id (ID)
VALUES(841455),
(435621),
(701019),
(676699),
(638892),
(553042),
(87037),
(930664),
(351649),
(789353)
SELECT ID
FROM person
WHERE ID IN (SELECT ID FROM #id)
Reads: 41
Time: 17ms
With 10 values we’re still seeing things performing better by putting the values into the IN clause.
But how well does this scale? Let’s carry on increasing the number of values and find out.
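If you fancy running this sort of test yourself, something like the following will build a temp table of n sample IDs to compare against an equivalent explicit list (a sketch that assumes the IDs come from person itself; it’s not the exact harness behind the numbers below):

DECLARE @n INT = 100

CREATE TABLE #id (ID INT)

-- grab @n existing IDs at random to use as the test values
INSERT INTO #id (ID)
SELECT TOP (@n) ID
FROM person
ORDER BY NEWID()

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT ID
FROM person
WHERE ID IN (SELECT ID FROM #id)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF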
Let’s keep adding values, all the way up to 100.
So it looks like popping a list of values into the IN clause is still managing to beat the temp table. That’s interesting; I’d honestly expected the opposite to be true. Also, look at the graph at 70 values: something’s going on there. We’re seeing a bump in reads but a reduction in time for the explicit values.
Let’s take a look at the execution plan and see if something’s changed…
Now that’s interesting: SQL’s decided to stop putting the values directly into the predicate and is now doing a constant scan. Essentially it’s doing the same thing as we’re doing with the temp table, but managing to do it more efficiently.
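Conceptually (and this is only an illustration, not what the optimiser literally rewrites the query into), a constant scan treats the list as a little in-memory rowset, much like a table value constructor, which then gets joined to person:

SELECT p.ID
FROM person AS p
JOIN (VALUES (841455),
             (435621),
             (701019)) AS v (ID)   -- list shortened, purely to show the shape
  ON p.ID = v.ID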
Shall we scale it out further, up to 10,000 values?
A bit of load and things are starting to get interesting. When it comes to reads, there’s not really much difference at all (the ‘IN explicit values’ line seems to disappear, but that’s only because the two are virtually identical). But look at the time: from around 1000 values we’re starting to see far better performance from the temp table.
What happened at 4000 rows?
What’s going on at 4000 rows? There’s a huge drop in reads; they pretty much halve for both the temp table and the explicit values. At the same time we see a slight increase in time for both.
Let’s take a look at the execution plans at 4000 rows.
In both cases, SQL has shifted away from the nested loops that it was using and is now sorting the data and doing a merge join. Another example of the optimiser at work, finding a better(?) plan at different loads.
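If you want to see the two plan shapes side by side, join hints will force each one (purely for investigation; I wouldn’t leave hints like these in real code):

-- the nested loops shape we saw at lower value counts
SELECT ID
FROM person
WHERE ID IN (SELECT ID FROM #id)
OPTION (LOOP JOIN)

-- the sort and merge join shape the optimiser switches to at around 4000 values
SELECT ID
FROM person
WHERE ID IN (SELECT ID FROM #id)
OPTION (MERGE JOIN)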
Let’s go silly and push out to 50,000 values
As you can see, once we get out to these big numbers of values, things really start to go bad for the explicitly defined IN clause. Run times fall behind quickly even though reads stay pretty much constant (I’m guessing there’s more going on with that constant scan behind the scenes than we’re seeing).
Conclusion
It’s probably no huge surprise to discover that if we’ve got a large number of values, we’re better off dropping them into a temp table and using that in our IN clause rather than defining them all explicitly as part of a huge IN clause.
What was quite interesting, though, was the point at which using a temp table became noticeably better. It wasn’t until we got to around 1000 values that the temp table approach really started taking over. Until then, the two were pretty much comparable.
One very interesting thing to notice with the temp table is that from around 4000 values there’s very little increase in run time as the number of values grows. It looks like that approach will scale very well. Even with 50,000 values it was still only taking a little over half a second, compared to over 30 seconds when using an explicit list of values.
This was just a very quick test to satisfy my own curiosity using a simple dataset. With different datasets, you may well see different results.
Let’s be really silly and try this with a million rows
And yes, I really have just created a script with a million values in the IN clause; I won’t post it up here.
Msg 701, Level 17, State 123, Line 19
There is insufficient system memory in resource pool ‘default’ to run this query.
Oops, OK, well that didn’t work. Worth a try.
Using the temp table with 1 million values, though, it completed in just under 3 seconds with 8320 reads. Pretty solid performance.
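For what it’s worth, filling the temp table with a million test values doesn’t need a million-line script; a number generator along these lines will do it (a sketch, not necessarily how the values above were produced):

CREATE TABLE #id (ID INT)

-- cross join a couple of system catalog views to get plenty of rows, then number them off
INSERT INTO #id (ID)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b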
If you’ve got this far, thanks for reading.
Full results
In case you’re interested, here are the full results of the tests that I did.