If you’re a DBA or database developer, chances are you will have to write or rewrite queries from time to time. Rewriting may be for new features or simply to improve the performance of existing code.
There are tens of thousands of tips and tricks for tuning. Just about every one of them you can find online. Though I still see many developers writing queries in less than efficient ways. I’ve even been told that tuning doesn’t matter much anymore since the advent of SSD and FLASH storage. If only this were true.
Today I’d like to discuss IN (Transact-SQL) and a neat tip that I learned from Joe Sack. Using IN with a list of values is the same as a list of OR’s.
Let’s take a look at the following queries:
IN with 15 vs 16 values
SELECT SalesOrderID, ProductID, UnitPrice
FROM Sales.SalesOrderDetail
WHERE ModifiedDate in (‘2007-01-05’, ‘2007-01-17’, ‘2007-01-26’, ‘2007-02-01’, ‘2007-02-04’, ‘2007-02-14’, ‘2007-02-28’, ‘2007-03-06’, ‘2007-03-07’, ‘2007-03-10’, ‘2007-04-15’, ‘2007-04-17’, ‘2007-08-08’, ‘2007-11-11’, ‘2007-12-25’)
Vs
SELECT SalesOrderID, ProductID, UnitPrice
FROM Sales.SalesOrderDetail
WHERE ModifiedDate in (‘2007-01-05′, ‘2007-01-17’, ‘2007-01-26’, ‘2007-02-01’, ‘2007-02-04’, ‘2007-02-14’, ‘2007-02-28’, ‘2007-03-06’, ‘2007-03-07’, ‘2007-03-10’, ‘2007-04-15’, ‘2007-04-17’, ‘2007-08-08’, ‘2007-11-11’, ‘2007-12-25’, ‘2007-12-26’)
Normally, you’d think that these two queries would run with the exact same plan. After all, they are the same query but one has 16 values in the IN while the other only has 15. Let’s examine the two plans.
As you can see the second query is much slower and the extra value in the IN caused late filtering. This is a limitation on some types of operators such as this clustered index scan.
There isn’t just a limitation of 15 input values. There’s also one at 64. On the 65th input value the list will be converted to a constant scan which is then sorted and joined. Most interestingly enough is that the list in my demo query is already sorted ascending.
IN with 64 vs 65 values
SELECT DISTINCT city
FROM Person.Address
WHERE city in (
‘Boulogne-Billancourt’,’Boulogne-sur-Mer’,’Bountiful’,‘Bracknell’,’Bradenton’,’Braintree’,’Brampton’,
‘Branch’,’Branson’,’Braunschweig’,’Bremerton’,
‘Brisbane’,’Brossard’,’Burbank’,’Burien’,
‘Burlingame’,’Burnaby’,’Bury’,’Byron’,’Calgary’,
‘Caloundra’,’Camarillo’,’Cambridge’,’Campbellsville’,
‘Canoga Park’,’Carnation’,’Carol Stream’,’Carrollton’,
‘Carson’,’Casper’,’Cedar City’,’Cedar Park’,
‘Central Valley’,’Cergy’,’Cerritos’,’Chalk Riber’,
‘Chandler’,’Chantilly’,’Charlotte’,’Chatou’,
‘Cheektowaga’,’Chehalis’,’Cheltenham’,’Cheyenne’,
‘Chicago’,’Chula Vista’,’Cincinnati’,’Citrus Heights’,
‘City Of Commerce’,’Clackamas’,’Clarkston’,’Clay’,
‘Clearwater’,’Cliffside’,’Cloverdale’,’Coffs Harbour’,
‘College Station’,’Colma’,’Colombes’,’Colomiers’,
‘Columbus’,’Concord’,’Coronado’,’Corpus Christi’
)
Vs
SELECT DISTINCT city
FROM Person.Address
WHERE city in (
‘Boulogne-Billancourt’,’Boulogne-sur-Mer’,’Bountiful’,‘Bracknell’,’Bradenton’,’Braintree’,’Brampton’,
‘Branch’,’Branson’,’Braunschweig’,’Bremerton’,
‘Brisbane’,’Brossard’,’Burbank’,’Burien’,
‘Burlingame’,’Burnaby’,’Bury’,’Byron’,’Calgary’,
‘Caloundra’,’Camarillo’,’Cambridge’,’Campbellsville’,
‘Canoga Park’,’Carnation’,’Carol Stream’,’Carrollton’,
‘Carson’,’Casper’,’Cedar City’,’Cedar Park’,
‘Central Valley’,’Cergy’,’Cerritos’,’Chalk Riber’,
‘Chandler’,’Chantilly’,’Charlotte’,’Chatou’,
‘Cheektowaga’,’Chehalis’,’Cheltenham’,’Cheyenne’,
‘Chicago’,’Chula Vista’,’Cincinnati’,’Citrus Heights’,
‘City Of Commerce’,’Clackamas’,’Clarkston’,’Clay’,
‘Clearwater’,’Cliffside’,’Cloverdale’,’Coffs Harbour’,
‘College Station’,’Colma’,’Colombes’,’Colomiers’,
‘Columbus’,’Concord’,’Coronado’,’Corpus Christi’,’Seattle’
)
Here we see a significant difference between the two query plans by simply adding ‘Seattle’ to the list.
How do we work around these limitations?
I’ve always liked joining lists instead of using IN and NOT IN. For IN a simple INNER JOIN works great. For NOT IN a LEFT JOIN WHERE VALUE IS NULL works great as well. Let’s look at the IN
DECLARE @in TABLE (city nvarchar(60))
INSERT INTO @in
VALUES (‘Boulogne-Billancourt’),(‘Boulogne-sur-Mer’),(‘Bountiful’),(‘Bracknell’),(‘Bradenton’),(‘Braintree’),(‘Brampton’),
(‘Branch’),(‘Branson’),(‘Braunschweig’),(‘Bremerton’),
(‘Brisbane’),(‘Brossard’),(‘Burbank’),(‘Burien’),
(‘Burlingame’),(‘Burnaby’),(‘Bury’),(‘Byron’),(‘Calgary’),
(‘Caloundra’),(‘Camarillo’),(‘Cambridge’),(‘Campbellsville’),
(‘Canoga Park’),(‘Carnation’),(‘Carol Stream’),(‘Carrollton’),
(‘Carson’),(‘Casper’),(‘Cedar City’),(‘Cedar Park’),
(‘Central Valley’),(‘Cergy’),(‘Cerritos’),(‘Chalk Riber’),
(‘Chandler’),(‘Chantilly’),(‘Charlotte’),(‘Chatou’),
(‘Cheektowaga’),(‘Chehalis’),(‘Cheltenham’),(‘Cheyenne’),
(‘Chicago’),(‘Chula Vista’),(‘Cincinnati’),(‘Citrus Heights’),
(‘City Of Commerce’),(‘Clackamas’),(‘Clarkston’),(‘Clay’),
(‘Clearwater’),(‘Cliffside’),(‘Cloverdale’),(‘Coffs Harbour’),
(‘College Station’),(‘Colma’),(‘Colombes’),(‘Colomiers’),
(‘Columbus’),(‘Concord’),(‘Coronado’),(‘Corpus Christi’),(‘Seattle’)
SELECT DISTINCT a.city
FROM Person.Address a
INNER JOIN @in i on (i.city = a.City)
Since we don’t have a table to join with already we create a table variable and insert the 65 values. Then we join the table variable to our query and you can see that this approach appears to cost much less. Notice that the IN forces an index scan while the join uses an index seek.
Summary
I hope you’ve found this post useful. We all know the saying “It depends”. It’s a phrase to live by in SQL Server and one you should always keep in mind. Remember that adequate testing is always needed for any proper tuning job.
When tuning or writing new queries take this tip into account. IN and OR can be limited and these limitations can slow your query. Test with an IN and with a JOIN to see which is better for your environment.
If you liked this post you may want to read some more from me at: http://www.sqlservercentral.com/blogs/confessions-of-a-microsoft-addict/ and https://sqltechblog.com/