Efficiency of IN vs. other ways to filter rows

Question

Post reply

Efficiency of IN vs. other ways to filter rows

johnsql-193053

SSCommitted

Points: 1657
More actions
August 7, 2006 at 8:28 am

#113841

Knights,
I just wonder whether I should use IN or other better ways to filter rows based on a constraint set.
I have tables and their contents look like:
declare @YearStart varchar(50), @companyID int, @SYTermID int
set @companyID = 1370
set @SYTermID = 8
set @YearStart = dbo.fn_GetSchoolStartYearOfATerm (@companyID, @SYTermID)
--1/ @YearStart for selected year
select @YearStart as '@YearStart'
@YearStart
--------------------------------------------------
2005
(1 row(s) affected)
--2/ table 'Term' and select only termID belonging to @YearStart
declare @tt table (companyID int, termID int, description varchar(50), StartDate datetime, EndDate datetime)
insert @tt
select companyID, termID, description, StartDate,EndDate
from term where companyID=@companyID
and dbo.fn_GetSchoolStartYearOfATerm (@companyID, termID)=@YearStart
--> 2A/ check @tt
select * from @tt
companyID termID description StartDate EndDate
----------- ----------- -------------------------------------------------- ------------------------------------------------------ ------------------------------------------------------
1370 8 2005 - 2006 2005-08-26 00:00:00.000 2006-08-15 00:00:00.000
1370 9 Sem 1 05/06 2005-08-26 00:00:00.000 2006-01-14 00:00:00.000
1370 11 Q1 05-06 2005-08-26 00:00:00.000 2005-10-21 00:00:00.000
1370 12 Q2 05-06 2005-10-24 00:00:00.000 2006-01-14 00:00:00.000
1370 13 Q3 05-06 2006-01-16 00:00:00.000 2006-03-17 00:00:00.000
1370 14 Sem 2 05/06 2006-01-16 00:00:00.000 2006-08-15 00:00:00.000
1370 15 Q4 05-06 2006-03-27 00:00:00.000 2006-08-15 00:00:00.000
1370 26 Progress Q1 2005-08-26 00:00:00.000 2005-10-21 00:00:00.000
(8 row(s) affected)
--> 2B/ and termIDs belonging to @YearStart of '2005'
select companyID, termID from @tt
companyID termID
----------- -----------
1370 8
1370 9
1370 11
1370 12
1370 13
1370 14
1370 15
1370 26
(8 row(s) affected)
-- 3/ table 'Class' in general (not filetered by termIDs of @tt yet)
select * from Class where companyID=@companyID
companyID classID termID CourseID
----------- ----------- ----------- -----------
1370 14 1 25
1370 15 1 26
1370 16 1 27
1370 17 1 18
1370 18 1 19
1370 19 1 20
1370 45 1 17
1370 46 1 29
1370 272 8 60
1370 47 1 18
1370 48 5 22
1370 49 2 21
1370 50 2 38
1370 275 8 50
1370 276 9 57
1370 277 14 107
1370 278 9 54
-- 4/ table 'Class' and select only class belonging to entried of table @tt.
-- The results I want as follows using IN
select companyID, classID, TermID, CourseID from Class
where companyID=@companyID and termID=@SYTermID
and termID IN (select termID from @tt)
1370 272 8 60
1370 275 8 50
1370 276 9 57
1370 277 14 107
1370 278 9 54
My questions:
A/ I should use IN or some other BETTER ways for query? runing time? Advantages?...
B/ Any ideas about the drawbacks of IN?
C/ In which situations and scenarios, IN should be used in case other ways cannot be applied?
Thanks so much for any ideas, discussions.

Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Vladan SSC-Insane Points: 22072 More actions · Answer 1

Just my opinion, based on experience in our environment:

- IN is way to go if you want to name a few values : WHERE user.login IN ('tom','bob','mike')

- IN is absolutely horrible solution in cases where the number of values goes into several hundreds or thousands, if you list the values explicitly (not as SELECT value FROM table).. this can happen if the code is not in SQL, but part of application and the final construct containing several hundred/thousand values is then sent to SQL server. Terrible performance, and I wouldn't want to maintain and troubleshoot such code.

- about other situations - better test it in your environment, but any time you can find another working and well performing solution based on JOIN to a table, I would prefer it over IN. However, in some cases we found that IN performs quite well if the values are supplied as a SELECT statement. Also be aware that if you are writing a code that will be used by several clients in different environment, some of them can have problems while other are OK... in my opinion, JOIN is much safer.

/*EDIT*/

As to your example, I would write the query like this:

SELECT Class.companyID, Class.classID, Class.TermID, Class.CourseID

FROM Class

JOIN @tt ON @tt.termID = Class.TermID

WHERE companyID=@companyID AND termID=@SYTermID

David.Poole SSC Guru Points: 76361 More actions · Answer 2

Sometimes an EXISTS clause can give the desired results.

Let us suppose that we have a sales territory broken down into

Country - > Region - > Area.

The marketing dept are forever ~~buggering around with~~ realigning the various levels of the hierarchy but the application must not list a region that does not have areas.

You may select Regions WHERE IN (Select RegionID FROM dbo.Area AS A WHERE A.RegionID = R.RegionID)

The quicker method would be

WHERE EXISTS (Select RegionID FROM dbo.Area AS A WHERE A.RegionID = R.RegionID)

LinkedIn Profile

johnsql-193053 SSCommitted Points: 1657 More actions · Answer 3

Knights,

I applied the 3 ways to do the query, in order, IN, EXISTS, INNER JOIN. They give me the same DTS results

-- 4/A table 'Class' and select only class belonging to entried of table @tt

-- IN

select companyID, classID, TermID, CourseID from Class

where companyID=@companyID

and termID IN (select termID from @tt)

-- 4B/ using Exists

-- Exists

select companyID, classID, TermID, CourseID from Class cl

where cl.companyID=@companyID

and exists (select termID from @tt where termID=cl.termID)

-- 4C/ using inner join

-- INNER JOIN

select cl.companyID, cl.classID, t.TermID, cl.CourseID from Class cl

inner join @tt t on t.termID=cl.termID

where cl.companyID=@companyID

But when I refer to the Execution Plan, I get the overheads for the 3 method, in order:

Query 9: Query Cost (relative the bath): 19.88 %

Query 10: Query Cost (relative the bath): 19.88 %

Query 10: Query Cost (relative the bath): 20.42 %

Any ideas, dicussions?

Thanks in advance.

PW-201837 SSC-Insane Points: 20805 More actions · Answer 4

Couple of questions & points:

Use DBCC DropCleanBuffers before each test so that you're running with a 'cold' SQL data cache each time. You want perf differences to be related to the queries, not dependent on which tables are already read from disk and cached.

Is this the entire contents of a DTS exec SQL task ? If so, is @tt used for anything else ? If not, just make it a derived table and join to it.

Are the data volumes you've posted typical and representative ? i.e. will @tt typically have a small number of rows like 8, and not orders of magnitude more ? If yes, I don't think you need to worry about the details of how you express this. The differences between EXISTS, IN or JOIN'ing won't really become apparent until you have significantly more data than this.

Robert Davis One Orange Chip Points: 28027 More actions · Answer 5

The Exists clause is a lot faster than the IN clause for a very simple reason. With an IN clause, SQL Server gets the entire dataset returned by the query and processes it for matches. With an Exists clause, it processes the matches as part of the query and doesn't need to consider the resultset as whole individual. The same is true with an Inner Join.

Inner Join can be as fast or nearly as fast or nearly as fast as an Exists clause or sometimes even faster. Ocasionally, an Inner Join can be a lot slower because there are multiple matches for each record and the resulting recordset is a lot larger than it would be with an Exists clause. Also, if you are merely doing "Select *" the row size will be larger with an Inner Join and this will slow down the actual returning of data.

Your Exists clause can be sped up even more. When doing a subquery in an Exists clause, use Select 1 instead of Select * or Select FieldName. This is faster as SQL Server recognizes that it doesn't need to send back any data, just a constant value.

select companyID, classID, TermID, CourseID from Class cl

where cl.companyID=@companyID

and exists (select 1 from @tt where termID=cl.termID)

My blog: SQL Soldier[/url]
SQL Server Best Practices: SQL Server Best Practices
Twitter: @SQLSoldier
My book: Pro SQL Server 2008 Mirroring[/url]
Microsoft Certified Master: SQL Server, Data Platform MVP
Database Engineer at BlueMountain Capital Management[/url]

PW-201837 SSC-Insane Points: 20805 More actions · Answer 6

>> When doing a subquery in an Exists clause, use Select 1 instead of Select * or Select FieldName. This is faster as SQL Server recognizes that it doesn't need to send back any data, just a constant value.

That was true in version 6.5 and earlier, due to a bug/feature of the optimizer as inherited from Sybase.

It hasn't made a difference for quite a while, the query plan will be exactly the same. The optimizer now recognizes that what is being selected is irrelevant to determining existence.

Robert Davis One Orange Chip Points: 28027 More actions · Answer 7

I don't know if it is still true in SQL 2005 or not, but it is still true in SQL 2000.

My blog: SQL Soldier[/url]
SQL Server Best Practices: SQL Server Best Practices
Twitter: @SQLSoldier
My book: Pro SQL Server 2008 Mirroring[/url]
Microsoft Certified Master: SQL Server, Data Platform MVP
Database Engineer at BlueMountain Capital Management[/url]

PW-201837 SSC-Insane Points: 20805 More actions · Answer 8

*shrug*

Using the Pubs database on SQL2K, all 3 of these have an identical query plan:

select *

from authors as a1

where exists (select * from authors as a2

where a1.au_lname = a2.au_lname

and a1.au_id <> a2.au_id)

select *

from authors as a1

where exists (select 1 from authors as a2

where a1.au_lname = a2.au_lname

and a1.au_id <> a2.au_id)

select *

from authors as a1

where exists (select au_lname from authors as a2

where a1.au_lname = a2.au_lname

and a1.au_id <> a2.au_id)

I stopped using the "Select 1" trick years ago because it's no longer necessary.

[Edit] Note, the key point is that in all cases, an index seek is used on a2. We don't care what the data is, just whether 1 row exists.

The old optimizer bug that was fixed in ver 7.0, was that the query plan would change with "Select *" or "Select ColName", and sometimes the better performing index seek would be abandoned in exchange for a different plan to get data pages, pages which were subsequently discarded because the subquery doesn't care about data, just true/false.