in clause vs not in clause

Question

Post reply

in clause vs not in clause

am-244616

SSCarpal Tunnel

Points: 4936
More actions
November 3, 2008 at 11:58 am

#383288

Guys,
I have an employee table in which each employee is categorized by employee type (up to 100 types). Currently I have 50 mill rows
in the table I run the following query against the database.
SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL
FROM EMPLOYEE WHERE EMP_TYP IN (1, 2, 4, ....) - UPTO 70 EMP TYPES IN "IN CLAUSE"
If I would change the above query to do not "in clause"
SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL
FROM EMPLOYEE WHERE EMP_TYP not IN (11, 23, 49, ....) - UPTO 30 EMP TYPES IN "IN CLAUSE"
I did not notice any big difference in the performance. Is there any difference between how the SQL engine treats "IN CLAUSE" VS
"CLAUSE" shouldnt the second query be faster.
Any suggestions and inputs would help
Thanks

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

bang725 Mr or Mrs. 500 Points: 577 More actions · Answer 1

bang725

Mr or Mrs. 500

Points: 577

November 3, 2008 at 12:09 pm

#893177

hmmmm...

" Not in " will take more time.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 2

To answer this question, you should compare the execution plans of the two two queries. You might be surprised to see that both are doing a clustered index or table scan.

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 3

bang725 (11/3/2008)
hmmmm...
" Not in " will take more time.

Heh... prove it. Got code? 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 4

Jeff Moden (11/3/2008)
bang725 (11/3/2008)
hmmmm...
" Not in " will take more time.
Heh... prove it. Got code? 😉

With no code, no idea of the data involved, my swag is both will take about the same amount of time as both will be doing a clustered index/table scan.

What say you Jeff, got a swag?

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 5

Yeah... my first swag would be that this might be a boondoggle... tell me which company in the world has 50 million employees as the op suggests. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 6

My second swag would be that, as you suggest, both will cause some sort of index scan.... but because the WHERE NOT IN has only 42.857% as many items to compare, the WHERE NOT IN will win the race.

But, let's stop guessing... I'm not gonna build a 50 million row test table, but I will build a million rows test table...

--===== Create and populate a 1,000,000 row test table.

-- Column "RowNum" has a range of 1 to 1,000,000 unique numbers

-- Column "SomeInt" has a range of 1 to 100 non-unique numbers

-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

-- for all rows.

-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

-- Jeff Moden

SELECT TOP 1000000

RowNum = IDENTITY(INT,1,1),

SomeInt = ABS(CHECKSUM(NEWID()))%100+1,

SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),

SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

SomeHex12 = RIGHT(NEWID(),12)

INTO dbo.JBMTest

FROM Master.dbo.SysColumns t1

CROSS JOIN Master.dbo.SysColumns t2

--===== Add a Clustered Primary Key for the test

ALTER TABLE dbo.JBMTest

ADD PRIMARY KEY CLUSTERED (RowNum)

--===== Add an index on the lookup column for the test

CREATE INDEX IX_JBMTest_SomeInt ON dbo.JBMTest (SomeInt)

... and then we can play. I know that we could do this with "BETWEEN" or some such, but that wouldn't be testing what the op wanted to know... so here goes...

SET STATISTICS TIME ON

SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12

FROM dbo.JBMTest

WHERE SomeInt IN (1,2,3,4,5,6,7,8,9,10,

11,12,13,14,15,16,17,18,19,20,

21,22,23,24,25,26,27,28,29,30,

31,32,33,34,35,36,37,38,39,40,

41,42,43,44,45,46,47,48,49,50,

51,52,53,54,55,56,57,58,59,60,

61,62,63,64,65,66,67,68,69,70)

SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12

FROM dbo.JBMTest

WHERE SomeInt NOT IN (71,72,73,74,75,76,77,78,79,80,

81,82,83,84,85,86,87,88,89,90,

91,92,93,94,95,96,97,98,99,100)

SET STATISTICS TIME OFF

From that, I get the following results...

[font="Courier New"]

(699558 row(s) affected)

SQL Server Execution Times:

CPU time = 4641 ms, elapsed time = 35836 ms.

(699558 row(s) affected)

SQL Server Execution Times:

CPU time = 3469 ms, elapsed time = 35402 ms.[/font]

As you can see, the CPU didn't work very hard at all... the time to display is the killer and is about equal. But, do notice the CPU worked more than a second less on the WHERE NOT IN. Actual Execution Plan explains why.

The real key here is that no matter which method you use to return 70% of the rows of a 50 million table, you'll probably run the machine out of memory and the DBA is probably going to chase you until your heals smoke or (s)he manages to get a well aimed pork chop off. 😛

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

RBarryYoung SSC Guru Points: 143329 More actions · Answer 7

Jeff Moden (11/4/2008)
Yeah... my first swag would be that this might be a boondoggel... tell me which company in the world has 50 million employees as the op suggests. 😉

China, Inc.? 😀

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 8

Heh... or maybe "Stormtrooper Clones'R'Us" 😛

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

krayknot SSCertifiable Points: 7998 More actions · Answer 9

am (11/3/2008)
Guys,
I have an employee table in which each employee is categorized by employee type (up to 100 types). Currently I have 50 mill rows
in the table I run the following query against the database.
SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL
FROM EMPLOYEE WHERE EMP_TYP IN (1, 2, 4, ....) - UPTO 70 EMP TYPES IN "IN CLAUSE"
If I would change the above query to do not "in clause"
SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL
FROM EMPLOYEE WHERE EMP_TYP not IN (11, 23, 49, ....) - UPTO 30 EMP TYPES IN "IN CLAUSE"
I did not notice any big difference in the performance. Is there any difference between how the SQL engine treats "IN CLAUSE" VS
"CLAUSE" shouldnt the second query be faster.
Any suggestions and inputs would help
Thanks

It only depends on the number of rows you have in the dB and the number of resultant rows after applying the condition

kshitij kumar
kshitij@krayknot.com
www.krayknot.com

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 10

krayknot (11/4/2008)
It only depends on the number of rows you have in the dB and the number of resultant rows after applying the condition

Absolutely not true. Take a look at the test code I posted... both return the exact same number of rows.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 11

Actually Jeff - I do think there's something more afoot. I think you've managed to get it to use seeks because the IN's and the NOT IN's are all contiguous.

If you were to try:

SET STATISTICS TIME ON

SELECT count(RowNum)

FROM dbo.JBMTest

WHERE SomeInt IN (2,4,6,8,10,

12,14,16,18,20,

22,24,26,28,30,

32,34,36,38,40,

42,44,46,48,50,

52,54,56,58,60,

62,64,66,68,70,

72,74,76,78,80,

82,84,86,88,90,

92,94,96,98,100)

SELECT count(RowNum)

FROM dbo.JBMTest

WHERE SomeInt NOT IN (1,3,5,7,9,

11,13,15,17,19,

21,23,25,27,29,

31,33,35,37,39,

41,43,45,47,49,

51,53,55,57,59,

61,63,65,67,69,

71,73,75,77,79,

81,83,85,87,89,

91,93,95,97,99)

SELECT count(RowNum)

FROM dbo.JBMTest

WHERE SomeInt IN (1,3,5,7,9,

11,13,15,17,19,

21,23,25,27,29,

31,33,35,37,39,

41,43,45,47,49,

51,53,55,57,59,

61,63,65,67,69,

71,73,75,77,79,

81,83,85,87,89,

91,93,95,97,99)

SELECT count(RowNum)

FROM dbo.JBMTest

WHERE SomeInt NOT IN (2,4,6,8,10,

12,14,16,18,20,

22,24,26,28,30,

32,34,36,38,40,

42,44,46,48,50,

52,54,56,58,60,

62,64,66,68,70,

72,74,76,78,80,

82,84,86,88,90,

92,94,96,98,100)

SET STATISTICS TIME OFF

All of a sudden, you get:

--IN evens
SQL Server Execution Times:
CPU time = 63 ms, elapsed time = 65 ms.
--NOT IN odds
SQL Server Execution Times:
CPU time = 1170 ms, elapsed time = 595 ms.
--IN odds
SQL Server Execution Times:
CPU time = 62 ms, elapsed time = 62 ms.
--NOT IN evens.
SQL Server Execution Times:
CPU time = 1139 ms, elapsed time = 593 ms.

So - it's not always the same.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 12

Do your tests all come back with the same rowcounts? Remember... it's random data and SomeInt has different counts for different values.

And, no... I didn't get any Index Seeks... both were Clustered Index Scans.

I'll try your code to see what's going on there... thanks, Matt.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 13

Jeff Moden (11/5/2008)
Do your tests all come back with the same rowcounts? Remember... it's random data and SomeInt has different counts for different values.
And, no... I didn't get any Index Seeks... both were Clustered Index Scans.
I'll try your code to see what's going on there... thanks, Matt.

I forgot that you were returning most of the fields and a lot of the rows (we must be over the "tipping point" which is causing the scan). In mine - the NOT IN still keeps scanning, but the IN seeks.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 14

Heh... I see now... first, of course the rowcounts will be/are identical. And, of course your testing shows that WHERE IN will be much faster... you changed the problem from a row retrieval to a simple count. If you go back to the original problem of returning rows, WHERE NOT IN still wins even using your good 50/50 criteria... 🙂

SET STATISTICS TIME ON

SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12

FROM dbo.JBMTest

WHERE SomeInt IN (2,4,6,8,10,

12,14,16,18,20,

22,24,26,28,30,

32,34,36,38,40,

42,44,46,48,50,

52,54,56,58,60,

62,64,66,68,70,

72,74,76,78,80,

82,84,86,88,90,

92,94,96,98,100)

SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12

FROM dbo.JBMTest

WHERE SomeInt NOT IN (1,3,5,7,9,

11,13,15,17,19,

21,23,25,27,29,

31,33,35,37,39,

41,43,45,47,49,

51,53,55,57,59,

61,63,65,67,69,

71,73,75,77,79,

81,83,85,87,89,

91,93,95,97,99)

SET STATISTICS TIME OFF

[font="Courier New"]

(500741 row(s) affected)

SQL Server Execution Times:

CPU time = 4719 ms, elapsed time = 26690 ms.

(500741 row(s) affected)

SQL Server Execution Times:

CPU time = 3765 ms, elapsed time = 25579 ms.

[/font]

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)