November 3, 2008 at 11:58 am
Guys,
I have an employee table in which each employee is categorized by employee type (up to 100 types). Currently I have 50 mill rows
in the table I run the following query against the database.
SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL
FROM EMPLOYEE WHERE EMP_TYP IN (1, 2, 4, ....) - UPTO 70 EMP TYPES IN "IN CLAUSE"
If I would change the above query to do not "in clause"
SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL
FROM EMPLOYEE WHERE EMP_TYP not IN (11, 23, 49, ....) - UPTO 30 EMP TYPES IN "IN CLAUSE"
I did not notice any big difference in the performance. Is there any difference between how the SQL engine treats "IN CLAUSE" VS
"CLAUSE" shouldnt the second query be faster.
Any suggestions and inputs would help
Thanks
November 3, 2008 at 12:09 pm
hmmmm...
" Not in " will take more time.
November 3, 2008 at 12:10 pm
To answer this question, you should compare the execution plans of the two two queries. You might be surprised to see that both are doing a clustered index or table scan.
November 3, 2008 at 9:05 pm
bang725 (11/3/2008)
hmmmm..." Not in " will take more time.
Heh... prove it. Got code? 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
November 3, 2008 at 9:08 pm
Jeff Moden (11/3/2008)
bang725 (11/3/2008)
hmmmm..." Not in " will take more time.
Heh... prove it. Got code? 😉
With no code, no idea of the data involved, my swag is both will take about the same amount of time as both will be doing a clustered index/table scan.
What say you Jeff, got a swag?
November 4, 2008 at 8:06 pm
Yeah... my first swag would be that this might be a boondoggle... tell me which company in the world has 50 million employees as the op suggests. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
November 4, 2008 at 8:22 pm
My second swag would be that, as you suggest, both will cause some sort of index scan.... but because the WHERE NOT IN has only 42.857% as many items to compare, the WHERE NOT IN will win the race.
But, let's stop guessing... I'm not gonna build a 50 million row test table, but I will build a million rows test table...
--===== Create and populate a 1,000,000 row test table.
-- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
-- Column "SomeInt" has a range of 1 to 100 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%100+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== Add a Clustered Primary Key for the test
ALTER TABLE dbo.JBMTest
ADD PRIMARY KEY CLUSTERED (RowNum)
--===== Add an index on the lookup column for the test
CREATE INDEX IX_JBMTest_SomeInt ON dbo.JBMTest (SomeInt)
... and then we can play. I know that we could do this with "BETWEEN" or some such, but that wouldn't be testing what the op wanted to know... so here goes...
SET STATISTICS TIME ON
SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12
FROM dbo.JBMTest
WHERE SomeInt IN (1,2,3,4,5,6,7,8,9,10,
11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,
41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,
61,62,63,64,65,66,67,68,69,70)
SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12
FROM dbo.JBMTest
WHERE SomeInt NOT IN (71,72,73,74,75,76,77,78,79,80,
81,82,83,84,85,86,87,88,89,90,
91,92,93,94,95,96,97,98,99,100)
SET STATISTICS TIME OFF
From that, I get the following results...
[font="Courier New"]
(699558 row(s) affected)
SQL Server Execution Times:
CPU time = 4641 ms, elapsed time = 35836 ms.
(699558 row(s) affected)
SQL Server Execution Times:
CPU time = 3469 ms, elapsed time = 35402 ms.[/font]
As you can see, the CPU didn't work very hard at all... the time to display is the killer and is about equal. But, do notice the CPU worked more than a second less on the WHERE NOT IN. Actual Execution Plan explains why.
The real key here is that no matter which method you use to return 70% of the rows of a 50 million table, you'll probably run the machine out of memory and the DBA is probably going to chase you until your heals smoke or (s)he manages to get a well aimed pork chop off. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
November 4, 2008 at 8:23 pm
Jeff Moden (11/4/2008)
Yeah... my first swag would be that this might be a boondoggel... tell me which company in the world has 50 million employees as the op suggests. 😉
China, Inc.? 😀
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
November 4, 2008 at 8:25 pm
Heh... or maybe "Stormtrooper Clones'R'Us" 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
November 4, 2008 at 9:27 pm
am (11/3/2008)
Guys,I have an employee table in which each employee is categorized by employee type (up to 100 types). Currently I have 50 mill rows
in the table I run the following query against the database.
SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL
FROM EMPLOYEE WHERE EMP_TYP IN (1, 2, 4, ....) - UPTO 70 EMP TYPES IN "IN CLAUSE"
If I would change the above query to do not "in clause"
SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL
FROM EMPLOYEE WHERE EMP_TYP not IN (11, 23, 49, ....) - UPTO 30 EMP TYPES IN "IN CLAUSE"
I did not notice any big difference in the performance. Is there any difference between how the SQL engine treats "IN CLAUSE" VS
"CLAUSE" shouldnt the second query be faster.
Any suggestions and inputs would help
Thanks
It only depends on the number of rows you have in the dB and the number of resultant rows after applying the condition
kshitij kumar
kshitij@krayknot.com
www.krayknot.com
November 5, 2008 at 6:08 pm
krayknot (11/4/2008)
It only depends on the number of rows you have in the dB and the number of resultant rows after applying the condition
Absolutely not true. Take a look at the test code I posted... both return the exact same number of rows.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 5, 2008 at 6:52 pm
Actually Jeff - I do think there's something more afoot. I think you've managed to get it to use seeks because the IN's and the NOT IN's are all contiguous.
If you were to try:
SET STATISTICS TIME ON
SELECT count(RowNum)
FROM dbo.JBMTest
WHERE SomeInt IN (2,4,6,8,10,
12,14,16,18,20,
22,24,26,28,30,
32,34,36,38,40,
42,44,46,48,50,
52,54,56,58,60,
62,64,66,68,70,
72,74,76,78,80,
82,84,86,88,90,
92,94,96,98,100)
SELECT count(RowNum)
FROM dbo.JBMTest
WHERE SomeInt NOT IN (1,3,5,7,9,
11,13,15,17,19,
21,23,25,27,29,
31,33,35,37,39,
41,43,45,47,49,
51,53,55,57,59,
61,63,65,67,69,
71,73,75,77,79,
81,83,85,87,89,
91,93,95,97,99)
SELECT count(RowNum)
FROM dbo.JBMTest
WHERE SomeInt IN (1,3,5,7,9,
11,13,15,17,19,
21,23,25,27,29,
31,33,35,37,39,
41,43,45,47,49,
51,53,55,57,59,
61,63,65,67,69,
71,73,75,77,79,
81,83,85,87,89,
91,93,95,97,99)
SELECT count(RowNum)
FROM dbo.JBMTest
WHERE SomeInt NOT IN (2,4,6,8,10,
12,14,16,18,20,
22,24,26,28,30,
32,34,36,38,40,
42,44,46,48,50,
52,54,56,58,60,
62,64,66,68,70,
72,74,76,78,80,
82,84,86,88,90,
92,94,96,98,100)
SET STATISTICS TIME OFF
All of a sudden, you get:
--IN evens
SQL Server Execution Times:
CPU time = 63 ms, elapsed time = 65 ms.
--NOT IN odds
SQL Server Execution Times:
CPU time = 1170 ms, elapsed time = 595 ms.
--IN odds
SQL Server Execution Times:
CPU time = 62 ms, elapsed time = 62 ms.
--NOT IN evens.
SQL Server Execution Times:
CPU time = 1139 ms, elapsed time = 593 ms.
So - it's not always the same.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 5, 2008 at 7:13 pm
Do your tests all come back with the same rowcounts? Remember... it's random data and SomeInt has different counts for different values.
And, no... I didn't get any Index Seeks... both were Clustered Index Scans.
I'll try your code to see what's going on there... thanks, Matt.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 5, 2008 at 7:22 pm
Jeff Moden (11/5/2008)
Do your tests all come back with the same rowcounts? Remember... it's random data and SomeInt has different counts for different values.And, no... I didn't get any Index Seeks... both were Clustered Index Scans.
I'll try your code to see what's going on there... thanks, Matt.
I forgot that you were returning most of the fields and a lot of the rows (we must be over the "tipping point" which is causing the scan). In mine - the NOT IN still keeps scanning, but the IN seeks.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 5, 2008 at 7:23 pm
Heh... I see now... first, of course the rowcounts will be/are identical. And, of course your testing shows that WHERE IN will be much faster... you changed the problem from a row retrieval to a simple count. If you go back to the original problem of returning rows, WHERE NOT IN still wins even using your good 50/50 criteria... 🙂
SET STATISTICS TIME ON
SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12
FROM dbo.JBMTest
WHERE SomeInt IN (2,4,6,8,10,
12,14,16,18,20,
22,24,26,28,30,
32,34,36,38,40,
42,44,46,48,50,
52,54,56,58,60,
62,64,66,68,70,
72,74,76,78,80,
82,84,86,88,90,
92,94,96,98,100)
SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12
FROM dbo.JBMTest
WHERE SomeInt NOT IN (1,3,5,7,9,
11,13,15,17,19,
21,23,25,27,29,
31,33,35,37,39,
41,43,45,47,49,
51,53,55,57,59,
61,63,65,67,69,
71,73,75,77,79,
81,83,85,87,89,
91,93,95,97,99)
SET STATISTICS TIME OFF
[font="Courier New"]
(500741 row(s) affected)
SQL Server Execution Times:
CPU time = 4719 ms, elapsed time = 26690 ms.
(500741 row(s) affected)
SQL Server Execution Times:
CPU time = 3765 ms, elapsed time = 25579 ms.
[/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 19 total)
You must be logged in to reply to this topic. Login to reply