Why does this WHERE condition take forever to execute?

Question

Post reply

Why does this WHERE condition take forever to execute?

Bård Romstad

SSC Enthusiast

Points: 160
More actions
February 3, 2009 at 1:30 pm

#130272

This is probably basic stuff, but table1 (~50 000 rows) has an integer field (SomeIntVal) that references the identity field in table2 (~50 000 rows). Integrity is not enforced so values in table1.SomeIntVal may be NULL or non existing in table2.id.
I want to query all the "dirty" rows in table1:
SELECT count(*)
FROM table1
WHERE SomeIntVal IS NULL OR SomeIntVal NOT IN (SELECT id FROM table2)
Why does this query take forever to execute?
If I query either of the conditions alone it executes in less than a second.
SELECT count(*)
FROM table1
WHERE SomeIntVal IS NULL
SELECT count(*)
FROM table1
WHERE SomeIntVal NOT IN (SELECT id FROM table2)
A full example below, this one took about 7 seconds for the last query
--Create a table
CREATE TABLE #table1 (ID INT IDENTITY,sparsekey INT)
--Create a simple lookup table
CREATE TABLE #table2 (ID INT IDENTITY,someval INT)
GO
--Populate table2
INSERT INTO #table2
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
GO
INSERT INTO #table2
SELECT a.someval FROM #table2 a, #table2 b, #table2 c
ORDER BY c.someval, b.someval, a.someval
GO
INSERT INTO #table2
SELECT a.someval FROM #table2 a, #table2 b, #table2 c
ORDER BY c.someval, b.someval, a.someval
GO
--Populate the field sparsekey in table1 with sparse values that sometimes references #table2
INSERT INTO #table1
SELECT TOP 2000 CASE someval
WHEN 1 THEN NULL
WHEN 2 THEN 0
ELSE ID
END
FROM #table2
GO
DECLARE @StartTime datetime
--Get the count of sparsekey values that are NULL
SET @StartTime = GETDATE()
DECLARE @RetVal int
SELECT @RetVal = count(*) FROM #table1
WHERE sparsekey IS NULL
SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q1
--Get the count of sparsekey values that are NOT IN #table2
SET @StartTime = GETDATE()
SELECT @RetVal = count(*) FROM #table1
WHERE sparsekey NOT IN (SELECT ID FROM #table2)
SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q2
--Get the count of sparsekey values that are NULL OR NOT IN #table2
SET @StartTime = GETDATE()
SELECT @RetVal = count(*) FROM #table1
WHERE sparsekey IS NULL
OR sparsekey NOT IN (SELECT ID FROM #table2)
SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3
GO
DROP TABLE #table1
DROP TABLE #table2
GO

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

vishal.gamji SSCertifiable Points: 6126 More actions · Answer 1

Try changing your where clause to use an exists

WHERE SomeIntVal IS NULL OR SomeIntVal NOT EXISTS(SELECT 1 FROM FROM table2 WHERE id = SomeIntVal)

hope you have SomeIntVal indexed.

Bård Romstad SSC Enthusiast Points: 160 More actions · Answer 2

There you go. I knew there was a simple solution to this.

I guess I could also drop the first condition here as it will be redundant if id in table2 is NOT NULL

WHERE SomeIntVal IS NULL OR SomeIntVal NOT EXISTS(SELECT 1 FROM FROM table2 WHERE id = SomeIntVal)

Thanx.

Robert Ham-354502 SSC Veteran Points: 234 More actions · Answer 3

Using a left outer join will improve your performance. I added the NOT EXISTS and OUTER JOIN methods to your code.

--Get the count of sparsekey values that are NULL OR NOT IN #table2 using IN

SET @StartTime = GETDATE()

SELECT @RetVal = count(*) FROM #table1

WHERE sparsekey IS NULL

OR sparsekey NOT IN (SELECT ID FROM #table2)

SELECT 'Using IN ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

--Get the count of sparsekey values that are NULL OR NOT IN #table2 using NOT EXISTS

SET @StartTime = GETDATE()

SELECT @RetVal = count(*) FROM #table1

WHERE sparsekey IS NULL

OR NOT EXISTS(SELECT 1 FROM #table2 WHERE id = sparsekey)

SELECT 'Using NOT EXISTS ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

--Get the count of sparsekey values that are NULL OR NOT IN #table2 using left outer join

SET @StartTime = GETDATE()

SELECT @RetVal = count(*)

FROM #table1 as a

LEFT OUTER JOIN #table2 as b

ON sparsekey = b.ID

WHERE sparsekey IS NULL OR b.ID IS NULL

SELECT 'Using LEFT OUTER JOIN',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

results here are as follows:

method nrows ExecutionTime_Q3

--------------------- ----------- ----------------

Using IN 1990 8076

method nrows ExecutionTime_Q3

--------------------- ----------- ----------------

Using NOT EXISTS 1990 126

method nrows ExecutionTime_Q3

--------------------- ----------- ----------------

Using LEFT OUTER JOIN 1990 13

bvaljalo-1000038 Right there with Babe Points: 760 More actions · Answer 4

There's no question that using IN is the wrong option.

However, the choice between left outer join and the not exists clause appears a bit less 'cut and dry', IMHO. I suspect one method will be favorable under some conditions, and the other better in other conditions.

When I run the below (slightly modified version of your) script, using real tables with the ID's set to be primary keys (and thus indexed), and clearing the caches (like I was always told one should do) between running 'benchmarks' of this nature, I found the execution time of Using IN to run about 4600ms, whilst both of the other two options ran instantaneously (i.e. either 0 or 16ms).

Something interesting about this whole thing is ... If you turn on Display Actual Execution Plan, you'll discover that the Not In and Not Exists methods both end up using the exact same execution plan, yet our tests show that the former is orders of magnitude slower.

Also, strangely, if you add up the percentages on Cost %'s on both of these two methods, they add up to like 200% ... in fact, just the clustered index seek ALONE shows as cost: 100%. Whereas, in the AEP of the LOJ method, the percentages add to 100%, like they normally do.

Another strange thing is that in terms of query cost relative to the batch, the Not In and Not Exist method show up as 14% each, and the LOJ method relative cost is 72%...

CREATE TABLE table1 (ID INT IDENTITY,sparsekey INT, primary key(id))

--Create a simple lookup table

CREATE TABLE table2 (ID INT IDENTITY,someval INT, primary key(id))

--Populate table2

INSERT INTO table2

SELECT 1

UNION ALL

SELECT 2

UNION ALL

SELECT 3

INSERT INTO table2

SELECT a.someval FROM table2 a, table2 b, table2 c

ORDER BY c.someval, b.someval, a.someval

INSERT INTO table2

SELECT a.someval FROM table2 a, table2 b, table2 c

ORDER BY c.someval, b.someval, a.someval

--Populate the field sparsekey in table1 with sparse values that sometimes references table2

INSERT INTO table1

SELECT TOP 2000 CASE someval

WHEN 1 THEN NULL

WHEN 2 THEN 0

ELSE ID

END

FROM table2

GO

DBCC DROPCLEANBUFFERS

DBCC FREEPROCCACHE

--Get the count of sparsekey values that are NULL OR NOT IN table2 using IN

DECLARE @StartTime datetime, @retval int

SET @StartTime = GETDATE()

SELECT @RetVal = count(*) FROM table1

WHERE sparsekey IS NULL

OR sparsekey NOT IN (SELECT ID FROM table2)

SELECT 'Using IN ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

GO

DBCC DROPCLEANBUFFERS

DBCC FREEPROCCACHE

--Get the count of sparsekey values that are NULL OR NOT IN table2 using left outer join

DECLARE @StartTime datetime, @retval int

SET @StartTime = GETDATE()

SELECT @RetVal = count(*)

FROM table1 as a

LEFT OUTER JOIN table2 as b

ON sparsekey = b.ID

WHERE sparsekey IS NULL OR b.ID IS NULL

SELECT 'Using LEFT OUTER JOIN',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

GO

DBCC DROPCLEANBUFFERS

DBCC FREEPROCCACHE

--Get the count of sparsekey values that are NULL OR NOT IN table2 using NOT EXISTS

DECLARE @StartTime datetime, @retval int

SET @StartTime = GETDATE()

SELECT @RetVal = count(*) FROM table1

WHERE sparsekey IS NULL

OR NOT EXISTS(SELECT 1 FROM table2 WHERE id = sparsekey)

SELECT 'Using NOT EXISTS ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

GO

DROP TABLE table1

DROP TABLE table2

GO

Bruce W Cassidy SSChampion Points: 12668 More actions · Answer 5

[font="Verdana"]Welcome to the joys of execution plans. :w00t:

From experience, the left outer join always tend to perform at least as well as the not exists. So I always tend to write that statement now as a left outer join.

I also know that the SQL Server optimiser does some smart things with optimising common forms of "not in" and "not exists", and can potentially generate much the same query plan for these. However, I read up on this a couple of years ago (with the introduction of SQL Server 2005) and I'm not exactly sure what happens there any more.

My recommendation: stick with the left outer join. It's hard to beat.

[/font]

Max-146500 SSCarpal Tunnel Points: 4691 More actions · Answer 6

Yip, second (or third) the use of joins. The WHERE clause can also be omitted by placing the condition in the join too. May I also suggest catering for presumed NULLS:

SELECT e.EmpID, s.nvcProdName

FROM t_Employee e

LEFT OUTER JOIN t_Sale s

ON e.EmpID = s.EmpID

AND COALESCE( s.nvcProdName, '' ) <> ''

Max

Bruce W Cassidy SSChampion Points: 12668 More actions · Answer 7

[font="Verdana"]Does that actually work?

I know in Oracle, if you tried to put the filter criteria for the right hand side being empty into the ON clause, it didn't work as expected.

I might be able to eliminate a few where clauses!

[/font]

Max-146500 SSCarpal Tunnel Points: 4691 More actions · Answer 8

Bruce W Cassidy (2/3/2009)
[font="Verdana"]Does that actually work?
I know in Oracle, if you tried to put the filter criteria for the right hand side being empty into the ON clause, it didn't work as expected.
I might be able to eliminate a few where clauses!
[/font]

I hope so, frowned upon by some WHERE clause purists though >:(

Please let me know more about the Oracle rhs problems, otherwise I may have to eliminate a few join conditions. :w00t: :hehe:

Max

Bruce W Cassidy SSChampion Points: 12668 More actions · Answer 9

[font="Verdana"]Oh, who knows. I haven't played with Oracle for a couple of years now, so maybe the best thing is to do the old "suck it and see" approach. I'll try putting the filter criteria into my join conditions on SQL Server. :cool:[/font]

Max-146500 SSCarpal Tunnel Points: 4691 More actions · Answer 10

Hi Bruce,

Gints wrote some informative tests on join criteria: http://www.gplivna.eu/papers/sql_join_types.htm#p6.4, thanks Gints, this does help clarify a lot.

Max

Jack Corbett SSC Guru Points: 184394 More actions · Answer 11

Max (2/3/2009)
Yip, second (or third) the use of joins. The WHERE clause can also be omitted by placing the condition in the join too. May I also suggest catering for presumed NULLS:
SELECT e.EmpID, s.nvcProdName
FROM t_Employee e
LEFT OUTER JOIN t_Sale s
ON e.EmpID = s.EmpID
AND COALESCE( s.nvcProdName, '' ) <> ''

The only problem with this is that now the query has to scan every row in t_Sale to apply the coalesce.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

Max-146500 SSCarpal Tunnel Points: 4691 More actions · Answer 12

Jack Corbett (2/17/2009)

The only problem with this is that now the query has to scan every row in t_Sale to apply the coalesce.

Dinkum? Sorry about the slang - and no I'm not an aussie - it seems somehow appropriate though and will check it out. Every row, not only those being returned in the join? Do you know off-hand what the cost comparison would be vs. applying the same formatting to the WHERE clause, presuming, of course, that one is not diametrically opposed to checking for NULLS and therefore making the point moot?

Max

bvaljalo-1000038 Right there with Babe Points: 760 More actions · Answer 13

Max, I don't think it's possible for the query engine to 'know' what 'only the rows returned by the join' are ... considering that the comparison in question is a part OF the join.

This being said, I also don't think it'd make any difference if you moved the comparison to the WHERE clause either. I should think the execution plan would remain the same, IOW.

I think the source of the issue Jack has mentioned is actually due to the use of a Function in one side of the comparison, which would cause indexes to not be used ...

Just a guess on my part, though 😉