February 3, 2009 at 1:30 pm
This is probably basic stuff, but table1 (~50 000 rows) has an integer field (SomeIntVal) that references the identity field in table2 (~50 000 rows). Integrity is not enforced so values in table1.SomeIntVal may be NULL or non existing in table2.id.
I want to query all the "dirty" rows in table1:
SELECT count(*)
FROM table1
WHERE SomeIntVal IS NULL OR SomeIntVal NOT IN (SELECT id FROM table2)
Why does this query take forever to execute?
If I query either of the conditions alone it executes in less than a second.
SELECT count(*)
FROM table1
WHERE SomeIntVal IS NULL
SELECT count(*)
FROM table1
WHERE SomeIntVal NOT IN (SELECT id FROM table2)
A full example below, this one took about 7 seconds for the last query
--Create a table
CREATE TABLE #table1 (ID INT IDENTITY,sparsekey INT)
--Create a simple lookup table
CREATE TABLE #table2 (ID INT IDENTITY,someval INT)
GO
--Populate table2
INSERT INTO #table2
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
GO
INSERT INTO #table2
SELECT a.someval FROM #table2 a, #table2 b, #table2 c
ORDER BY c.someval, b.someval, a.someval
GO
INSERT INTO #table2
SELECT a.someval FROM #table2 a, #table2 b, #table2 c
ORDER BY c.someval, b.someval, a.someval
GO
--Populate the field sparsekey in table1 with sparse values that sometimes references #table2
INSERT INTO #table1
SELECT TOP 2000 CASE someval
WHEN 1 THEN NULL
WHEN 2 THEN 0
ELSE ID
END
FROM #table2
GO
DECLARE @StartTime datetime
--Get the count of sparsekey values that are NULL
SET @StartTime = GETDATE()
DECLARE @RetVal int
SELECT @RetVal = count(*) FROM #table1
WHERE sparsekey IS NULL
SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q1
--Get the count of sparsekey values that are NOT IN #table2
SET @StartTime = GETDATE()
SELECT @RetVal = count(*) FROM #table1
WHERE sparsekey NOT IN (SELECT ID FROM #table2)
SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q2
--Get the count of sparsekey values that are NULL OR NOT IN #table2
SET @StartTime = GETDATE()
SELECT @RetVal = count(*) FROM #table1
WHERE sparsekey IS NULL
OR sparsekey NOT IN (SELECT ID FROM #table2)
SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3
GO
DROP TABLE #table1
DROP TABLE #table2
GO
February 3, 2009 at 1:35 pm
Try changing your where clause to use an exists
WHERE SomeIntVal IS NULL OR SomeIntVal NOT EXISTS(SELECT 1 FROM FROM table2 WHERE id = SomeIntVal)
hope you have SomeIntVal indexed.
February 3, 2009 at 1:43 pm
There you go. I knew there was a simple solution to this.
I guess I could also drop the first condition here as it will be redundant if id in table2 is NOT NULL
WHERE SomeIntVal IS NULL OR SomeIntVal NOT EXISTS(SELECT 1 FROM FROM table2 WHERE id = SomeIntVal)
Thanx.
February 3, 2009 at 2:22 pm
Using a left outer join will improve your performance. I added the NOT EXISTS and OUTER JOIN methods to your code.
--Get the count of sparsekey values that are NULL OR NOT IN #table2 using IN
SET @StartTime = GETDATE()
SELECT @RetVal = count(*) FROM #table1
WHERE sparsekey IS NULL
OR sparsekey NOT IN (SELECT ID FROM #table2)
SELECT 'Using IN ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3
--Get the count of sparsekey values that are NULL OR NOT IN #table2 using NOT EXISTS
SET @StartTime = GETDATE()
SELECT @RetVal = count(*) FROM #table1
WHERE sparsekey IS NULL
OR NOT EXISTS(SELECT 1 FROM #table2 WHERE id = sparsekey)
SELECT 'Using NOT EXISTS ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3
--Get the count of sparsekey values that are NULL OR NOT IN #table2 using left outer join
SET @StartTime = GETDATE()
SELECT @RetVal = count(*)
FROM #table1 as a
LEFT OUTER JOIN #table2 as b
ON sparsekey = b.ID
WHERE sparsekey IS NULL OR b.ID IS NULL
SELECT 'Using LEFT OUTER JOIN',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3
results here are as follows:
method nrows ExecutionTime_Q3
--------------------- ----------- ----------------
Using IN 1990 8076
method nrows ExecutionTime_Q3
--------------------- ----------- ----------------
Using NOT EXISTS 1990 126
method nrows ExecutionTime_Q3
--------------------- ----------- ----------------
Using LEFT OUTER JOIN 1990 13
February 3, 2009 at 5:48 pm
There's no question that using IN is the wrong option.
However, the choice between left outer join and the not exists clause appears a bit less 'cut and dry', IMHO. I suspect one method will be favorable under some conditions, and the other better in other conditions.
When I run the below (slightly modified version of your) script, using real tables with the ID's set to be primary keys (and thus indexed), and clearing the caches (like I was always told one should do) between running 'benchmarks' of this nature, I found the execution time of Using IN to run about 4600ms, whilst both of the other two options ran instantaneously (i.e. either 0 or 16ms).
Something interesting about this whole thing is ... If you turn on Display Actual Execution Plan, you'll discover that the Not In and Not Exists methods both end up using the exact same execution plan, yet our tests show that the former is orders of magnitude slower.
Also, strangely, if you add up the percentages on Cost %'s on both of these two methods, they add up to like 200% ... in fact, just the clustered index seek ALONE shows as cost: 100%. Whereas, in the AEP of the LOJ method, the percentages add to 100%, like they normally do.
Another strange thing is that in terms of query cost relative to the batch, the Not In and Not Exist method show up as 14% each, and the LOJ method relative cost is 72%...
CREATE TABLE table1 (ID INT IDENTITY,sparsekey INT, primary key(id))
--Create a simple lookup table
CREATE TABLE table2 (ID INT IDENTITY,someval INT, primary key(id))
--Populate table2
INSERT INTO table2
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
INSERT INTO table2
SELECT a.someval FROM table2 a, table2 b, table2 c
ORDER BY c.someval, b.someval, a.someval
INSERT INTO table2
SELECT a.someval FROM table2 a, table2 b, table2 c
ORDER BY c.someval, b.someval, a.someval
--Populate the field sparsekey in table1 with sparse values that sometimes references table2
INSERT INTO table1
SELECT TOP 2000 CASE someval
WHEN 1 THEN NULL
WHEN 2 THEN 0
ELSE ID
END
FROM table2
GO
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
--Get the count of sparsekey values that are NULL OR NOT IN table2 using IN
DECLARE @StartTime datetime, @retval int
SET @StartTime = GETDATE()
SELECT @RetVal = count(*) FROM table1
WHERE sparsekey IS NULL
OR sparsekey NOT IN (SELECT ID FROM table2)
SELECT 'Using IN ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3
GO
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
--Get the count of sparsekey values that are NULL OR NOT IN table2 using left outer join
DECLARE @StartTime datetime, @retval int
SET @StartTime = GETDATE()
SELECT @RetVal = count(*)
FROM table1 as a
LEFT OUTER JOIN table2 as b
ON sparsekey = b.ID
WHERE sparsekey IS NULL OR b.ID IS NULL
SELECT 'Using LEFT OUTER JOIN',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3
GO
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
--Get the count of sparsekey values that are NULL OR NOT IN table2 using NOT EXISTS
DECLARE @StartTime datetime, @retval int
SET @StartTime = GETDATE()
SELECT @RetVal = count(*) FROM table1
WHERE sparsekey IS NULL
OR NOT EXISTS(SELECT 1 FROM table2 WHERE id = sparsekey)
SELECT 'Using NOT EXISTS ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3
GO
DROP TABLE table1
DROP TABLE table2
GO
February 3, 2009 at 5:54 pm
[font="Verdana"]Welcome to the joys of execution plans. :w00t:
From experience, the left outer join always tend to perform at least as well as the not exists. So I always tend to write that statement now as a left outer join.
I also know that the SQL Server optimiser does some smart things with optimising common forms of "not in" and "not exists", and can potentially generate much the same query plan for these. However, I read up on this a couple of years ago (with the introduction of SQL Server 2005) and I'm not exactly sure what happens there any more.
My recommendation: stick with the left outer join. It's hard to beat.
[/font]
February 3, 2009 at 6:34 pm
Yip, second (or third) the use of joins. The WHERE clause can also be omitted by placing the condition in the join too. May I also suggest catering for presumed NULLS:
SELECT e.EmpID, s.nvcProdName
FROM t_Employee e
LEFT OUTER JOIN t_Sale s
ON e.EmpID = s.EmpID
AND COALESCE( s.nvcProdName, '' ) <> ''
Max
February 3, 2009 at 6:46 pm
[font="Verdana"]Does that actually work?
I know in Oracle, if you tried to put the filter criteria for the right hand side being empty into the ON clause, it didn't work as expected.
I might be able to eliminate a few where clauses!
[/font]
February 3, 2009 at 6:55 pm
Bruce W Cassidy (2/3/2009)
[font="Verdana"]Does that actually work?I know in Oracle, if you tried to put the filter criteria for the right hand side being empty into the ON clause, it didn't work as expected.
I might be able to eliminate a few where clauses!
[/font]
I hope so, frowned upon by some WHERE clause purists though >:(
Please let me know more about the Oracle rhs problems, otherwise I may have to eliminate a few join conditions. :w00t: :hehe:
Max
February 3, 2009 at 7:01 pm
[font="Verdana"]Oh, who knows. I haven't played with Oracle for a couple of years now, so maybe the best thing is to do the old "suck it and see" approach. I'll try putting the filter criteria into my join conditions on SQL Server. :cool:[/font]
February 17, 2009 at 3:54 am
Hi Bruce,
Gints wrote some informative tests on join criteria: http://www.gplivna.eu/papers/sql_join_types.htm#p6.4, thanks Gints, this does help clarify a lot.
Max
February 17, 2009 at 7:12 am
Max (2/3/2009)
Yip, second (or third) the use of joins. The WHERE clause can also be omitted by placing the condition in the join too. May I also suggest catering for presumed NULLS:
SELECT e.EmpID, s.nvcProdName
FROM t_Employee e
LEFT OUTER JOIN t_Sale s
ON e.EmpID = s.EmpID
AND COALESCE( s.nvcProdName, '' ) <> ''
The only problem with this is that now the query has to scan every row in t_Sale to apply the coalesce.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
February 17, 2009 at 1:09 pm
Jack Corbett (2/17/2009)
The only problem with this is that now the query has to scan every row in t_Sale to apply the coalesce.
Dinkum? Sorry about the slang - and no I'm not an aussie - it seems somehow appropriate though and will check it out. Every row, not only those being returned in the join? Do you know off-hand what the cost comparison would be vs. applying the same formatting to the WHERE clause, presuming, of course, that one is not diametrically opposed to checking for NULLS and therefore making the point moot?
Max
February 17, 2009 at 2:15 pm
Max, I don't think it's possible for the query engine to 'know' what 'only the rows returned by the join' are ... considering that the comparison in question is a part OF the join.
This being said, I also don't think it'd make any difference if you moved the comparison to the WHERE clause either. I should think the execution plan would remain the same, IOW.
I think the source of the issue Jack has mentioned is actually due to the use of a Function in one side of the comparison, which would cause indexes to not be used ...
Just a guess on my part, though 😉
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply