December 16, 2022 at 3:38 pm
I thought I understood the "NOT IN" logic very well until I run into a case where the presence of NULL values affects the result in a way I was not expecting. The setup is super simple:
CREATE TABLE A1(ID1 INT);
CREATE TABLE A2(ID2 INT);
INSERT INTO A1(ID1) VALUES (1), (2), (NULL);
INSERT INTO A2(ID2) VALUES (1), (3), (NULL);
The surprising (to me) fact is that the following query:
SELECT ID1 FROM A1
WHERE ID1 NOT IN (SELECT ID2 FROM A2);
returns no rows, when thought it should return the row where ID1 = 2. And changing the query to:
SELECT ID1 FROM A1
WHERE ID1 NOT IN (SELECT ID2 FROM A2 WHERE ID2 IS NOT NULL);
returns the (expected) row with ID1 =2.
So it looks like "ID1 NOT IN " is translated as (ID1 <> 1 AND ID1 <> 3 AND ID1 <> NULL), instead of NOT (ID1 = 1 OR ID1 = 3 OR ID1=NULL), is that right?
December 16, 2022 at 5:42 pm
The problem is documented (well, kind of) in the Microsoft Documentation. They "cop" out on explaining exactly what could happen by "only" saying that having a NULL in the object of the IN can produce "unexpected results".
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/in-transact-sql
If you look at the Nested Loops operator in the Actual Execution Plan, the reason becomes crystal clear... if you haven't messed up your settings in SQL Server, a NULL cannot be compared to anything else with a relational operator and so the predicate ends up looking like what's in the WHERE clause of the code below (I included the results from both tables so you can see why... change the * to A2.* to see what's returned for A2)...
SELECT * FROM A1,A2
WHERE [A1].[ID1] IS NULL
OR [A2].[ID2] IS NULL
OR [A1].[ID1]=[A2].[ID2]
That's also why your "NOT NULL" addition worked.
Basically, a NULL in the object of the IN turns it into a FULL OUTER JOIN and something will ALWAYS be returned from A2, unless otherwise excluded like you did with your NOT NULL addition.
Having a NULL in the outer table (A1 in this case), won't cause the same problem. Any NULL in the object of the IN comparison operation will.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2022 at 6:41 pm
Yes, the query plan makes it clear, so I have to accept it, but the result still feels "wrong" to me because 2 is not in the result set (1,3,NULL). Anyway, it is what it is. Thank you.
December 16, 2022 at 7:44 pm
I agree... since NULL is not "nothing" nor even equal to another NULL, I don't know why they made it evaluate for NULL behind the scenes at all or why they didn't make it where IS NOT NULL AND...
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2022 at 1:54 pm
Just a quick thought, if one uses ISNULL one can lessen the internal cardinality from the NOT IN subquery that are pushed into the nested loop.
😎
Here is an example:
USE TEEST;
GO
SET NOCOUNT ON;
GO
DECLARE @A1 TABLE (ID1 INT);
DECLARE @A2 TABLE (ID2 INT);
INSERT INTO @A1(ID1) VALUES (1), (2), (NULL);
INSERT INTO @A2(ID2) VALUES (1), (3), (NULL);
SELECT A1.ID1 FROM @A1 A1
WHERE ISNULL(A1.ID1,-1) NOT IN
(
SELECT ISNULL(A2.ID2,-2) FROM @A2 A2
)
AND A1.ID1 IS NOT NULL;
SELECT A1.ID1 FROM @A1 A1
WHERE A1.ID1 NOT IN (SELECT A2.ID2 FROM @A2 A2 WHERE A2.ID2 IS NOT NULL);
I must say that I find it unfortunate that SQL Server handles NULL as UNKNOWN of the lowest possible numerical value, how can an UNKNOWN value be handled as a value when it is UNKNOWN?
December 18, 2022 at 4:37 pm
Maybe I am off-base here, but this really isn't about how Microsoft handles NULLs, rather it is more simply a logic problem.
When using IN - match on ANY item in the list which translates to OR condition.
When using NOT IN - does not match ALL items which translates to AND condition.
For an OR condition, the NULL doesn't matter and we get a result if the value compared is in the list (unless the value is NULL since NULL <> NULL). For the AND condition - nothing will ever match a NULL so it fails for any value being compared to the list.
Another method of writing the NOT IN:
SELECT A1.ID1 FROM @A1 A1
WHERE A1.ID1 <> ALL (SELECT A2.ID2 FROM @A2 A2);
I may be wrong - but for me this isn't an implementation issue or something that MS got 'wrong'.
Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”
― Charles R. Swindoll
How to post questions to get better answers faster
Managing Transaction Logs
December 18, 2022 at 9:28 pm
For me, it's very much something that MS got wrong. They're even trying to make it so that the NULL settings are not optional and Front-Enders are ready to march on Redmond because of it.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2022 at 10:09 pm
@JeffreyWilliams Yes, the query behaves as if one had written the AND condition you mention (see the last part of the original post, where I made the same observation), but "NOT IN" could also be evaluated by doing the "IN" condition first and then negating the result, and the question is then why isn't that the preferred way? notice that per Jeff Modem's answer, the actual query used by SQL Server is very different.
December 18, 2022 at 10:13 pm
@Eirikur That works, of course, but SQL Server won't be able to use any index for that query.
December 19, 2022 at 7:25 am
You can always use EXCEPT
SELECT ID1 FROM A1
EXCEPT
SELECT ID2 FROM A2
December 19, 2022 at 2:06 pm
@Eirikur That works, of course, but SQL Server won't be able to use any index for that query.
Using indices is not necessarily the same as having better performance, this is a fine example of that exception.
😎
I'll rig up a test harness for demonstration and post it here 😉
December 19, 2022 at 4:33 pm
Sam-263310 wrote:@Eirikur That works, of course, but SQL Server won't be able to use any index for that query.
Using indices is not necessarily the same as having better performance, this is a fine example of that exception. 😎 I'll rig up a test harness for demonstration and post it here 😉
Ah.. while I totally understand and appreciate what you're saying there, be careful. Just like Knuth's parable about "pre-optimization", a lot of people will take something like that and run in the totally wrong direction with it. I'm very much looking forward to you demo.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 19, 2022 at 4:41 pm
Just putting the money where the mouth is 😉
😎
Test result sets:
-----------------------------------------------------------------------------------------------------------------
With non-clustered indices on both columns
-----------------------------------------------------------------------------------------------------------------
T_TXT CPU IDLE IO DURATION SAMPLE_SIZE
-------------------- ---------------------- ---------------------- ---------------------- ----------- -----------
ISNULL 62500 437500 0 78117 500000
EXCEPT 62500 687500 0 93644 500000
NOT IN 125000 1125000 0 140633 500000
DRY RUN 281250 2468750 0 343712 500000
CREATE INDICES 750000 2968750 31250 470331 500000
CREATE TEST SET 687500 5812500 0 830595 500000
-----------------------------------------------------------------------------------------------------------------
ISNULL 125000 1125000 0 141133 1000000
EXCEPT 125000 1125000 0 187602 1000000
NOT IN 281250 2218750 0 297118 1000000
DRY RUN 562500 4437500 0 641074 1000000
CREATE INDICES 1500000 5250000 0 876663 1000000
CREATE TEST SET 1437500 11562500 0 1657997 1000000
-----------------------------------------------------------------------------------------------------------------
ISNULL 562500 5187500 0 718777 5000000
EXCEPT 812500 6187500 0 907878 5000000
NOT IN 1437500 10562500 0 1516742 5000000
CREATE INDICES 7843750 10843750 62500 2384447 5000000
DRY RUN 3312500 20156250 31250 3038696 5000000
CREATE TEST SET 7343750 59468750 187500 8578661 5000000
-----------------------------------------------------------------------------------------------------------------
ISNULL 1468750 10781250 0 1579692 10000000
EXCEPT 1718750 13031250 0 1861461 10000000
NOT IN 2750000 20750000 0 3033735 10000000
CREATE INDICES 15843750 18125000 281250 4361081 10000000
DRY RUN 6593750 40437500 218750 6041686 10000000
CREATE TEST SET 14625000 118687500 187500 17113108 10000000
-----------------------------------------------------------------------------------------------------------------
Without non-clustered indices on both columns
-----------------------------------------------------------------------------------------------------------------
T_TXT CPU IDLE IO DURATION SAMPLE_SIZE
-------------------- ---------------------- ---------------------- ---------------------- ----------- -----------
ISNULL 31250 31250 0 15723 50000
EXCEPT 31250 218750 0 31422 50000
CREATE TEST SET 62500 687500 0 93818 50000
NOT IN 187500 1562500 0 234265 50000
DRY RUN 218750 2281250 0 312655 50000
-----------------------------------------------------------------------------------------------------------------
ISNULL 31250 218750 0 31449 100000
EXCEPT 31250 468750 0 62694 100000
CREATE TEST SET 187500 1062500 0 172295 100000
NOT IN 625000 4375000 0 642562 100000
DRY RUN 718750 5531250 0 799844 100000
-----------------------------------------------------------------------------------------------------------------
ISNULL 93750 656250 0 93739 500000
EXCEPT 93750 656250 0 109749 500000
CREATE TEST SET 750000 6000000 0 862358 500000
NOT IN 10218750 76000000 31250 11033902 500000
DRY RUN 10812500 80156250 31250 11676433 500000
-----------------------------------------------------------------------------------------------------------------
ISNULL 156250 1343750 0 171763 1000000
EXCEPT 218750 1281250 0 203106 1000000
CREATE TEST SET 1406250 11843750 0 1704699 1000000
NOT IN 38343750 286093750 62500 41519872 1000000
DRY RUN 39468750 289968750 62500 42194149 1000000
-----------------------------------------------------------------------------------------------------------------
ISNULL 812500 6406250 31250 923459 5000000
EXCEPT 1218750 9031250 0 1314141 5000000
CREATE TEST SET 7343750 58593750 62500 8431584 5000000
NOT IN 596500000 4462312500 437500 647572582 5000000
DRY RUN 599437500 4476343750 468750 649767308 5000000
-----------------------------------------------------------------------------------------------------------------
ISNULL 2375000 17625000 0 2563753 10000000
EXCEPT 2437500 18531250 31250 2690276 10000000
CREATE TEST SET 17750000 139031250 218750 20103493 10000000
NOT IN 1215656250 9075781250 812500 1317398739 10000000
DRY RUN 1220781250 9103218750 1000000 1321598135 10000000
-----------------------------------------------------------------------------------------------------------------
And the test harness:
USE TEEST;
GO
SET NOCOUNT ON;
GO
DECLARE @SAMPLE_SIZE INT = 10000000;
DECLARE @NULL_SEED_1 INT = 5;
DECLARE @NULL_SEED_2 INT = 7;
DECLARE @INT_BUCKET1 INT = 0;
DECLARE @INT_BUCKET2 INT = 0;
DECLARE @TIMER TABLE (
T_TXT VARCHAR(20) NOT NULL
,T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME())
,T_CPU FLOAT NOT NULL DEFAULT (@@CPU_BUSY * CONVERT(FLOAT,@@TIMETICKS,0))
,T_IDLE FLOAT NOT NULL DEFAULT (@@IDLE * CONVERT(FLOAT,@@TIMETICKS,0))
,T_IO FLOAT NOT NULL DEFAULT (@@IO_BUSY * CONVERT(FLOAT,@@TIMETICKS,0))
);
--/*
IF OBJECT_ID('dbo.TBL_TEST_NOTIN_1') IS NOT NULL DROP TABLE dbo.TBL_TEST_NOTIN_1;
CREATE TABLE dbo.TBL_TEST_NOTIN_1
(
T_NUM INT NULL
)
;
IF OBJECT_ID('dbo.TBL_TEST_NOTIN_2') IS NOT NULL DROP TABLE dbo.TBL_TEST_NOTIN_2;
CREATE TABLE dbo.TBL_TEST_NOTIN_2
(
T_NUM INT NULL
)
;
INSERT INTO @TIMER(T_TXT) VALUES ('CREATE TEST SET');
;WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE)
ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)
INSERT INTO dbo.TBL_TEST_NOTIN_1 WITH (TABLOCKX) (T_NUM)
SELECT
CASE
WHEN NM.N % @NULL_SEED_1 > 0 THEN NM.N
ELSE NULL
END
FROM NUMS NM
;
;WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE)
ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)
INSERT INTO dbo.TBL_TEST_NOTIN_2 WITH (TABLOCKX) (T_NUM)
SELECT
CASE
WHEN NM.N % @NULL_SEED_2 > 0 THEN NM.N
ELSE NULL
END
FROM NUMS NM
;
INSERT INTO @TIMER(T_TXT) VALUES ('CREATE TEST SET');
INSERT INTO @TIMER(T_TXT) VALUES ('CREATE INDICES');
CREATE NONCLUSTERED INDEX NCLIDX_DBO_TBL_TEST_NOTIN_1_T_NUM ON dbo.TBL_TEST_NOTIN_1 (T_NUM)
CREATE NONCLUSTERED INDEX NCLIDX_DBO_TBL_TEST_NOTIN_2_T_NUM ON dbo.TBL_TEST_NOTIN_2 (T_NUM)
INSERT INTO @TIMER(T_TXT) VALUES ('CREATE INDICES');
-- */
INSERT INTO @TIMER(T_TXT) VALUES ('DRY RUN');
SELECT
@INT_BUCKET1 = A1.T_NUM
,@INT_BUCKET2 = A2.T_NUM
FROM dbo.TBL_TEST_NOTIN_1 A1
LEFT OUTER JOIN dbo.TBL_TEST_NOTIN_2 A2
ON A1.T_NUM = A2.T_NUM;
INSERT INTO @TIMER(T_TXT) VALUES ('DRY RUN');
INSERT INTO @TIMER(T_TXT) VALUES ('NOT IN');
SELECT
@INT_BUCKET1 = A1.T_NUM
FROM dbo.TBL_TEST_NOTIN_1 A1
WHERE A1.T_NUM NOT IN (SELECT A2.T_NUM FROM dbo.TBL_TEST_NOTIN_2 A2 WHERE A2.T_NUM IS NOT NULL)
OPTION (MAXDOP 1);
INSERT INTO @TIMER(T_TXT) VALUES ('NOT IN');
INSERT INTO @TIMER(T_TXT) VALUES ('EXCEPT');
;WITH EXCEPT_SET(T_NUM) AS
(
SELECT
A1.T_NUM
FROM dbo.TBL_TEST_NOTIN_1 A1
EXCEPT
SELECT A2.T_NUM FROM dbo.TBL_TEST_NOTIN_2 A2
)
SELECT
@INT_BUCKET1 = ES.T_NUM
FROM EXCEPT_SET ES
OPTION (MAXDOP 1);
INSERT INTO @TIMER(T_TXT) VALUES ('EXCEPT');
INSERT INTO @TIMER(T_TXT) VALUES ('ISNULL');
SELECT
@INT_BUCKET1 = A1.T_NUM
FROM dbo.TBL_TEST_NOTIN_1 A1
WHERE ISNULL(A1.T_NUM,-1) NOT IN
(
SELECT ISNULL(A2.T_NUM,-2) FROM dbo.TBL_TEST_NOTIN_2 A2
)
AND A1.T_NUM IS NOT NULL
OPTION (MAXDOP 1);
INSERT INTO @TIMER(T_TXT) VALUES ('ISNULL');
SELECT
T.T_TXT
,MAX(T.T_CPU) - MIN(T.T_CPU) AS CPU
,MAX(T.T_IDLE) - MIN(T.T_IDLE) AS IDLE
,MAX(T.T_IO) - MIN(T.T_IO) AS IO
,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION
,@SAMPLE_SIZE AS SAMPLE_SIZE
FROM @TIMER T
GROUP BY T.T_TXT
ORDER BY DURATION ASC;
Please play around with the tests, add your tweaks and have fun 😉
Edit: Had to correct code pasting errors.
December 19, 2022 at 4:58 pm
Eirikur Eiriksson wrote:Sam-263310 wrote:@Eirikur That works, of course, but SQL Server won't be able to use any index for that query.
Using indices is not necessarily the same as having better performance, this is a fine example of that exception. 😎 I'll rig up a test harness for demonstration and post it here 😉
Ah.. while I totally understand and appreciate what you're saying there, be careful. Just like Knuth's parable about "pre-optimization", a lot of people will take something like that and run in the totally wrong direction with it. I'm very much looking forward to you demo.
Now when are we going to meet up so you can buy me a burger and I can show you the demo 😉
😎
Viewing 15 posts - 1 through 15 (of 22 total)
You must be logged in to reply to this topic. Login to reply