Four Rules for NULLs

  • Mike C (3/28/2008)


    I'd agree that this is probably a common usage, and as long as you're comfortable with using NULL to represent multiple states that might be adequate. The problems come when you want to run a data quality report: for instance, you need to know how many employer id's weren't entered in the system because the employer id was not known at the time? How do you separate that from the "unemployed" entries using the same NULL to represent two statuses? And if you decide to expand the scope of your data collection to include the reason for unemployment - "unemployed: quit", "unemployed: fired", "unemployed: laid off", etc. All of these states would be stored using NULL as well? Seems like there should be a better way to represent these types of attributes without overloading NULL.

    I guess that would be stemming from a misuse of NULL or misunderstanding thereof, which usually comes from trying to derive or assign meaning to NULL. Meaning - unknown really does mean "unknown", not "unknown, but we think it means X". Any projection, assumption, derivation, or logical conclusion as to the reason why will land you in hot water, IMO.

    As to your example quoted: I fully understand what you're talking about, but how would tracking WHY employerID happens to be NULL change the fact that employerID should in fact in all of those cases be NULL, unless you have the luxury to assign a value to be the "invalid employer" field. Surely you're not recommending using employerID to track 2 different meanings (who the employer is, or WHY there is no employer)? That's a rather large violation of 3NF in its own right.

    In my mind - IF there is sufficient reason to need to know why that field was NULL, then it needs to be tracked appropriately, in its own column (reasons why EmployerID might be null).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • The "Never Use NULL's" rule only holds true in OLTP. BI and Reporting almost always need to have some way of distinguishing: unknown, not applicable, empty, and zero.

    Take Date of Birth. If I want to count the number of customers under the age of 25, how do I do that? What do I do with customer records that have no value. Do I stick some arbitrary value into DOB field that I can use intelligently to distinguish it from other records?

    What about medical applications? You absolutely need to distinguish between unknown, not applicable, empty, and zero. A subject may not have taken the test or the value may not be available. You certainly do not want to use some arbitrary numerical value because that means something. Also, if you make the column string and use some value such as "na" or "unk" or other, your ability to use query tools (other than hand coded standard reports) is shot.

    Bill

  • Thanks for a very informative and thought-provoking article. Shows me I still have lots to learn about SQL.

    I have a point of unclarity or inaccuracy to pound on so that I can be sure I understood how NULL is supposed to work. Please don't take it as criticism but as investigation into the facts. The article says,

    NOTE: It has been widely misstated (even in SQL Server Books Online) that "the result of a comparison is false if one of the operands is NULL." In fact, the ANSI-92 SQL Standard states that if one of the operands is NULL, the result of the comparison is "UNKNOWN".

    ...

    NULLs are not equal to other NULLs.

    ...

    Now we see that the result of the comparison is neither True nor False. It is the third value in our three-valued logic, Unknown. Try this statement to demonstrate that NULL is not equal to other NULLs:

    ...

    Rule #2: In ANSI SQL, NULL is not equal to anything, even other NULLs! Comparisons with NULL always result in UNKNOWN.

    There seems to me to be a contradiction here. Yes, it's getting down to semantic quibbles, but as we see with NULL, semantic quibbles can be very significant!

    You point out misstatements in e.g. SQL Server Books Online that say that the result of a comparison to NULL is false. Yet several times in the article, you say that NULLs "are not equal" to anything, even other NULLs. To my understanding, this statement is inconsistent with the statement that comparing a NULL with something does not yield FALSE. In other words,

    NULLs are not equal to anything

    seems equivalent to

    (NULL not-equals something) yields TRUE

    when in fact, we know that

    (NULL not-equals something) yields Unknown

    Rather than saying that

    NULLs "are not equal" to anything,

    wouldn't it be more accurate to say that

    it is Unknown whether NULL is equal to anything

    ? or

    It is not TRUE that NULL is equal to anything

    ?

    Unfortunately, if you say

    It is not true that NULL is equal to anything

    ... you then have the problem that "not true" is normally understood to mean "false", but in 3-valued logic it could mean "false or unknown"...? But in the latter case, wouldn't you say

    It is Unknown whether it is true that NULL is equal to anything

    or in natural English you could try

    NULLs are not necessarily equal to anything

    ... but that may be even less intuitive!

    Aaaaaaaargghhhhh... :crazy:

    I think this point of confusion is due to two-valued logic in English ...

    Thanks for lending your time and expertise.

  • Matt Miller (3/28/2008)


    Mike C (3/28/2008)


    I'd agree that this is probably a common usage, and as long as you're comfortable with using NULL to represent multiple states that might be adequate. The problems come when you want to run a data quality report: for instance, you need to know how many employer id's weren't entered in the system because the employer id was not known at the time? How do you separate that from the "unemployed" entries using the same NULL to represent two statuses? And if you decide to expand the scope of your data collection to include the reason for unemployment - "unemployed: quit", "unemployed: fired", "unemployed: laid off", etc. All of these states would be stored using NULL as well? Seems like there should be a better way to represent these types of attributes without overloading NULL.

    I guess that would be stemming from a misuse of NULL or misunderstanding thereof, which usually comes from trying to derive or assign meaning to NULL. Meaning - unkwown really does mean "unknown", not "unknown, but we think it means X". Any projection, assumption, derivation, or logical conclusion as to the reason why will land you in hot water, IMO.

    Exactly.

    As to your example quoted: I fully understand what you're talking about, but how would tracking WHY employerID happens to be NULL change the fact that employerID should in fact in all of those cases be NULL, unless you have the luxury to assign a value to be the "invalid employer" field. Surely you're not recommending using employerID to track 2 different meanings (who the employer is, or WHY there is no employer)? That's a rather large violation of 3NF in its own right.

    And that is exactly what the poster is proposing.

    In my mind - IF there is sufficient reason to need to know why that field was NULL, then it needs to be tracked appropriately, in its own column (reasons why EmployerID might be null).

    Exactly.

  • Lars - the issue you seem to be hammering on stems from treating NULL as if it were a value. NULL is technically not a value, so any comparison of a value to a non-value returns an unknown value. NULL is a "state" not a value.

    It's kind of like comparing apples and philosophy. There's no valid basis for comparison (I know we could probably come up with one, but work with me here), so any operation designed to check on value equality or inequality should fail altogether before the comparison even happens (therefore not a TRUE or FALSE), since there's no context in which to initiate, let alone perform, the comparison.

    As a result - NULL is neither equal nor not equal to anything, including itself.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Lars Huttar (3/28/2008)


    Thanks for a very informative and thought-provoking article. Shows me I still have lots to learn about SQL.

    I have a point of unclarity or inaccuracy to pound on so that I can be sure I understood how NULL is supposed to work. Please don't take it as criticism but as investigation into the facts. The article says,

    NOTE: It has been widely misstated (even in SQL Server Books Online) that "the result of a comparison is false if one of the operands is NULL." In fact, the ANSI-92 SQL Standard states that if one of the operands is NULL, the result of the comparison is "UNKNOWN".

    ...

    NULLs are not equal to other NULLs.

    ...

    Now we see that the result of the comparison is neither True nor False. It is the third value in our three-valued logic, Unknown. Try this statement to demonstrate that NULL is not equal to other NULLs:

    ...

    Rule #2: In ANSI SQL, NULL is not equal to anything, even other NULLs! Comparisons with NULL always result in UNKNOWN.

    There seems to me to be a contradiction here. Yes, it's getting down to semantic quibbles, but as we see with NULL, semantic quibbles can be very significant!

    You point out misstatements in e.g. SQL Server Books Online that say that the result of a comparison to NULL is false. Yet several times in the article, you say that NULLs "are not equal" to anything, even other NULLs. To my understanding, this statement is inconsistent with the statement that comparing a NULL with something does not yield FALSE. In other words,

    NULLs are not equal to anything

    seems equivalent to

    (NULL not-equals something) yields TRUE

    when in fact, we know that

    (NULL not-equals something) yields Unknown

    The assumption that any non-True result is False, and vice versa, works for two-valued Boolean logic but doesn't apply to SQL 3VL. The predicate (NULL = 3) is not True, but that does not automatically make it False in 3VL.

  • Matt Miller (3/28/2008)


    Lars - the issue you seem to be hammering on stems from treating NULL as if it were a value. NULL is technically not a value, so any comparison of a value to a non-value returns an unknown value. NULL is a "state" not a value.

    It's kind of like comparing apples and philosophy. There's no valid basis for comparison (I know we could probably come up with one, but work with me here), so any operation designed to check on value equality or inequality should fail altogether before the comparison even happens (therefore not a TRUE or FALSE), since there's no context in which to initiate, let alone perform, the comparison.

    Understood and agreed.

    As a result - NULL is neither equal nor not equal to anything, including itself.

    Ah - I like your last statement (let's call it S3) better than "NULL is not equal to anything" (let's call it S1)... I think. At least it avoids the pitfall of seeming to say that "NULL = something" should yield FALSE.

    On the other hand, taking 3-valued logic to its logical conclusion, S3 just postpones the inevitable one step. If we represent S3 as

    NOT(NULL = anything) AND NOT(NULL <> anything)

    then S3 yields Unknown rather than TRUE.

    It seems that if we take 3-valued logic seriously, we need to say

    it is unknown whether (NULL = anything) AND it is unknown whether (NULL <> anything)

    Does SQL have an IS UNKNOWN syntax, corresponding to IS NULL? (I'm going to run a check on what happens when I say IF ((NULL = NULL) IS NULL) THEN ...)

    Again, I think the problem is not in our understanding of the concepts, but in our attempt to encode them into English. Nevertheless expressing them incorrectly, or at least imprecisely, in English leads to others misunderstanding the concepts and producing incorrect designs and/or implementations.

  • Would I be willing to concede that the statement "NULL is not equal to any other value, or other NULLs" is incorrect? Absolutely if the opposite can be demonstrated: i.e., it can be shown that NULL is equal to at least one other value, or another NULL.

  • Lars Huttar (3/28/2008)


    On the other hand, taking 3-valued logic to its logical conclusion, S3 just postpones the inevitable one step. If we represent S3 as

    NOT(NULL = anything) AND NOT(NULL <> anything)

    then S3 yields Unknown rather than TRUE.

    It seems that if we take 3-valued logic seriously, we need to say

    it is unknown whether (NULL = anything) AND it is unknown whether (NULL <> anything)

    S3 would only be valid if this representation were ever valid. The better way to express it might be to state that equality is a relation that cannot be applied to NULLs. It can neither be equal or not equal, because those ultimately can't be expressed meaningfully, let alone evaluated. Doesn't matter how many layers get introduced - it's unknown. Which is why the "checking" syntax is expressed as "IS NULL" or "IS NOT NULL"; it's not the mathematical equality concept at all.

    Quite honestly - I would have thought conceptually/theoretically that 1=NULL should operate a bit like 1/0, meaning "whoa - I don't have a CLUE what THAT is". Of course - this is where theory stops being useful, and unknown just needs to mean unknown.

    Does SQL have an IS UNKNOWN syntax, corresponding to IS NULL? (I'm going to run a check on what happens when I say IF ((NULL = NULL) IS NULL) THEN ...)

    Syntax error from what I can tell... Sorry - boolean "math" doesn't fly in SQL Server, from what I know.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Lars Huttar (3/28/2008)


    Matt Miller (3/28/2008)


    Lars - the issue you seem to be hammering on stems from treating NULL as if it were a value. NULL is technically not a value, so any comparison of a value to a non-value returns an unknown value. NULL is a "state" not a value.

    It's kind of like comparing apples and philosophy. There's no valid basis for comparison (I know we could probably come up with one, but work with me here), so any operation designed to check on value equality or inequality should fail altogether before the comparison even happens (therefore not a TRUE or FALSE), since there's no context in which to initiate, let alone perform, the comparison.

    Understood and agreed.

    As a result - NULL is neither equal nor not equal to anything, including itself.

    Ah - I like your last statement (let's call it S3) better than "NULL is not equal to anything" (let's call it S1)... I think. At least it avoids the pitfall of seeming to say that "NULL = something" should yield FALSE.

    On the other hand, taking 3-valued logic to its logical conclusion, S3 just postpones the inevitable one step. If we represent S3 as

    NOT(NULL = anything) AND NOT(NULL <> anything)

    then S3 yields Unknown rather than TRUE.

    It seems that if we take 3-valued logic seriously, we need to say

    it is unknown whether (NULL = anything) AND it is unknown whether (NULL <> anything)

    Does SQL have an IS UNKNOWN syntax, corresponding to IS NULL? (I'm going to run a check on what happens when I say IF ((NULL = NULL) IS NULL) THEN ...)

    Again, I think the problem is not in our understanding of the concepts, but in our attempt to encode them into English. Nevertheless expressing them incorrectly, or at least imprecisely, in English leads to others misunderstanding the concepts and producing incorrect designs and/or implementations.

    There's no IS UNKNOWN syntax in T-SQL. The logical UNKNOWN result is not the same thing as NULL (although there was some discrepancy over this in the SQL:1999 standard with the Boolean data type where they are treated as logical equivalents, to prevent the introduction of four-valued logic into the system).

  • Mike C (3/28/2008)


    Lars Huttar (3/28/2008)


    In other words,

    NULLs are not equal to anything

    seems equivalent to

    (NULL not-equals something) yields TRUE

    when in fact, we know that

    (NULL not-equals something) yields Unknown

    The assumption that any non-True result is False, and vice versa, works for two-valued Boolean logic but doesn't apply to SQL 3VL. The predicate (NULL = 3) is not True, but that does not automatically make it False in 3VL.

    Your point is understood, but I'm not convinced that it's valid.

    According to the truth tables for 3VL in the article, when p is TRUE, NOT(p) is FALSE. So NOT(TRUE) is FALSE.

    But you say that one cannot assume that "non-True" or "not True" is False.

    Therefore you appear to be drawing a distinction between "NOT(TRUE)" and "not True" (= "non-True").

    I agree that such a distinction (however it may be phrased) is useful and necessary for 3VL, but what I'm saying is that I don't think it's been communicated clearly and accurately in the article. Do you see what I mean?

    "non-True" is a clearer phrasing than "not True", but "non-True" was not used in the article; and neither is unambiguous.

    I agree too that it's hard to find an accurate way to say this that is also intuitive to an English speaker. Maybe the best that can be done is to add a footnote explaining what is meant.

    Even when you say "the predicate (NULL = 3) is not True", others would have grounds to say you're contradicting the rules of ANSI NULL, unless you define what you mean by "is not". (Somebody call Bill Clinton!)

  • Just another point of view. I live in an Oracle, DB2, SQL Server world. Here is a companion article on Oracle's forums which came out about the same time as the aggregate NULL article. I am not sure it says anything new, but it is written with different examples and from the point of view of Oracle DB. A good companion read for this very good topic:

    http://www.oracle.com/technology/oramag/oracle/05-jul/o45sql.html

    Bill

  • Very nice article Michael on a topic where the details really count.

    I had never heard of the COALESCE function before. We use Informix as our primary database and it does not have a COALESCE function. Informix has a function NVL which works the same as ISNULL and COALESCE. It seems strange that IBM would not have provided an implementation for Informix yet. Do you happen to know if I'm missing something?

    --

    Gregg Walker

  • Bill Wimsatt (3/28/2008)


    Just another point of view. I live in an Oracle, DB2, SQL Server world. Here is a companion article on Oracle's forums which came out about the same time as the aggregate NULL article. I am not sure it says anything new, but it is written with different examples and from the point of view of Oracle DB. A good companion read for this very good topic:

    http://www.oracle.com/technology/oramag/oracle/05-jul/o45sql.html

    Bill

    Did Oracle ever fix their empty string --> NULL problem? From what I remember (back in the good old days of 8i) zero-length strings were treated as NULL in some situations on Oracle for some strange reason.

  • Mike C (2/27/2006)


    Both of which, by the way, are non-SARGable in WHERE clauses...

    ??? :blink: Maybe I'm reading this wrong, but WHERE NOT IN most certainly IS sargeble...

    --drop table jbmtest

    --===== Create and populate a 1000 row test table.

    -- Column "RowNum" has a range of 1 to 1000 unique numbers

    -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers

    -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

    -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

    -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

    -- for all rows.

    -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

    -- Jeff Moden

    SELECT TOP 1000

    RowNum = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12)

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1,

    Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

    --===== A table is not properly formed unless a Primary Key has been assigned

    ALTER TABLE dbo.JBMTest

    ADD CONSTRAINT PK_JBMTest_RowNum PRIMARY KEY CLUSTERED (RowNum)

    --===== Now, do a WHERE NOT IN and check the execution plan... INDEX SEEK!!!

    SELECT * FROM jbmtest WHERE RowNum NOT IN (1,2,3)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 91 through 105 (of 152 total)

You must be logged in to reply to this topic. Login to reply