The Logic, Mathematics, and Utility of NULLs

  • mtillman-921105 (8/20/2010)


    jcrawf02 (8/20/2010)


    mtillman-921105 (8/20/2010)


    jcrawf02 (8/20/2010)


    I guess that's a good question, phrased a different way, WHY would that be worthwhile?

    I suppose that I'm answering my own question here but...

    It would be for removing the ambiguity and in some cases, preventing errors.

    Also, consider this: "N/A" would be useful when the information for that column doesn't apply and the front end could even display that value "as is" without code interpreting the answer.

    Ok, can you give a real world example where a user would need to know expressly whether something was unknown, not entered, or not applicable? Not following what problem we solve with this?

    There is an example above where "N/A" would come in handy. What if there is a table about cars and one of the numeric columns is for mile per gallon, but along comes an electric car. A regular old NULL would do, but an "N/A" would show exactly the right information.

    But in that situation, nobody is even going to be looking at that column, so it's irrelevant what we put in there. Can you think of another situation? I can't, which is why I'm asking, not trying to be contentious. I see that there are distinctions between the three notions, I just am trying to figure out when it would become important.

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • jcrawf02 (8/20/2010)


    mtillman-921105 (8/20/2010)


    jcrawf02 (8/20/2010)


    mtillman-921105 (8/20/2010)


    jcrawf02 (8/20/2010)


    I guess that's a good question, phrased a different way, WHY would that be worthwhile?

    I suppose that I'm answering my own question here but...

    It would be for removing the ambiguity and in some cases, preventing errors.

    Also, consider this: "N/A" would be useful when the information for that column doesn't apply and the front end could even display that value "as is" without code interpreting the answer.

    Ok, can you give a real world example where a user would need to know expressly whether something was unknown, not entered, or not applicable? Not following what problem we solve with this?

    There is an example above where "N/A" would come in handy. What if there is a table about cars and one of the numeric columns is for mile per gallon, but along comes an electric car. A regular old NULL would do, but an "N/A" would show exactly the right information.

    But in that situation, nobody is even going to be looking at that column, so it's irrelevant what we put in there. Can you think of another situation? I can't, which is why I'm asking, not trying to be contentious. I see that there are distinctions between the three notions, I just am trying to figure out when it would become important.

    There are some who would want to see an "N/A" for MPG on a form with a list of cars, so I believe that's a good example. There are easily other examples - just think of anytime you've seen "N/A" used.

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • Simple example:

    DISK ROTATIONAL SPEED

    A 15000

    B UNKNOWN

    C N/A

    Disk A is a 15k RPM SAS drive.

    Disk B... well, someone better take a look and see what it is.

    Disk C is an SSD, it doesn't have any rotation, unless it's caught up in a tornado.

    Also, as far as "no-one is going to look at the not applicable column" - absolutely untrue, as old reports from before the "not applicable" or even before the "unknown, cannot be known" or before the "unknown, but find out" or before the "unknown unknown" may look at it no matter what.

  • One last comment, and then I'll leave it alone.

    I think it's over-theorizing, since the user won't care about the rotations on the SSD. Users only look at data in order to make a decision about taking an action. If there's no action to be taken by the data presented, then it's just 'ooh, look at the pretty colors'.

    Therefore, you might as well include how many bunnies can be stored on your disk, or in the car, or how many square feet are in the interior surface area of the gas tank.

    Making three distinctions between data that just plain isn't useful, no matter what you call it, doesn't actually help the user.

    If they want to see the ones where the data hasn't been entered yet, and that's the report they want, they'll ask for that, but I'll bet they'll only want to see records that it APPLIES to, and they'll already have a criteria for how you should find that, which you would just include in the report design, no need to pull everything in the database that has a value of '-'.

    There, I'm done. 😉

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • jcrawf02 (8/20/2010)


    One last comment, and then I'll leave it alone.

    I think it's over-theorizing, since the user won't care about the rotations on the SSD. Users only look at data in order to make a decision about taking an action. If there's no action to be taken by the data presented, then it's just 'ooh, look at the pretty colors'.

    Therefore, you might as well include how many bunnies can be stored on your disk, or in the car, or how many square feet are in the interior surface area of the gas tank.

    Making three distinctions between data that just plain isn't useful, no matter what you call it, doesn't actually help the user.

    If they want to see the ones where the data hasn't been entered yet, and that's the report they want, they'll ask for that, but I'll bet they'll only want to see records that it APPLIES to, and they'll already have a criteria for how you should find that, which you would just include in the report design, no need to pull everything in the database that has a value of '-'.

    There, I'm done. 😉

    Thanks jcrawf02, you have made some valid points there. There are a lot of cases where different NULL types are not going to make any difference. I question whether they would never make a difference, but maybe creating more NULL types just isn't worthwhile. (I'll update the main post to reflect this position when I get more time.)

    Incidentally, I'm trying to approach this as an experiment: I'm not trying to set myself up as the one who has definitive answers and I am leaving my ego out of this. There are lot of people smarter than I am and it doesn't matter in this context if we can gain from everyone's input. Please respectfully argue with any points made here. You don't have to be afraid that your comments aren't welcome if you disagree. :satisfied:

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • Nadrek (8/20/2010)


    Simple example:

    DISK ROTATIONAL SPEED

    A 15000

    B UNKNOWN

    C N/A

    Disk A is a 15k RPM SAS drive.

    Disk B... well, someone better take a look and see what it is.

    Disk C is an SSD, it doesn't have any rotation, unless it's caught up in a tornado.

    But since it doesn't rotate, it's rotational speed is 0 radians/sec or 0 rpm or 0 whatever unit you are using. So the correct value is 0, not N/A.

    Besides, if I'm a user of discs I don't care about the rotational speed, what I care about is latency and transfer rate. So maybe the table is

    DISK TYPICAL LATENCY

    A 0.005

    B UNKNOWN

    C 0.000000017

    for the same 3 discs (actually in the real world I might want to see mean and median latencies for several different access patterns, instead of a single "typical" latency for a single access pattern; and I don't think anyone is yet offering a solid state drive that tightly coupled, but that makes no difference to the point).

    Also, as far as "no-one is going to look at the not applicable column" - absolutely untrue, as old reports from before the "not applicable" or even before the "unknown, cannot be known" or before the "unknown, but find out" or before the "unknown unknown" may look at it no matter what.

    I think probably no-one will look at the not applicable column in this particular case (assuming someone incorrectly enters inapplicable instead of 0 so that there is an inapplicable column) but it has to be in the report so that all rows have the same column set so it's irrelevant whether anyone will look at it or not.

    Tom

  • Tom.Thomson (8/21/2010)


    But since it doesn't rotate, it's rotational speed is 0 radians/sec or 0 rpm or 0 whatever unit you are using. So the correct value is 0, not N/A.

    I'm not sure I would agree with that. A value of 0 would indicate to me that the disk can rotate, but wasn't at the point the data was collected. Whereas N/A would suggest to me that it never does (except in the context of planetary rotation, or a tornado 🙂 )

    Going back to the car scenario, maybe we could have two columns (miles, and units) with units recorded as "per gallon" for a petrol car, and "per kilowatt" for an electric one. Perhaps the issue is less in how you define NULL than how you define the schema.

    Chris

  • In electronics we have a similar T/F/N situation which we refer to as tri-states or tri-stating. True is a voltage, False is a ground (0V), and the tri-state is the state “neither true nor false” - the “high impedance” state. Here a connected is made but it terminates at an “infinitely dense” point – an undefined value. There is also another state called “floating”. A floating state is unconnected and may have any value (including the undefined) at any point in time (quantum like). When I see discussions about nulls I usually view the discussion as is it an “undefined yet terminated” state (NULL is a non-definable value (cannot be “no value”)) or is it a “could be anything at any time” state (NULL could be unknown at this point but may change later).

    For nulls I like the negative type of definition – not what it is (which is hard to define) but what it isn’t “Not true and not false” - which is easier (or a copout:)). My personal preference is to avoid nulls when ever possible and comedown more on the terminated side. Usually a “not null” with a default works well if you can have a default that can mean a “nothing” in the business domain; often a zero or empty string. The tough case though for defaults is dates. There is no real date value that means nothing.

  • jcrawf02 (8/20/2010)


    One last comment, and then I'll leave it alone.

    I think it's over-theorizing, since the user won't care about the rotations on the SSD. Users only look at data in order to make a decision about taking an action. If there's no action to be taken by the data presented, then it's just 'ooh, look at the pretty colors'.

    Therefore, you might as well include how many bunnies can be stored on your disk, or in the car, or how many square feet are in the interior surface area of the gas tank.

    Making three distinctions between data that just plain isn't useful, no matter what you call it, doesn't actually help the user.

    If they want to see the ones where the data hasn't been entered yet, and that's the report they want, they'll ask for that, but I'll bet they'll only want to see records that it APPLIES to, and they'll already have a criteria for how you should find that, which you would just include in the report design, no need to pull everything in the database that has a value of '-'.

    There, I'm done. 😉

    jcrawf02, the main post has been updated to reflect your position. Let me know if you think it should be improved somehow. Thanks for your input.

    I'd also like to hear from anyone who doesn't like NULLs at all. Maybe I'll bring in some of the points from the other thread where that position was taken.

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • I think you stated what I was thinking well, thanks!

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • mtillman-921105 (8/24/2010)


    jcrawf02, the main post has been updated to reflect your position.

    That made me look again at your main post, and I guess either I missed something before OR the wording has changed in an area which I think is important.

    The question I'm interested in here is

    mtillman-921105 (8/17/2010)


    How are they used in Logic?

    I have an answer (of sorts) to this, which is: so far as I know, they were never used in logic (as opposed to various calculi which were not logic) until Codd proposed a logic with 3rd and 4th truth values i (inapplicable) and a (possibly applicable, but no data available) to handle his system with 2 NULLs. Codd made the mistake of requiring that any set of truth values represented in the database should be the set represented in the logic component of the variant of relational algebra used in the database (he states this explicitly in the 1979 paper; it is a mistake because it would make it impossible to represent in data, for example, both Codd's 4-valued logic and Belnap's 4-valued relevance logic, since the 3rd and 4th truth values in Codd are completely unrelated to the 3rd and 4th values in Belnap, but we may want to store data, including truth values, about systems using both those logics) and then appears to have compounded this mistake by treating his two NULLs as if they were truth values rather than markers indicating absent data and (in the case of one of them, but not the other) the reason for its absence, and that seems to be how he ended up with a logic with 4 truth values. In reality the number of NULLs and the number of truth values can be unrelated. It can be useful to reason about special properties of some of the null values - but it's usually going to be easier to use a qualifier (type of null) column for each nullable column where qualification is meaningful, and do appropriate things with those qualifiers which will not be the same things for all domains, and for which there can be no single universal logic and building this special logic into the relational algebra means having different algebras for diffefrent domains. It may be that "not applicable" is such a frequent case (often due to bad schema design, but sometimes for good reasons) that it's worth having a separate NULL for it as Codd proposed, but there's no point in having a matching truth value since what sort of truth table one needs to deal with it depends on what the application requires be done when inmapplicable attributes are encountered.

    It's clear that Codd's original NULL proposal (a single NULL and a 3-valued logic) had a clear separation between the third truth value UNKNOWN and the NULL. It's probably a pity that this clear separation wasn't maintained when he proposed instead having two different NULLs - two NULLs and a intuitionistic/constructive logic is a workable system, and avoids the issue of listing all the truth values, and two nulls and a 3 valued logic (with UNKNOWN as the third value) is workable too, with logic outside the algebra (ie programmed using the algebra, not included in the algebra) for any handling of the distinction between the two NULLs.

    I seem to have drifted a bit into another question, not in your list: what logic is appropraite for the relational algebra to use if it recognises various types of NULL?

    Tom

  • I think I follow Tom, though I'm not sure.

    I do agree that a qualifier of some sort, different for each domain, may be what is needed. In practicality, the issue is how do you document this and ensure the knowledge is spread. The lack of an easy way to document a column (extended properties haven't worked well for me, a bit cumbersome), is the main issue.

  • Tom, I haven't researched this as much as I would like and will say more later, but even Charles Peirce, a highly respected American Pragmatist, was experimenting with three valued logic long before Codd, around the year 1910. See for example, http://plato.stanford.edu/entries/peirce-logic/.

    Of course this was not done in the context of RDBMS, but the point I'm trying to make is that the logic of NULLs, understood as unknown values, is not entirely new.

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • Steve Jones - Editor (8/25/2010)


    I think I follow Tom, though I'm not sure.

    I do agree that a qualifier of some sort, different for each domain, may be what is needed. In practicality, the issue is how do you document this and ensure the knowledge is spread. The lack of an easy way to document a column (extended properties haven't worked well for me, a bit cumbersome), is the main issue.

    I've often wanted some kind of system for documenting tables too. Of course you can have a data dictionary, but all too often there's not enough time or interest in maintaining one - especially for every column of every table in a database. This is important - I'll add it in the form of a question in the main post.

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • mtillman-921105 (8/25/2010)


    Tom, I haven't researched this as much as I would like and will say more later, but even Charles Pierce, a highly respected American Pragmatist, was experimenting with three valued logic long before Codd, around the year 1910.

    Yes, Peirce worked with a 3-valued logic - truth values T (true), F (false), and V (unknown) - there's no NULL in there, because unknown as a truth value is not NULL. Peirce may have been the inspiration for Codd's use of 3VL in Codds original NULL proposal, but certainly not for two distinct NULL truth values which happened to match two distinct NULLs, as in Codd's later proposal.

    Of course Peirce wasn't first to use a multi-valued logic (Aristotle's logic, in at least one of his formulations, had three truth values: "necessarily true", "necessarily false", and "contingent"). Neither of course was Codd, and I didn't suggest he was - what I said was that he was the first to use NULLs as some of his truth values, which is a very different thing. Peirce may have been the first to use 3 valued logic in modern times (for logic, "modern times" is from 1847 on - that was the year in which De Morgan and Boole first published treatises on logic; and of course De Morgan also invented/discovered relational algebra, so should be better known to RDBMS people than he is); but Lukasiewicz published a paper on Aristotle's logic in 1910, and that may well have discussed a 3-valued logic in which case it's a toss-up whether he or Peirce was first. (Lukasiewicz was of course one of the most famous 20th century contributors to the theories multi-valued logics, and the whole subject of MV-Algebras was developed to provide a semantics for his logics, and we might not have acquired a formal theory of fuzzy logics without his work; but that was all a bit later.)

    Tom

Viewing 15 posts - 31 through 45 (of 67 total)

You must be logged in to reply to this topic. Login to reply