February 25, 2008 at 8:28 am
Andrew (2/25/2008)
In the real-world the possible answers to a truth question T might indeed not be just "yes" or "no" but might be, say, one of "Yes", "No", "Don't Know yet", "Tell you tomorrow", "We will never know", "Not knowable".If you decide to model the real-world answer to the truth question T by holding the answer in a column of datatype BOOLEAN then you are arguably forced to introduce NULL (or something similar) to represent all the answers OTHER THAN "YES" and "NO". (Incidentally you have therefore lost all the other meanings as well so you now cannot ask the question "how many peope replied 'Tell you tomorrow' "... )
One might argue therefore that one has chosen to create a model that does NOT map to the real-world?
Or one might argue that this is a limitation of the data types available in the database?
Not everything is quantifiable at a given time, even arguably - the very reason you don't know something. In a lot of scenarios, the number of permutations as to why it's not known/not valid are too numerous to list, and ultimately aren't germane to the overall goal. Again - having a list of 20,000 possible reasons to track every possible permutation is phenomenal, but it's applicable to exactly zero "real-life" scenarios I can think of.
The fact that the data element isn't valid at this time is however. Also - what exactly is the benefit to come out of using NaN, or either of the inifinity (which aren't a value by definition) over a somewhat more "generic" Null? It would be rather easy to draw the conclusion that because NULL doesn't have a value, it can be of ANY type. You'd end up with exactly the same kinds of logical intricacies you already have with NULL (NaN is neither equal nor not equal to any value including itself). If you wish - you ALREADY have a "NaN" - we call it NULL.....
Like I said before - using some of the tricks you listed above should be limited to dealing with "important" items that are not known, and where the void being introduced is too large to be acceptable in the given business scenario. The overhead involved in tracking down every potential unknown is fruitless, wasteful, and quite honestly, academic. When something "becomes" too important to just be "NULL" (when the requirements change or the circumstances change), THEN you deal with it.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 25, 2008 at 9:03 am
Reductio ad absurdum ("20,000 possible combinations") does not hold here because there are plenty of cases where the real -world data should be modeled but either isn't or cannot (easily) be.
One issue is "How do you model the real-world data in a database". Or more precisely "How do you model the data real-world data - which you are interested in - in a database".
Another issue is "How can you tell the answer you get from the database is to the answer to the question you (thought) you asked?"
Consider a simple table with columns NAME, AGE.
Row 1 = "Fred", 50
Row 2 = "Jane", NULL
Now formulate SQL to answer the questions
"How many people are under 3 years old ?"
"How many people are over 50?"
"How many people are exactly 50?"
The answers will depend on exactly how you formulate the SQL (SQL has many more than one way of formulating any query and they are NOT necessarily all equivalent)...
... and on whether ANSI NULL is SET
... and on the flavor and version of SQL you are using
... and whether you used an outer join to another table to get the answer (in a more complex case)
... and of course the SQL ooptimisermight transform your formulation anyway
NOW that is a simple query. Consider a complex query joining many tables with many constraints, relationships, correlated sub-queries etc.
How confident are you that you can formulate a query to pprecisely answer a particular question?
How would you know if it was "wrong"?
(BTW testing is NOT adequate here because the ooptimizer may construct a new query plan - for many reason - thus invalidating any previous tests)
February 25, 2008 at 9:04 am
NULL means I have no #$%#$ idea what to put here (yet).
So for end dates it means we don't know what it is, but we don't have it entered in the system. So there's no end date as far as the system is concerned.
I allow NULLs and account for them in design. A null means "xx" in "yy" table. That meaning could include the fact that the people working on the system (clients, data-side) are introducing PEBKAC errors.
February 25, 2008 at 9:19 am
How confident are you that you can formulate a query to pprecisely answer a particular question?
How would you know if it was "wrong"?
It goes back to exactly what I said... how did "YOU" define it in the system? What does the design doc say it's supposed to mean? There's no black and white answer until someone has the requirement written down in the design doc and commented the code to match.
The rest is all just semantics on whether or not you actually believe null is useful or not and, in the absense of concrete requirements, one opinion is just as valid as another at this point.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 25, 2008 at 9:30 am
Andrew (2/25/2008)
Consider a simple table with columns NAME, AGE.Row 1 = "Fred", 50
Row 2 = "Jane", NULL
Now formulate SQL to answer the questions
"How many people are under 3 years old ?"
"How many people are over 50?"
"How many people are exactly 50?"
Assuming we went with your - solution, we'd now have one of the following answers:
Scenario1 :
Row 1 = "Fred", 50
Row 2 = "Jane", NaN
(Same issues as Null)
Scenario 2:
Row 1 = "Fred", 50
Row 2 = "Jane", -infinity
This will now give you issues with your less than 3 years old.
Scenario 3:
Row 1 = "Fred", 50
Row 2 = "Jane", +infinity
now you have a problem with your "over fifty" question.
Scenario 4:
Row 1 = "Fred", 50
Table "Not entered" or "Age Not known"
Row 2 = "Jane"
...and you STILL don't have any better an answer than you did before.
Like it or not - if the data is NOT present, then in most cases - it is NOT better to "pretend" like you had something. Data is not perfect, and unless you operate in a vaccuum with no inputs from the "real world", will never be.
I guess my challenge to you would be - in this particular example - what IS the right answer? refusing to take in Jane's info without an age? do you really think that fits unversally?
Is it better to know that there's an instance with missing information, or not to account for an instance altogether? Which one is a better representation of the real-world data you're trying to capture?
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 25, 2008 at 11:00 am
ACTUALLY I was hoping to get a poll of what people are actually doing to mitigate NULLs - how do I do that?
I am happy to discuss NULL as well - I know it is an old chestnut but I guess the fact that the "debate" continues shows how intractable an issue it appears to be.
Steve Jones wrote " null means "xx" in "yy" table."
Which implies that NULL can only have one meaning in yy table?
Relations (supposedly anyway) use relational algebra - which is based on first-order predicates (which can only be TRUE or FALSE).
In the simple table with columns NAME, AGE.
Row 1 = "Fred", 50
Row 2 = "Jane", NULL
What is the answer to the predicate "Is the AGE of the person with NAME [Jane] greater than 50?"
What (single) meaning would you give to the NULL?
What would the data mean (more precisely what would be being modeled):
...IF I tell you Jane has not been born yet?
.. IF I tell you that Jane is Chinese ?
.. IF I tell you that Jane is actually dead?
.. IF I tell you that Jane does not herself know her age?
.. IF I tell you that Jane would not give her age?
.. IF I tell you that Jane gave her age - but she is a movie star and always lies so her answer was not entered?
.. IF I tell you that Jane has not been asked her age yet?
.. IF I tell you that Jane is a fictional character?
.. IF I tell you that Jane is my pet hamster?
What confidence - in any of these scenarios - would you give to the statement "The average age is 50" ?
(Average age of what is a good starting point!)
Steve Jones wrote " introducing PEBKAC errors."
What is PEBKAC ?
February 25, 2008 at 11:30 am
Sorry - I was imprecise in what I wrote. I meant to say that DOUBLE can be an example of a domain which contains more than just primitives.
Then I proposed a new domain for an extended "Boolean" (so to speak)
For age my proposed domain named "HUMAN BEING AGE" would include:
* all the positive integers between 0 and 200
* Not Known
* Not Known Yet
* Not knowable
* Not applicable
My operators would be such that one could ask questions (of the data) such as
* What is the average age of all those that have an age?
* What is the average age of all ?
* What proportion are not knowable ?
* Is Jane's age the same as Eric's? (NOTE that now a True/False answer would always be returned)
etc.
NOW the point here is not that you MIGHT or MIGHT NOT want to record the Human Age data this way. The point is that all the all values in my "HUMAN BEING AGE" data type are just that - values. Whereas NULL is NOT a value.
So now I do not need to even consider the impact of NULL - no COALESCE, no IS NULL, no SET ANSI NULL, no issues with keys or indexes, no anything ! (NULL problems If you will :D)
Modeling (some of) the properties of the real-world data I wish to capture in the data types itself ELIMINATES the need for NULL.
Incidentally a general issu with SQL is exemplified as follows:
Define table B as:
NAME, AGE (Integer)
"Fred", 50
"Jane", NULL
Define table C as:
NAME, NUMBER_OF_CHILDREN(Integer)
"Fred", 3
"Jane", 1
SQL would allow a join " ON B.AGE = C. NUMBER_OF_CHILDREN" despite the fact that this makes no sense in the real world. This is because the domain Integer is most defintiely not the real-world domain for either AGE or NUMBER_OF_CHILDREN. (For a start - Integer has values MUCH larger than needed and allows negative numbers so you would need to put check constraints on the columns....)
February 25, 2008 at 11:31 am
Andrew (2/25/2008)
ACTUALLY I was hoping to get a poll of what people are actually doing to mitigate NULLs - how do I do that?I am happy to discuss NULL as well - I know it is an old chestnut but I guess the fact that the "debate" continues shows how intractable an issue it appears to be.
Steve Jones wrote " null means "xx" in "yy" table."
Which implies that NULL can only have one meaning in yy table?
Relations (supposedly anyway) use relational algebra - which is based on first-order predicates (which can only be TRUE or FALSE).
In the simple table with columns NAME, AGE.
Row 1 = "Fred", 50
Row 2 = "Jane", NULL
What is the answer to the predicate "Is the AGE of the person with NAME [Jane] greater than 50?"
What (single) meaning would you give to the NULL?
What would the data mean (more precisely what would be being modeled):
...IF I tell you Jane has not been born yet?
.. IF I tell you that Jane is Chinese ?
.. IF I tell you that Jane is actually dead?
.. IF I tell you that Jane does not herself know her age?
.. IF I tell you that Jane would not give her age?
.. IF I tell you that Jane gave her age - but she is a movie star and always lies so her answer was not entered?
.. IF I tell you that Jane has not been asked her age yet?
.. IF I tell you that Jane is a fictional character?
.. IF I tell you that Jane is my pet hamster?
What confidence - in any of these scenarios - would you give to the statement "The average age is 50" ?
(Average age of what is a good starting point!)
Steve Jones wrote " introducing PEBKAC errors."
What is PEBKAC ?
PEBCAK = "Problem Exists Between Chair and Keyboard". Also sometimes known as id-10-t errors. I.e. human error.
I think the issue being taken is that you start with the presumption that NULL is "bad". NULL has its place, usually dictated by the data you're trying to capture.
Not every relation is diadic, even in mathematics. That is certainly not the case in real-world applications of said relations. Knowing that you're dealing with an unknown is a very useful piece of information.
You can only play within the constraints of the model you're in. If you feed bad information to a system, then it will give you bad results. So what? It's job is not to keep you from lying to it - it's to use the data it has. It can also deal only with data that is valid.
So under those circumstances, the answer is - 1. We know nothing about Jane's age. 2. The average (which by mathematical definition applies to every tuple WITH A VALUE) of the set is 50. The lack of a value means that it can't be counted in any way. The fact that 50% of the tuples have no value is regrettable on a human side, but that doesn't negate or change the system answer at all.
Mitigating NULLs is ultimate a development/system design question. The requirements being represented dictate what can/cannot have. The data model just has to be able to accomodate for the possibility that there might be NULLS in certain places. What you signify with a null, do with it, or account for it - all business decisions within whatever app you're building.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 25, 2008 at 11:54 am
Heh... the correct answer is that no one has yet defined what a NULL in the age column is actually supposed to represent. Therefore, the possibilities are endless until correctly defined.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 25, 2008 at 12:54 pm
Matt wrote " think the issue being taken is that you start with the presumption that NULL is "bad".
Actually I dont have a view in that sense (because its an emotional view about a logical subject).
However I do have these views:
* NULL can cause problems that many database users dont anticipate or understand (or even know about)
* NULL increases the chance that you will get the "wrong" answer to your question
* NULL is illogical since it is not part of relational algebra - and SQL is supposedly a relational datatbase
* NULL is NEVER really needed - there are always simple design decisions that can eliminate the need for NULL
* SQL is an incomplete impelementation of a relational database and NULL "papers over some of the cracks"
* NULL introduces logical inconsistencies which are overcome IMPLICILTY in what SQL does under the covers
AND TO REPEAT
I am mainly interested in what techniques people routinely use to mitigate the risks of using NULL
February 25, 2008 at 12:56 pm
Jeff wrote "the correct answer is that no one has yet defined what a NULL in the age column is actually supposed to represent"
Which presupposes that it only represents a single "thing".
Let us say NULL = "not known".
How can your then avoid some derived table (a "view") having duplicate rows - which IS not possible in a relation?
February 25, 2008 at 1:00 pm
Why would a null be the sole cause of duplicate rows? There are duplicate ages already. There's no difference there. You can have more than one person who is 25 just as you can have more the one person with an unknown age.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 25, 2008 at 1:11 pm
Again - Andrew - what would you advocate as "the correct way" for handling the age issue from above? What do you see as the correct answer? How is it that the lack of information as to Jane's age SHOULD be treated? Sounds like you have a well defined view on that matter - just curious what that would be.
My answer (in case it wasn't clear before) was - don't mitigate the fact that it's null. Without any other info pushing my decision any other way - it's more arbitrary to presume it to be one value over another, so it doesn't get included until there is a valid value. What do you see as the answer?
Also - there are lots of instances of triadic or (higher adicity) relational algebra. Where does the "NULL has no place in Relational algebra" come from?
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 25, 2008 at 1:27 pm
Jeff - I meant 2 rows whose data represented the same entity.
Obviously that would be a big mistake as an entity must be unique !
February 25, 2008 at 1:30 pm
Yep... even if you don't know their age 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 16 through 30 (of 39 total)
You must be logged in to reply to this topic. Login to reply