February 24, 2008 at 11:53 am
Which of the following techniques do you commonly use to mitigate problems caused by NULL ?
As part of a UK postgraduate MSc in computing I am investigating the causes, problems, and ways of mitigating problems that the use of NULL in relational databases can produce. (And as a separate but linked topic - the use of NOTHING in object-oriented programming)
The options I have given are not exhaustive and are not necessarily self-explanatory so I welcome correspondence on any techiques for dealing with NULL (and NULL generally)
Although this question is about Microsoft SQL Server (In all its flavors I would be delighted to hear about other relational database.
February 24, 2008 at 1:55 pm
Many developers use nulls because the don't know what data is needed.
The simple question "does it matter if there are spaces or a zero if you didn't get the value ?" isn't being posed !
Actualy for a column to be null should be the exception !
So: not null by default is the guideline
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
February 24, 2008 at 2:27 pm
Hi
NULLs confuse people...
Let's talk for a minute about what exactly NULLs do that cause this type of reaction. The first problem is that the definition of NULL is "unknown." So, one problem is determining whether one value is (not) equal to another value, when one or both values are NULL. This trickles down to many problems for a database engine and any associated applications. The following list details some of those problems:
? they are interpreted differently depending on compatibility level and ANSI settings;
For example, let's consider two values, x and y, that are both NULL. Since the definition of NULL is unknown, then you can't say x = y. However, with the ANSI setting ANSI_NULLs, this can be different. When this setting is FALSE, x = y ... however, when TRUE, x <> y. Confusing, no?
? the storage engine has to do extra processing for each row to determine if the NULLable column is in fact NULL -- this extra bit can have performance implications;
? they produce weird results when using calculations, comparisons, sorting and grouping;
? they create problems with aggregates and joins, such as different answers for COUNT(*) vs. COUNT(fieldname);
? they produce unpredictable results in statistics computations, particularly WITH ROLLUP and WITH CUBE;
? applications must add extra logic to handle inserting and retrieving results, which may or may not include NULL values;
? they cause unpredictable results with NOT EXISTS and NOT IN subqueries (working backwards, SQL determines that NULL columns belong or do not belong to the result set, usually for the wrong reasons);
? no language that supports embedded SQL has native support for NULL SQL values.
Thanks -- Vj
February 24, 2008 at 9:59 pm
None of your questions in the poll pertain to the actual use of nulls. Nulls simply mean "unknown". They are a powerful tool that are frequently misunderstood.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 25, 2008 at 12:06 am
Jeff - In which ways can you deal with the problems caused by NULL then - for example the fact that TRUE AND NULL = NULL or 6 * NULL = NULL which may, or may not, be the "answer" you want.
(As an example WAGE + COMMISSION = TOTAL SALARY. If WAGE = 1000 and COMMISSION = NULL because it is "not known yet" then TOTAL SALARY = NULL, whereas common sense says the "correct" answer is TOTAL SALARY = 1000)
February 25, 2008 at 1:19 am
In numeric expression think of NULL as infinity (for some expressions) rather than 0, as 0 is an exact value, infinity is not:
[something] + [infinity] = [infinity]
[something] * [infinity] = [infinity]
[something] / [infinity] = 0; in SQL it's null
February 25, 2008 at 1:41 am
I was about to post the same thing - from a maths background you can consider null in similar ways to infinity.
Infinity <> infinity, Infinity * X = Infinity.
I think some people were bitten by a poor understanding of NULLs way back when and are overly passionate about not using them in DBs now. I sometimes create child tables rather than have banks of fields that could be NULL, but also am happy to have a field-type switch if it is just one or two fields that could go NULL based on the "type" of the record.
The NULL rules aren't that hard, and even if you avoid them like the plague, outer joins still bring them back 🙂
February 25, 2008 at 1:53 am
"numeric expressions ....NULL can be thought of as infinity"
Infinity is mathematically a number.
What is the equivalent "can be thought of" for DATE, VARCHAR values?
February 25, 2008 at 4:53 am
Infinity is not a number.
It's not equivalent to null either. The example was meant just as a method of presentation. To let you understand what is null like. It's something inexact, undefined, just like in math is infinity or 0/0.
Null in sql is missing/undefined value, so aggregates skip them. For SUM() it wouldn't matter if null was treated as 0, but for AVG would make a big difference. Existence of NULL is powerful feature (in some cases essential), if understood and used properly.
February 25, 2008 at 6:07 am
Andrew (2/25/2008)
Jeff - In which ways can you deal with the problems caused by NULL then - for example the fact that TRUE AND NULL = NULL or 6 * NULL = NULL which may, or may not, be the "answer" you want.(As an example WAGE + COMMISSION = TOTAL SALARY. If WAGE = 1000 and COMMISSION = NULL because it is "not known yet" then TOTAL SALARY = NULL, whereas common sense says the "correct" answer is TOTAL SALARY = 1000)
"Which may, or may not, be the "answer" you want". That's the real problem with NULL... lot's of folks don't know what to do with it. What would you have in its place in the examples given? Zero? Wrong answer would be given in one case and the correct answer in another. Why? Definitition of the problem. NULL Commissions, for example, should never be allowed because if there is no commission, you know that must be NO commission which is a known value of zero. That's where a NOT NULL contrstraint should be. But, by the same token, NULL Commissions may indicate that the person has simply not been processed for Commissions, yet. That could be a crucial piece of information.
What about "end dates". Should they be NULL or they be some date way in the future like 99991231? There's arguments for both ways. It's super easy to exclude non-enddated items just by doing a comparison with any date because the NULL enddates will inherently be excluded.
NULLs are important to counts and averages. You don't want to necessarily assign the value of 0 to items in a count... it is possible that you want to "skip" the count of items "not filled in" and NULLs allow you to do that just like empty cells in a spreadsheet do.
In my humble opinion, the best way to deal with NULLs is to plan for their use. You need to plan on what not knowing something means for every column they can appear in. You need to plan on assigning a condition to the value of NULL for each column where it can be used. You must understand what it means when something has a totally unknown value... then, you're dealing with NULLs.
Heh... and remember... "NULLs Happen" :hehe:
--Jeff Moden
Change is inevitable... Change for the better is not.
February 25, 2008 at 6:28 am
Hi Jeff
You used two different meanings for NUL: "not been processed [for Commissions] yet" and "not known"
Or presumably therefore when you say "You must understand what it means when something has a totally unknown value" by extension you mean "You must understand what it means when something has a totally unknown value OR NOT BEEN PROCESSED YET"
As an example of the complexity NULL can introduce - which of these two meanings do you want each aggregate function to "skip"?
You stated "It's super easy to exclude non-enddated items just by doing a comparison with any date because the NULL enddates will inherently be excluded."
What if NULL was to be used in a data type column with the two meanings you have so far used for NULL ?
BTW who is "YOU" ? If "YOU" is an experienced DBA or database developer then YOU may possibly be able to take all precautions against all NULLs.
What if YOU is NOT experienced ( I suspect the vast majority of SQL users are not experienced) ? WIll they know their queries have produced the "wrong" answer. (Or more precisely and hopefully perhaps the right answer to the wrong question)
BTW the query optimizer is, in a sense, a user.
In the case where there are NULLS in the database - how confident are you that for ALL transformations which the optimizer performs for ALL submitted queries that the result of the transformation gives exactly the same results as the original query? Can it be proven (logically or mathematically) that all transformations maintain equivalence?
February 25, 2008 at 7:42 am
Yeaup... I agree... and that's what I meant. You must define what NULL means as part of the design of the table and you must understand what it means to that column. Who is "YOU"? YOU may be a team responsible for the design of the tables or the guy/gal running the whole shootin' match for a small company. If you don't define the meaning of NULL as it can be used in any column, you will probably run into some confusion in the future.
One of the problems with NULL is that, like just about any other data, you can define the "semantics" of NULL. You might say, "NULL is in the eyes of the beholder".
Like I said, the best way to deal with NULLs is simply to define what they mean for any given column. Another way to deal with them is to simply not allow them to happen. I think they can be useful so I don't make that particular recommendation.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 25, 2008 at 7:51 am
As much as I'd like to say that the positions espoused by our friends CJ Date and Fabian Pascal are wonderful and true, and that NULL should be banished into the pits of computing hell never to be seen again - I can't, because it would involve a computational complexity and a lack of scalability making any solution they've ever come up with unusable in anything other than the most simplistic of scenarios. they also have some very large flaws which quite honestly invalidate pretty much any gains our 2 musical geniuses think they've added to the mix.
The bottom line is - there are lots of times in real life where you start out with incomplete information, inaccurate info. It gets refined as you go along, but there has to be a mechanism to allow for something to represent UNKNOWN/NOT ENTERED/NOT YET PROCESSED. You could "force" them to enter everything until it was perfect, and refuse to allow any unknowns, but that just doesn't fit a lot of scenarios. Think of a data entry system in an Emergency room - they enter what little they might have on the unconscious person of undetermined age with no wallet who just got wheeled in in a trauma scenario: do you really think it's appropriate to wait until we KNOW who that/what is age is/insurance info/payment info is before ordering some tests?
Your second option is to force entry of default values. Again - this introduces serious overhead that then needs to be sorted out: was this question "yes" because someone picked "yes", or was it defaulted in. It also forces you to "exclude" certain values as "invalid values" depicting whatever flavors of unknown you want to know about: what if it is a numeric range with no well-defined invalid range (or no invalid range of values at all)?
Now - in those circumstances where the "UNKNOWN" is unacceptably large, then yes - it might be appropriate to have something to further refine WHY it's unknown. But again - there are lots of scenarios where it's just not possible to know WHY it's not known at the time. Think problem resolution - you might not know what the answer is (which is the whole reason you're working on it).
Bottom line comes down to this - there are mechanisms around using NULLs, but I'd venture to say that HAVING NULLs is at times a truer representation of the scenario you're trying to document than something artificial being thrown in to artificially fill a void.
Voids do exist in nature - there needs to be a way to show that.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 25, 2008 at 8:00 am
By the way - I'd like to post a "null" value to your poll - no answer really applies to my scenario (or rather fully represents my situation).
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
February 25, 2008 at 8:13 am
The values in a table column have a type (or domain if you prefer in the non-SQL meaning of the word) which defines, loosely, the possible values the column may have. For example type INTEGER is the set of numbers consisting of the natural numbers and their negatives.
In the real-world the possible answers to a truth question T might indeed not be just "yes" or "no" but might be, say, one of "Yes", "No", "Don't Know yet", "Tell you tomorrow", "We will never know", "Not knowable".
If you decide to model the real-world answer to the truth question T by holding the answer in a column of datatype BOOLEAN then you are arguably forced to introduce NULL (or something similar) to represent all the answers OTHER THAN "YES" and "NO". (Incidentally you have therefore lost all the other meanings as well so you now cannot ask the question "how many peope replied 'Tell you tomorrow' "... )
One might argue therefore that one has chosen to create a model that does NOT map to the real-world?
Or one might argue that this is a limitation of the data types available in the database?
Now the type DOUBLE may also be defined to include NaN (Not a number) and NEGATIVE INFINITY and POSITIVE INFINITY - although these are not actually "numbers" as such.
So one might go further and define an extended Boolean type, say, "MyBoolean" with possible values ""Yes", "No", "Don't Know yet", "Tell you tomorrow", "We will never know", "Not knowable".
(Of course you would also need to define MyBoolean operators such as "=" ,"<", "AND" etc)
Viewing 15 posts - 1 through 15 (of 39 total)
You must be logged in to reply to this topic. Login to reply