July 25, 2010 at 10:22 am
PaulB-TheOneAndOnly (7/14/2010)
Amazing documentation, Thank you a lot Tom.I'm saving these jewels on my private library for future reference 🙂
I agree. This is very good information.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
July 25, 2010 at 12:23 pm
Jeff - I believe you're looking for the October 1985 article series in Computerworld, by Codd, entitled:
Does your DBMS Play by the rules
From Codd's own description it's the first time he went into all 12 rules in a single article. The first article is the base definition of the 12 (and rule 0), with a follow-on article to describe the details and corrolary rules.
As far as I know, it's still under copyright, so I can't post copies.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
July 25, 2010 at 12:32 pm
Tom.Thomson (7/25/2010)
Note that the guaranteed-access feature represents an associative-addressing scheme that is unique to the relational model. It does not depend at all on the usual computer-oriented addressing. Moreover, like the original relational model, it does not require any associative-addressing hardware, even though the need for such hardware was once frequently claimed by opponents of the relational model.The primary-key concept, however, is an essential part of Feature RM-1. Feature RS-8 requires each base relation to have a declared primary key (see Chapter 2). Feature RM-1 is one more reason why the primary key of each base relation should be supported by every relational DBMS, and why its declaration by the DBA should be mandatory for every base relation.
Guaranteed access is surely also important for derived relations as for base relations. Yet by the time Codd introduced the 13 rules he abandoned the idea of relational completeness so that primary keys are no longer required for views. Instead views can have what he called "weak keys" (tuples are "unique" but are identifiable only by nulls in place of values).
This seems like a horrible complication in my view because it means the "relational" database engine then has to deal with two types of data structure - one of which doesn't look like a relation at all. In consequence, different types of storage maybe have to be used for derived relations and base relations unless some compromise between the two can be found. It may also be that some types of query rewrite are no longer possible or are made more difficult because base and derived relations are no longer interchangeable.
In this respect and others the 13 rules depart from the simplicty and consistency of Codd's model as he originally described it - to its detriment in my opinion.
July 25, 2010 at 1:39 pm
Matt Miller (#4) (7/25/2010)
Jeff - I believe you're looking for the October 1985 article series in Computerworld, by Codd, entitled:Does your DBMS Play by the rules
From Codd's own description it's the first time he went into all 12 rules in a single article. The first article is the base definition of the 12 (and rule 0), with a follow-on article to describe the details and corrolary rules.
As far as I know, it's still under copyright, so I can't post copies.
Thanks Matt... I'll check it out.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2010 at 2:04 pm
Heh... the more I learn of this subject and similar subjects especially when the human element is included, the more I realize that there's one and only one premise in all the world that will never ever change...
... "It Depends". 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
July 25, 2010 at 4:15 pm
David Portas (7/25/2010)
Guaranteed access is surely also important for derived relations as for base relations. Yet by the time Codd introduced the 13 rules he abandoned the idea of relational completeness so that primary keys are no longer required for views. Instead views can have what he called "weak keys" (tuples are "unique" but are identifiable only by nulls in place of values).This seems like a horrible complication in my view because it means the "relational" database engine then has to deal with two types of data structure - one of which doesn't look like a relation at all. In consequence, different types of storage maybe have to be used for derived relations and base relations unless some compromise between the two can be found. It may also be that some types of query rewrite are no longer possible or are made more difficult because base and derived relations are no longer interchangeable.
In this respect and others the 13 rules depart from the simplicty and consistency of Codd's model as he originally described it - to its detriment in my opinion.
The database has to deal with conceptual relations and base relations and views, so that's three things not two. Conceptual tables arise when a base relations is expressable as an equijoin of two more fundamental relations - in that case the two more fundamental relations should be defined in the catalogue and the original base relation defined in terms of them (this can happen as a side-effect of the normalisation process). The conceptual relation's primary key is the columns of the primary keys of the new base relationss (with duplicate columns eliminated, of course). This can happen repeatedly so that conceptual tables are several levels from some of the base tables from which they inherit data. I suspect that what you are calling views is actually what Codd called conceptual relations (maybe there was a terminology shift somewhere between RM-1 and RM-T). If so, the restriction on what views can be formed is really draconian and would I think result in an unusable system.
The original model RM-1 (and RM-T and RM-2) contained the requirement for it to be possible to create a view simply by specifying that a retrieval statement that was to be treated as the view. That means that a view could have a primary key only if the retrieval statement could not return duplicates (which is fair enough if one assumes that retrieval eliminates duplicates) and it could have a sensible/useful primary key if the components of that primary key were all (and only) those columns contained in the view derived from the primary keys of relations fed into the view. For this not to require the view to be a conceptual table (ie an equijoin on two more primitive tables, each of which is either a conceptual table or a base table) requires the database to be able to take into account uniqueness constraints other than primary key constraints and of the one-to-one-ness of various operators and functions over the value domains, so that not only does a concept analogous to schema-binding in T-SQL has to be introduced but also the database system may have to reason about arithmetic operations, string operations and so on in a manner which Turing proved to be impossible. It seems clear that insisting on having a primary key would place an extremely strong restriction on the definition of views, so strong in fact as to be unacceptable. Of course one could throw away the ease of use criterion and state that in addition to a retrieval statement the view definition would specify which columns of the view were to form the primary key, but that (a) still leaves one restricted to schema-bound views (so it still requires the schema-bound concept) and (b) may still leave the database unable to verify that the view definition is invalid (because the halting problem is insoluble) and (c) still places excessively severe restrictions on what views can be defined.
As for different logical storage models for views and tables, you are clearly assuming either that views are always fully manifest (even when not being used) or that schema definitions are not held in relations (or most likely both, as I can't see any other way to derive separate logical storage from the idea that a view may fail to have a PK). But RM-1 and RM-T and RM-2 and indeed even the third manifesto (to throw in something you presumably approve of, as it's one of Date's things) require both of those to be false.
Tom
July 28, 2010 at 1:22 pm
I am using the term "view" in the same sense that Codd used it (at least in the RM/V2 book). View meaning derived relation.
My point was simply that in Codd's original relational model every derived relation expressible in Relational Algebra (or its equivalent) was guaranteed to have a primary key. A simple consequence of the fact that relational expressions always return relational results and never return duplicate tuples. So you never had to be concerned with more than one type of data structure or worry about whether derived relations had a key or not. I don't think this is "draconian" or restrictive. It's a very useful property of the algebra.
The rules for deriving keys from relational expressions are quite well understood and widely used. For example I sometimes use the Rel RDBMS, which automatically highlights keys in every query result.
Codd's introduction of nulls in later years compromised this very effective model by (according to Codd) permitting query results which didn't have candidate keys at all.
Tom.Thomson (7/25/2010)
As for different logical storage models for views and tables, you are clearly assuming either that views are always fully manifest (even when not being used) or that schema definitions are not held in relations (or most likely both, as I can't see any other way to derive separate logical storage from the idea that a view may fail to have a PK).
I was talking about physical storage and the mapping from logical to physical (for any view or table). Mapping a logical relational expression into physical storage might have to be done differently depending on whether the expression in question has a key or not. There's nothing fundamentally wrong with a view being materialized in physical storage, provided it behaves correctly in logical terms. That doesn't stop its definition being held in the database catalog as well of course. Maybe I misunderstand you, but those things are not mutually exclusive and I'm not sure why you would think they are.
July 28, 2010 at 5:56 pm
David Portas (7/28/2010)
I am using the term "view" in the same sense that Codd used it (at least in the RM/V2 book). View meaning derived relation.
So in effect you mean any relation that can be derived as an expression in the relational calculus using algebraic an logical expressions and base relations? In that case our disagreement does not arise from an inconsistency of terminology.
My point was simply that in Codd's original relational model every derived relation expressible in Relational Algebra (or its equivalent) was guaranteed to have a primary key. A simple consequence of the fact that relational expressions always return relational results and never return duplicate tuples.
Yes, there will always be a primary key because the view contains no duplicates; however, there is a usefulness issue here: saying that everything in the view is accessible by combination of primary key and column is a bit pointless when the primary key includes every column, which in many cases it will - the user has to know the value in question in order to be able to retrieve it, so the univeral accessibility principle is not something interesting but a boring tautology in those cases. I feel that even when the primary key can be mechanically determined and is is not the whole row it will sometimes not be a useful key (for example any time that it omits some primary key columns of some of the component relations, and does contain some other columns, partly because it is then very difficult for the user to deduce what the primary key will be and partly because, as in the case where the key is the whole row, there is a good chance that the primary key includes what the user wants to look for, amd finally because even when this doesn't apply there is still a good chance that the user will not know the primary key for what he is looking for).
So you never had to be concerned with more than one type of data structure or worry about whether derived relations had a key or not. I don't think this is "draconian" or restrictive. It's a very useful property of the algebra.
But you only get to avoid being draconian by allowing the primary key on the view to be practically useless in many cases. Besides which, I can't understand why you think that the existence or nonexistence of a primary key has any impact on the data structure used for a view.
The rules for deriving keys from relational expressions are quite well understood and widely used. For example I sometimes use the Rel RDBMS, which automatically highlights keys in every query result.
Codd's introduction of nulls in later years compromised this very effective model by (according to Codd) permitting query results which didn't have candidate keys at all.
Tom.Thomson (7/25/2010)
As for different logical storage models for views and tables, you are clearly assuming either that views are always fully manifest (even when not being used) or that schema definitions are not held in relations (or most likely both, as I can't see any other way to derive separate logical storage from the idea that a view may fail to have a PK).I was talking about physical storage and the mapping from logical to physical (for any view or table). Mapping a logical relational expression into physical storage might have to be done differently depending on whether the expression in question has a key or not. There's nothing fundamentally wrong with a view being materialized in physical storage, provided it behaves correctly in logical terms. That doesn't stop its definition being held in the database catalog as well of course. Maybe I misunderstand you, but those things are not mutually exclusive and I'm not sure why you would think they are.
I wouldn't suggest they are mutually exclusive, I thought that what you said implied that EITHER (a) the view has to be manifest whether it's also held as a definition or not OR (b) that the only representation of the view in the database is its manifest extent. I can't see how you would derive your conclusions without one or the other of those premises.
I agree that NULLs are a problem. There are at least four possible approaches: (A) sweep them under the table and pretend that the database is always an accurate representation of the real world, so that it doesn't need to cope with missing data because no data is ever missing, and use a logic system that is black and white and admits no shade of grey; this is the one of the two (mutually inconsistent) Date approaches, I don't know whether it's your approach or not; (B) assume that databases can only contain known data, and when an attribute is such that we sometimes can't provide a value for that attribute in a particular row we must simply exclude that column from being represented in the database - I don't think anyone actually advocates that approach today (but it may be your view - again I don't know); (C) assume that picking a domain-specific default value (eg 0 for integers, empty string for strings, and so on) when the real value is unknown means that an application programme can by some divine omniscience deduce for eample whether "0" means "I don't have the value" or means "0"; this is the other Date approach, again I don't know whether this is your view; (D) require the database to indicate clearly when it doesn't possess a value; this is my view, Codd's view as expressed in all his writings from 1979 onwards, and clearly anethema to you and to Date. I believe that you will have to change your view if you want to adopt a relational model which is useful in the real world instead of an abstract and interesting but for most purposes useless academic toy. Maybe I'm wrong, but I don't think so.
Tom
July 28, 2010 at 8:50 pm
Since Codd rather well endorsed NULLs, why are they considered such an irritant even by some of those that appear to follow his every word with nearly religious zeal? Not that it matters to anyone else but I've learned to use NULLs rather well and will wish a mighty barrage of extremely high velocity, low temperature pork to the wit that insists upon their elimination. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
July 28, 2010 at 10:40 pm
Jeff Moden (7/28/2010)
Since Codd rather well endorsed NULLs, why are they considered such an irritant even by some of those that appear to follow his every word with nearly religious zeal? Not that it matters to anyone else but I've learned to use NULLs rather well and will wish a mighty barrage of extremely high velocity, low temperature pork to the wit that insists upon their elimination. 😉
The relational model is science, not religion. No one person is infalible or has a monopoly over it. I can't think of anyone I've come across who follows Codd with "religious zeal" (although plenty of people have probably been attacked ad hominem for that).
The common criticisms of null are mostly criticisms of Many Valued Logic (3VL, 4VL, ...) rather than what Rule 3 actually says, which is "null values (distinct from the empty character string or a string of blank characters, and distinct from zero or any other number) are supported". In a nutshell:
1) MVL does not solve the problem of missing data
2) It causes results that are incorrect in the real world (a fact which Codd readily acknowledges)
3) The relational model worked perfectly well before Codd suggested adding nulls to it
4) Adding nulls to the picture doesn't add any expressive power, it only adds complexity and contradictions
It seems that people who find nulls useful often base that view on their experience with SQL, just because that is the most common model that implements anything similar to what Codd proposed. However, SQL is not a very good implementation of the relational model. The SQL language won't work without nulls but that doesn't necessarily mean they are useful in the relational model generally. Even in SQL, there are plenty of people, myself included, who create successful database solutions without using nulls in base tables.
July 29, 2010 at 1:05 am
Tom.Thomson (7/28/2010)
Yes, there will always be a primary key because the view contains no duplicates; however, there is a usefulness issue here...
I read what you wrote a few times but I can't see any of your arguments against keys that couldn't be applied equally to base relations as to derived relations. I'm sure that's not what you meant though. It seems to me the important feature is that the system should identify a key for the user for every relation - base or derived. I don't know why you think keys are less important in derived relations. On the evidence, lack of support for keys in SQL views and query results has been hugely detrimental to the use of SQL in many applications.
Tom.Thomson (7/28/2010)
I thought that what you said implied that EITHER (a) the view has to be manifest whether it's also held as a definition or not OR (b) that the only representation of the view in the database is its manifest extent. I can't see how you would derive your conclusions without one or the other of those premises.
I don't think views must be manifest. I think it's useful that they can be (I assume that by "manifest" you mean the data for the view is materialized in physical storage). If the view doesn't have a key then maybe the DBMS has to invent a "uniquifier" for it in storage, just as SQL Server does for clustered indexes on non-unique columns. The "uniquifier" approach has potential disadvantages however because it's another level of indirection that requires an extra lookup during data access. I think it would be better for every view to have a key.
I agree that NULLs are a problem. There are at least four possible approaches ...
You are unfairly misrepresenting the view of Chris Date (and many others). No one is talking about sweeping missing data under the carpet. If data is missing then it is missing. Adding a null to the database doesn't make it any more missing than just not storing that value at all.
The usual reason given for adding a null is not to show that data is missing but to state some other information about why it is missing. For example to show that the data value is missing because it is unknown or missing because it is inapplicable. This is an approximation however, because nulls (in SQL for example) do not accurately represent those cases and therefore the results are inaccurate and inconsistent. Codd's general approach of using N-value logic to represent different reasons why data is missing is inherently flawed. Sooner or later you will have a (N++)th reason why data is missing and so you will have to keep inventing new logics ad infinitum.
My approach is closer to what you have called (D): "require the database to indicate clearly when it doesn't possess a value". I want to indicate missing data very clearly by leaving out the missing value. If however I wish to store additional information about why the value is missing then I would do that very explicitly with another, separate attribute. I would not try to overload that information into the place where the missing data belonged because I want each attribute to have one explicit meaning, not multiple meanings. I don't know why you would think that giving an attribute multiple meanings is clearer than giving it only one meaning, but anyway I disagree.
July 29, 2010 at 6:23 am
Jeff Moden (7/28/2010)
Since Codd rather well endorsed NULLs, why are they considered such an irritant even by some of those that appear to follow his every word with nearly religious zeal? Not that it matters to anyone else but I've learned to use NULLs rather well and will wish a mighty barrage of extremely high velocity, low temperature pork to the wit that insists upon their elimination. 😉
Actually the people who object strongly to NULLs appear not to think much of Codd and believe instead in the work of Date. Most of them were taught about relational databases using Date's textbook, which is used (unfortunately) in many universities, and have never thought about the relational model for themselves, just parrotting (often inaccurately) what they remember from their undergraduate years. Date disagreed with Codd about many issues, not least of which was NULLs. If you look at the Date's "Much Ado" paper (thanks due to Hugo for providing the URL for that: http://www.dbdebunk.com/page/page/1706814.htm) you may be able to see what some of the other issues of disagreement were (and get some idea of the lack of rigour in Date's arguments). Of course Date (unlike Codd) was heavily involved in SQL/DS and DB2 design, which ought to cause a relational purist to regard him as the devil's emissary, but for some reason they all forget that.
Tom
July 29, 2010 at 7:07 am
I've heard such things of Date which is surprising since Codd and Date were business partners at one time (IIRC). I'll have to take a look at the link you provided.
In my searches for more on Codd, I found a great article by Codd that's more than decade earlier that the famous 12 rules article and found it to be much more interesting to boot... I've read it once and now I'm going back to study it more indepth... (hope this URL actually works for you)...
--Jeff Moden
Change is inevitable... Change for the better is not.
July 29, 2010 at 7:50 am
David Portas (7/29/2010)You are unfairly misrepresenting the view of Chris Date (and many others). No one is talking about sweeping missing data under the carpet. If data is missing then it is missing. Adding a null to the database doesn't make it any more missing than just not storing that value at all.
The usual reason given for adding a null is not to show that data is missing but to state some other information about why it is missing. For example to show that the data value is missing because it is unknown or missing because it is inapplicable. This is an approximation however, because nulls (in SQL for example) do not accurately represent those cases and therefore the results are inaccurate and inconsistent. Codd's general approach of using N-value logic to represent different reasons why data is missing is inherently flawed. Sooner or later you will have a (N++)th reason why data is missing and so you will have to keep inventing new logics ad infinitum.
My approach is closer to what you have called (D): "require the database to indicate clearly when it doesn't possess a value". I want to indicate missing data very clearly by leaving out the missing value. If however I wish to store additional information about why the value is missing then I would do that very explicitly with another, separate attribute. I would not try to overload that information into the place where the missing data belonged because I want each attribute to have one explicit meaning, not multiple meanings. I don't know why you would think that giving an attribute multiple meanings is clearer than giving it only one meaning, but anyway I disagree.
If anyone attempts to use SQL's single null to indicate why data is missing they are being very stupid. Codd explicitly required NULL (and later the A_NULL mark) to carry no meaning other that a value is not present in the database. So it seems that you, myself, and Codd all agree on that point. (There may be cases in which the chance of something being absent for a reason other than that it is inapplicable is negligible; using null to indicate absence in those circumstances can be regarded as using null to indicate inapplicability. But I don't much like seeing that in a single NULL system.)
You say you want to indicate missing data very clearly; I prefer "absent" to "missing", because "missing" implies applicable, which "absent" doesn't; but I assume that you didn't intend that implication.
Anyway, how are we to indicate that a datum is absent?
We can't use a "default value" which could also represent a datum that is present, as repeatedly advocated by Date from 1982 onwards (maybe he started earlier; I don't know) because we then don't know whether the value is absent or not.
I guess we can leave out the whole row - but now a lot of other data which could be present is forced to be absent, unless we invent a new table whose sole content is the primary key of the original table and the single colum some of whose data may be absent; this would lead immediately to a rule that no table may have more than one column which is not part of the primary key; but even then we have a failure - we can't get all the infomation about the thing which a particular value of the primary key denotes by doing an inner join on these tables, since if any single attribute is missing the inner join will not have a row for that primary key value; so we'll have to use an outer join instead - but oh, wait a minute, we now have a row in which one of it' vlues is absent, which this splitting into small tables was designed to avoid. So that approach fails.
Perhaps we should leave the datum out of the row? How do we model this? We could say that each datum is tagged with an identifier of the column it represents, and if the tag for a particular column is not present then the corresponding datum is absent; or we can say that for a column whose datum is absent we use a mark in the row that indicates that this datum is not present. I guess the former is what you weant to do, and of course the latter is the NULL mark. So we have to ask whether there is any real practical difference between these two cases - in particular can the first case be handled using a 2-valued logic (we know the second cannot).
Suppose relation T1 has attributes A (the primary Key) and B (some other value - an integer) and perhaps some more columns which don't affect the argument; and relation T2 has attributes A (the primary key) and C (some other value - an integer again) and perhaps some more columns. Let's look at what happens when we try some logic on absent values. Consider the case where for some value a0 of A , both tables have a row with primary key a0, B is present in that row of T1 and C is absent in T2 in that row of T2. In the course of some operatiop in the relational calculus we have to evaluate B<C, B=C, or B>C for those two rows. Suppose we use two value logic. Then one of these three prepositions must be true and the other two false; which shall we choose to be true? There is no means of deciding - one of them may be true in the real world, but we don't have the datum in the database so we don't know in the database which one that is. That is the whole point: in the database, it is UNKNOWN in this case whether B<C, or B=C, or B>C so we can't logically assign TRUE to any of them; neither can we assign FALSE to any of them. So we can either require the relational operation to fail or we can assign a third truth value, UNKNOWN to each of them. There is of course an issue with prepositions like (B<C or B=C or B>C), as recognised by Codd - tautologies are always true (actually this is not the case when data is absent because it is "inapplicable", and that is probably why Codd introduced his I-NULL to form a 2-NULL system and his 4-valued logic to replace the 3-valued logic of his first proposals on absent data; but it's easy to convert a schema which needs I-NULL to one in which it can never occur, indeed some people refer to I-NULL elimination as a form of normalisation, so we needn't worry about that) so it might be useful to include some tautology detection in the algebraic engine. Anyway, the recognition that data can be absent appears to force one to use a logic with more than two truth values. The consequences of absent data are unpleasant - the database sometimes delivers wrong answers. Of course if we use a two-valued logic to handle absent data, the database will still sometimes deliver wrong answers (because it will treat things as TRUE which may in fact be FALSE, and/or vice versa). And if we use default values (the Date approach) instead of representing absence of data accurately the database will sometimes deliver wrong results - and even worse, the database will in that case be unable to indicate when there is a risk of a wrong result, because it will not be able to detect when data is absent, and the user can not know whether or not there is a risk that the result is wrong (the "modify default" issue enforces this even if each domain rather than each column has a default value).
So we wind up needing a multi-valued logic, and either leaving out columns in individual rows or marking columns as absent in individual rows. It's actually rather convenient (from a usability point of view, which is what should probably decide the matter, and also from a relational calculus engine implementation point of view) to use a mark, and hold and display it where if present the column value would be held, because then people (and algebra engines) can see at a glance just which data is absent. There's also the question of UPDATE statements - if we don't use a maker we will need some clever mechanism for the user to say something like "as a result of this update the data in table T1 and column X for primary keys K1,....Kn must be made ABSENT even if they were previously present". I think a phrase assigning the quality of absence to a column (like "set C=NULL") is going to need some syntactical device to denote that quality, and I can't see anything wrong with calling that syntactical device "NULL" and using the same term to denote either an "absent datum" mark or (if someone chooses to implement "absence of datum" that way) to denote the absence of a range element from a map.
Tom
July 29, 2010 at 8:58 am
Jeff Moden (7/29/2010)
I've heard such things of Date which is surprising since Codd and Date were business partners at one time (IIRC). I'll have to take a look at the link you provided.
Yes, they set up a consulting firm together - disagreeing on technical stuff didn't mean they fell out on a personal level.
In my searches for more on Codd, I found a great article by Codd that's more than decade earlier that the famous 12 rules article and found it to be much more interesting to boot... I've read it once and now I'm going back to study it more indepth... (hope this URL actually works for you)...
That article was the first publicly available one on relational databases, what it describes is what people like me and David Portas refer to as RM version 1. It's what started it all off (there was an internal IBM article a year earlier, but the impact was nothing like that of the ACM article, because there was so much opposition in IBM). Thanks for posting the link.
Tom
Viewing 15 posts - 16 through 30 (of 106 total)
You must be logged in to reply to this topic. Login to reply