"Real" copy of Codd's 12 (13) Rules for RDBMS

Question

"Real" copy of Codd's 12 (13) Rules for RDBMS

Viewing 15 posts - 31 through 45 (of 106 total)

You must be logged in to reply to this topic. Login to reply

ta.bu.shi.da.yu Hall of Fame Points: 3985 More actions · Answer 1

Jeff, I'm also interested, but before I even start looking I'm getting up to speed on set theory. In particular, understanding cartesian products, relations, functions, function composition, function limitations, etc.

Random Technical Stuff[/url]

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 2

ta.bu.shi.da.yu (7/29/2010)
Jeff, I'm also interested, but before I even start looking I'm getting up to speed on set theory. In particular, understanding cartesian products, relations, functions, function composition, function limitations, etc.

Here's a bit of understanding for cartesian products and why they are not "set based" even though they have no explicit loop and no explicit recursion... lots of folks make the mistake of using such a thing for their blogs and articles on running totals. 😉

http://www.sqlservercentral.com/articles/T-SQL/61539/

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ta.bu.shi.da.yu Hall of Fame Points: 3985 More actions · Answer 3

A cartesian product is set-based... it's a product of sets... of course on a database agreed - you really want to avoid it 🙂 But it's part of a foundational understanding of set theory.

But that running total on my blog is totally silly 🙂 A definite triangular join! Your running total method is still the one to beat... I believe that Oracle has some analytical functions that will easily allow you to do running totals, if only the SQL Server team would implement them!

Random Technical Stuff[/url]

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 4

Heh... I guess it depends on what one's definition of "set based" is. Either way, I sure appreciate your feedback.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

sqlvogel SSCrazy Eights Points: 9526 More actions · Answer 5

Tom.Thomson (7/29/2010)
If anyone attempts to use SQL's single null to indicate why data is missing they are being very stupid. Codd explicitly required NULL (and later the A_NULL mark) to carry no meaning other that a value is not present in the database. So it seems that you, myself, and Codd all agree on that point.

Thank you. I will save your comment and post a link to it the next time someone on SQLServerCentral tells me for the 1000th time that "Null means unknown SQL"! Obviously that is nonsense but you are one of the sensible few who recognise it as such.

how are we to indicate that a datum is absent?

There is another option other than those you mentioned. Create a database schema in 5NF where each table represents one single type of proposition (without "optional" attributes) that you need to record in the database. It certainly does not follow that "no table may have more than one column which is not part of the primary key". This is not, after all, a radical suggestion. People were doing it well before Codd proposed using 4VL.

Codd in his many-valued logic model seemed content to gloss over a lot of difficult problems. A DBMS which implements Codd's RM V2 apparently requires a safety net feature that detects when it may give a wrong answer and corrects that answer. It's not obvious to me that such a feature is even possible, let alone practical. It also requires special versions of aggregate functions that allow the user to eliminate the null marks to avoid inaccurate results in calculations (although if they have to be removed from such calculations the user might well wonder why they need to be there in the first place!).

In any case, none of those problems is solved by SQL. SQL demonstrably does not give accurate and consistent results where nulls are involved and that is the reason why I and many others suggest avoiding and minimising use of nulls in SQL. No one as far as I'm aware has taken up Codd's suggestions in RM V2 and solved the unfinished problems needed to turn it into a complete and practical system (please correct me if I'm wrong). When someone has done that I will certainly be interested to evaluate it - though naturally I would still be highly sceptical about MVL in principle since as far as I can see it inevitably leads to results that don't accurately model real world situations.

On the other hand, systems based on RM V1 (including those that follow Date and Darwen's D specification) can already today model all information accurately and consistently - including missing information. They reliably produce correct results and there is a large body of research and evidence into their use. It is even possible to approximate that model in SQL (if you are careful about avoiding nulls and duplicates). It's much harder to see how to achieve Codd's RM V2 ideas in a SQL-based system.

It's interesting that it took about a decade after the publication of the Third Manifesto for popular implementations of it (like Muldis and Rel) to appear. RM V2 has been around longer and yet seems to be almost ignored even in academic literature, let alone by software developers and data management professionals. That doesn't make it "wrong" of course but maybe it does show that the problems it creates are insurmountable or inconvenient in practical use.

TomThomson SSC Guru Points: 104773 More actions · Answer 6

David Portas (7/31/2010)
Tom.Thomson (7/29/2010)
how are we to indicate that a datum is absent?
There is another option other than those you mentioned. Create a database schema in 5NF where each table represents one single type of proposition (without "optional" attributes) that you need to record in the database. It certainly does not follow that "no table may have more than one column which is not part of the primary key". This is not, after all, a radical suggestion.

As far as I recall, 5NF is about allowing nontrivial join dependencies only if each component relation is a projection onto a superkey. This doesn't help deal with the case where some data is absent. I don't understand why you think it does. Going to 6NF (I mean Date terminology 6NF, not Fagin's DKNF which was the first NF to be called 6NF) and allowing no non-trivial join dependencies at all might help in a sense - it makes it easier to have no NULLs in base relations, but 5NF doesn't do a thing for an absent datum. But even 6NF doesn't eliminate NULLs from derived relations and views, because you can't require derived relations to be in 6NF.

Can you tell me how to cope with the following case:

A database has to hold information about some things called screeks. Each screek is identified by a screekID, which is an integer. There will eventually be a few dozen thousands of screeks. A screek has 100 attributes A1, A2,....A100, each of which is a bit (ie 0 or 1), a given screek has only 1 value for each attribute, the attributes are pairwise independent, and for every attribute there can be screeks that have value 1 for it and screeks that have value 0 for it. I must have all screeks in my database. For some of the screeks I may not have discovered all the attribute values at the time I have to insert the screeks: eg perhaps for the screek whose screekid is 1 I don't yet have the value for attribute A1. There is no attribute whose value is guaranteed to be available to me for each screek as soon as it is necessary to put the screek in the database.

In RM-1 + single NULL + 3VL (Codd's original NULL proposal) I can represent this in a single relation with 101 columns and use nulls (and that certainly has no nontrivial join dependencies one of whose components is not a projection on a superkey), but you don't approve of that. Alternatively I can represent it as 100 relations each with 2 columns and have no NULLs in any of these relations (and all those 100 relations are certainly in both versions of 6NF). If I go for the 6NF form there's a little problem if I want to write a query that gives me all the information about some set of screeks identified by their ids. If I write the natural join of all 100 tables I get an empty relation as the result, which is not useful. If I write code to take each screek in the set and do the join of each of those relations in which it is present, after several centuries of coding (or a few hours of coding and then a few days on an extremely powerful supercomputer to generate the code for the query) I end up with lots of separate tables with lots of different column sets and a nightmare in the application to which the data is returned, as well as having a query which needs order of 10000000000000000000000000000000000 (10 to the power 34) lines of code (assuming I can write for example a conditional compounding 100 prepositions and a 99 way join on one line) in any relational algebra that I've ever come across. The obvious (to me) thing to do is to use an outer join of those tables, which potentially delivers me a relation in which at least one datum is missing and I have to represent the absence of that datum somehow: I have to have the column, because some of the maps in the relation do map that attribute name, it's just that one or more of the maps don't map it. What is your mechanism for representing the absence of that datum?

Sometimes I don't believe that proponents of the Date view on NULLs have any such mechanism, because they always either say it's solved by something which is actually completely irrelevant (as you did, since 5NF is clearly irrelevant to the issue of absent data) or claim that there's no need ever to represent an absent datum (which would be nice, but those of us who work in the real world have to deal with an absent datum quite often: pretending that we casn always wait until all the data can be made available is silly, although it is very frustrating when the absence of a datum is the result of management stupidity/ignorance/incompetence/arrogance). So even though I can avoid having absent values in the base relations (at the frightening cost of having things like 100 way outer joins) there isn't a way of avoiding them in derived relations (and that means they can't be avoided in Views either).

In this particular example, it's possible that most queries would be interested only in a small number of attributes, so the 6NF representation might improve performance of those queries enough to offset the hit of having to use a 100-way outer join for some queries (particular as it's an outer equijoin on primary keys). Or it might not - most people worry about 100-way joins, perhaps for good reason. I think that in practise I would stick with a base table whose non-key columns allow nulls. But this (100 columns where some data may be absent) is an extreme case, and usually it will be sensible to avoid nulls in base tables.

Date of course repeatedly claimed that using a defult value instead of NULL solves the problem, and clearly if one is prepared to put things into the database which are facts so far as the database is concerned but potentially false in the real world that the database describes then there is no need to worry about absent data - but if anyone builds any safety-critical software based on that concept I hope they are charged with criminal negligence when (not if) their product starts killing people.

People were doing it well before Codd proposed using 4VL.

How long is "well"? Fagin's 5NF paper was about the same time as Codd's presentation in Tasmania (both were early 79, not sure which was first) and that had multi-valued logic; and I'm not sure that was the first time that Codd proposed multi-valued logic; anyway, if people were using 5NF long well before Fagin's paper it seems somewhat strange.

Tom

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 7

David Portas (7/31/2010)
Tom.Thomson (7/29/2010)
If anyone attempts to use SQL's single null to indicate why data is missing they are being very stupid. Codd explicitly required NULL (and later the A_NULL mark) to carry no meaning other that a value is not present in the database. So it seems that you, myself, and Codd all agree on that point.
Thank you. I will save your comment and post a link to it the next time someone on SQLServerCentral tells me for the 1000th time that "Null means unknown SQL"! Obviously that is nonsense but you are one of the sensible few who recognise it as such.

Oh jeez! C'mon and get off your high horse, David. What's next? A lecture for hours to come on what you think the semantical differences between "missing" and "unknown" data is? Stop being so bloody condescending! If you're going to keep that crap up, get off my thread. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

sqlvogel SSCrazy Eights Points: 9526 More actions · Answer 8

Jeff Moden (8/1/2010)
Oh jeez! C'mon and get off your high horse, David. What's next? A lecture for hours to come on what you think the semantical differences between "missing" and "unknown" data is? Stop being so bloody condescending! If you're going to keep that crap up, get off my thread. 😉

It was Tom, not me, who made a point of saying that there was a difference. However, in the past I've made the point that SQL null is not the same as unknown and have been lectured at some length by other people claiming otherwise. All I did here was write two sentences agreeing with Tom when he also points out the same difference that I did. What's wrong with that? Do you disagree with me and with Tom on that point?

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 9

Maybe I have your intention wrong but look again. Tom said that a NULL could not be used to explain WHY data was missing (or unknown my book). You went off on the idea that NULL meant neither. Who's disagreeing with who? Sounds like you're talking about two different things. Then you went into the condescening-crap-mode by saying "Obviously that is nonsense" and implied that only Tom and a few other are "sensible" when it appears that even Codd disagrees with you.

Now... Without being condescending, why do you say that NULL does NOT mean UNKNOWN in SQL when you know neither the data nor the reason for the data to be missing?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

sqlvogel SSCrazy Eights Points: 9526 More actions · Answer 10

Jeff Moden (8/1/2010)
Maybe I have your intention wrong but look again. Tom said that a NULL could not be used to explain WHY data was missing (or unknown my book). You went off on the idea that NULL meant neither. Who's disagreeing with who? Sounds like you're talking about two different things. Then you went into the condescening-crap-mode by saying "Obviously that is nonsense" and implied that only Tom and a few other are "sensible" when it appears that even Codd disagrees with you.
Now... Without being condescending, why do you say that NULL does NOT mean UNKNOWN in SQL when you know neither the data nor the reason for the data to be missing?

I think you are being unfair and selective in commenting on my reply. Tom used the word "stupid" to describe people who use NULL to represent the reason why data is missing in SQL. Even though I agree with Tom that it's a stupid thing to do, I don't think I've ever called anyone stupid in this or any other forum. I accept it was probably just a rhetorical flourish on Tom's part and wasn't aimed at anyone in particular... but then you single out me for being condescending?!

The point both Tom and I agree on is that in a single null system like SQL, the null is often used to indicate the absence of a value but nothing can reliably be inferred about why it is absent. One reason why a value is absent may be that it is unknown but there can be lots of other reasons too. For instance nulls are returned in outer joins or in the sum of an empty set, which certainly are not unknown values. The null is just used as a placeholder for a value which is known not to exist.

So the way SQL uses it, null is not the same as unknown, it's used to mean a whole bunch of other things that have nothing to do with being unknown. If you assume that null means unknown then you will get the wrong results in SQL. Most obviously, X=X is always TRUE if X is unkown but is not true if X is null. QED.

sqlvogel SSCrazy Eights Points: 9526 More actions · Answer 11

Tom.Thomson (8/1/2010)
Can you tell me how to cope with the following case:

Quite simply I would have a relation for each unique type of screek. From what you have said that does not seem to imply anything like 6NF - but I only read your description quickly, so I'm sorry if I missed the point. It seems quite natural to me that each unique type of proposition should have its own relation.

The relevance of 5NF is that it means a design without join dependencies not implied by superkeys and also without nulls. I mentioned 5NF only because I wanted to make it clear that I was not talking about 6NF.

I don't have a problem with "outer joins" by the way. I think it's a good idea for a relational language to have a syntax shortcut for a join combined with a union. The outer join operator in SQL is unfortunately less powerful than it should be because it forces nulls upon the user for the union part of the query where it is normally much more convenient to return other values instead. Sadly, there is no generally reliable and concise way to do that in SQL so we are stuck with nulls. Of course those nulls always get removed for the end user because no one wants nulls in their end results. This again begs the question of why SQL (and Codd) should force nulls on us when we neither want or need them.

sqlvogel SSCrazy Eights Points: 9526 More actions · Answer 12

David Portas (8/1/2010)
Tom.Thomson (8/1/2010)
Can you tell me how to cope with the following case:
Quite simply I would have a relation for each unique type of screek.

Actually it's simpler than that. Applying Date and McGoveran's Orthogonal Design principle ensures that data isn't needlessly split between multiple relations. This simplifies things because logic that applies to any attribute doesn't need to be duplicated. So for example, instead of:

{screekID,A1,A2}

{screekID,A1,A2,A3}

{screekID,A1,A2,A4,A5}

For the same three types of screek I would do:

{screekID,A1,A2}

{screekID,A3}

{screekID,A4,A5}

screekID being the key in each case.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 13

David Portas (8/1/2010)
Jeff Moden (8/1/2010)
Maybe I have your intention wrong but look again. Tom said that a NULL could not be used to explain WHY data was missing (or unknown my book). You went off on the idea that NULL meant neither. Who's disagreeing with who? Sounds like you're talking about two different things. Then you went into the condescening-crap-mode by saying "Obviously that is nonsense" and implied that only Tom and a few other are "sensible" when it appears that even Codd disagrees with you.
Now... Without being condescending, why do you say that NULL does NOT mean UNKNOWN in SQL when you know neither the data nor the reason for the data to be missing?
I think you are being unfair and selective in commenting on my reply. Tom used the word "stupid" to describe people who use NULL to represent the reason why data is missing in SQL. Even though I agree with Tom that it's a stupid thing to do, I don't think I've ever called anyone stupid in this or any other forum. I accept it was probably just a rhetorical flourish on Tom's part and wasn't aimed at anyone in particular... but then you single out me for being condescending?!
The point both Tom and I agree on is that in a single null system like SQL, the null is often used to indicate the absence of a value but nothing can reliably be inferred about why it is absent. One reason why a value is absent may be that it is unknown but there can be lots of other reasons too. For instance nulls are returned in outer joins or in the sum of an empty set, which certainly are not unknown values. The null is just used as a placeholder for a value which is known not to exist.
So the way SQL uses it, null is not the same as unknown, it's used to mean a whole bunch of other things that have nothing to do with being unknown. If you assume that null means unknown then you will get the wrong results in SQL. Most obviously, X=X is always TRUE if X is unkown but is not true if X is null. QED.

We seem to be circling around semantics again. In my book I use unknown to mean "I know nothing of this, to include whether it should have a value", as opposed to "unknown value" (i.e. it SHOULD have a value, but for some reason we don't know doesn't).

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

sqlvogel SSCrazy Eights Points: 9526 More actions · Answer 14

Matt Miller (#4) (8/1/2010)
We seem to be circling around semantics again. In my book I use unknown to mean "I know nothing of this, to include whether it should have a value", as opposed to "unknown value" (i.e. it SHOULD have a value, but for some reason we don't know doesn't).

That may be a reasonable definition in other contexts but it's no help to anyone trying to understand SQL nulls. For example a null in an outer join means that a corresponding value definitely does NOT exist for that row. A null returned by the SUM of an empty set means zero (the null is returned instead of zero, zero being the right answer). In both cases SQL is not using null to mean "I know nothing of this, to include whether it should have a value". So your definition fails when applied to SQL.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 15

David Portas (8/1/2010)
Matt Miller (#4) (8/1/2010)
We seem to be circling around semantics again. In my book I use unknown to mean "I know nothing of this, to include whether it should have a value", as opposed to "unknown value" (i.e. it SHOULD have a value, but for some reason we don't know doesn't).
That may be a reasonable definition in other contexts but it's no help to anyone trying to understand SQL nulls. For example a null in an outer join means that a corresponding value definitely does NOT exist for that row. A null returned by the SUM of an empty set means zero (the null is returned instead of zero, zero being the right answer). In both cases SQL is not using null to mean "I know nothing of this, to include whether it should have a value". So your definition fails when applied to SQL.

A set with no members is what is called a NULL set. There is a convention (meaning arbitrarily set) on occasion to define its sum as 0, but there's no real foundation for it since the addition operation is defined as an operation between two values. The underlying reason for that is so that it doesn't have an impact on the summation of sets (some NULL some not), but - you end up with the same result if the operation ignores them or set them to 0.

One could argue for or against assigning 0 to the summing of a NULL set. That's the thing with conventions: they can go either way....

Same argument applies in OUTER joins. nothing qualifies, so it's back to a NULL set.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?