Relations are a branch of mathematics which deals with mappings among sets defined by predicate calculus from formal logic. Just like an algebraic equation, there are many forms of the same relational statement; but the "normal forms" of relations are certain formally-defined desirable constructions. The goal of Normal Forms is to avoid certain data anomalies that can occur in un-normalized tables. Data anomalies are easier to explain with examples. When Dr. Codd defined the relational model some additional requirements were added to the visualization of the relation as a table:
- If a row in a table can exist without a certain attribute being known, then this is considered a NULL state for that attribute and the DBMS must use a special symbol for that state. We have that in SQL, but NULLs have a data type to signal the SQL engine how to handle the physical storage.
- No two rows are identical. Furthermore it must be possible to define a subset of the known attributes of a relation which have no identical values. This subset of attributes is known as a "key". SQL was built on old file systems, and the language can allow redundant duplicates. But that is a sign of really bad programming.
- One key in a relation is the "primary key". This key cannot have any NULL attributes, of course. This initial definition was later modified to regard all keys as “equally key-like”, but the notion was set in the early literature and it got into SQL. The actual motive was from sequential files; a sequential tape file has to be sorted on a sort key to work. ISAM files on disk also had to have a sort key. This is why T-SQL has clustered indexes and and clustering as the default for the PRIMARY KEY. The new technology mimics to old technology, until it learns better and has its own voice.
- The order of rows and columns are irrelevant and do not contain any information. This leads us to the requirement that any atomic fact in the database (a collection of such tables) can be found by a table name, column names, and a primary key value.
Again, SQL assumes that it can go to the ANSI/ISO Standard Schema Information meta-data tables in order to get a default list of columns that it can then use to fill in the column lists in various statements. It is a shorthand.
When first being confronted with SQL, a programmer can easily assume that this is a physical ordering because that is how all of the files he has ever worked with are stored - physically contiguous tracks on a disk. Nope. Wrong. Columnar databases are not contiguous storage. Neither is a hashed database. The first thing you have to do to learn SQL is throw out a physical model mindset and think in abstractions.
Dr. Codd added quite a few other requirements (initially a total of 12 rules and eventually 333) but those listed here will suffice for our discussion.
First, Second, Third, and Boyce-Codd Normal Forms
Let's talk about going from a file system to an RDBMS. Let's start with a file to maintain data about class schedules at an imaginary school. We keep the course_nbr, section_nbr, time_period, room_nbr, professor_name, student_name, student_major, and course_grade. Assume we started with a Pascal file, declared like this, and wanted to move it to a relational database. I picked Pascal because it is relatively easy to read and understand.
course_nbr: ARRAY [1:7] OF CHAR;
action: CHAR;
time_period: INTEGER;
room_nbr: INTEGER;
room_size: INTEGER;
professor_name: ARRAY [1:25] OF CHAR;
Students : ARRAY [1:class_size]
OF RECORD
student_name ARRAY [1:25] OF CHAR;
student_major ARRAY [1:10] OF CHAR;
course_grade CHAR;
END;
END;
To convert this sequential file of records into a table, we first have to "flatten" it out, making sure that each row has full information for all the columns. Since Pascal, or any other language, does not have NULLs, the columns will be NOT NULL. The first thought is to simply rewrite every field over as a column, a line-for-line translation, as I said earlier. It does not really work because a column is nothing like a field; did you notice that “Students” is itself an array of records whose size is determined by an external variable? This would have been done with an OCCURS clause in COBOL and other programming language have similar features.
But let's do it anyway.
(course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
time_period INTEGER NOT NULL,
room_nbr INTEGER NOT NULL,
room_size INTEGER NOT NULL,
professor_name VARCHAR(25) NOT NULL,
student_name VARCHAR(25) NOT NULL,
student_major VARCHAR(10) NOT NULL,
PRIMARY KEY (course_nbr, section_nbr, student_name));
We declare PRIMARY KEY (course_nbr, section_nbr, student_name) so we have a proper table. But what we are doing is hiding the Students record array, which has not changed its nature by being flattened.
To get this table in 1NF we "decompose" it into two or more tables so that no table has repeating groups, hidden or otherwise.
(course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
time_period INTEGER NOT NULL,
room_nbr INTEGER NOT NULL,
room_size INTEGER NOT NULL,
professor_name VARCHAR(25) NOT NULL,
PRIMARY KEY (course_nbr, section_nbr));
(student_name VARCHAR(25) NOT NULL,
course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
student_major VARCHAR(10) NOT NULL,
course_grade CHAR(1) NOT NULL,
PRIMARY KEY (student_name, course_nbr, section_nbr),
FOREIGN KEY (course_nbr, section_nbr)
REFERENCES Classes (course_nbr, section_nbr));
FOREIGN KEY (student_name)
REFERENCES Students (student_name));
(student_name VARCHAR(25) NOT NULL PRIMARY KEY,
student_major);
At this point, we are in 2NF. Every attribute depends on the entire key. Now if a student changes majors it can be done in one place. Furthermore, a student cannot sign up for different sections of the same class because we changed the key of Enrollment. Unfortunately, we have still have problems.
Notice that while room_size depends on the entire key of Classes it also depends on room_nbr. If the room_nbr is changed for a (course_nbr, section_nbr) we may also have to change the room_size and if the room_nbr is modified (we knock down a wall) we may have to change room_size in several rows in Classes for that room_nbr. Obviously, the room size is a property of the room.
Another normal form can address these problems. A table is in Third Normal Form (3NF) if for all X ? Y where X and Y are columns of a table then X is a key or Y is part of a "candidate" key (a key which is not necessarily the primary key). This implies that the table in 2NF since a partial key dependency is a type of transitive dependency. To get our example into 3NF we make the following decomposition:
(room_nbr INTEGER NOT NULL PRIMARY KEY,
room_size INTEGER NOT NULL);
(course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
room_nbr INTEGER NOT NULL,
time_period INTEGER NOT NULL,
PRIMARY KEY (course_nbr, section_nbr),
FOREIGN KEY (room_nbr)
REFERENCES Rooms (room_nbr));
(student_name VARCHAR(25) NOT NULL,
course_nbr CHAR(7) NOT NULL,
section_nbr CHAR(1) NOT NULL,
course_grade CHAR(1) NOT NULL,
PRIMARY KEY (student_name, course_nbr),
FOREIGN KEY (course_nbr, section_nbr)
REFERENCES Classes (course_nbr, section_nbr),
FOREIGN KEY (student_name)
REFERENCES Students (student_name));
(student_name VARCHAR(25) NOT NULL PRIMARY KEY,
student_major VARCHAR(10) NOT NULL);
A common misunderstanding about relational theory is that 3NF has no transitive dependencies. As indicated above, if X ? Y then X does not have to be a key. If Y is part of a candidate key. We still have a transitive dependency in the example - (room_nbr, time_period) ? (course_nbr, section_nbr) - but since the right side of the dependency is a key in is technically in 3NF. The unreasonable behavior this table structure still has is that several courses can be assigned to the same room_nbr at the same time_period.
The normal form that actually removes all transitive dependencies is Boyce-Codd Normal Form or BCNF. A table is in BCNF if for all X ? Y, X is a key; period. We can go to this normal form just by adding another key with UNIQUE (room_nbr, time_period) constraint clause to the table Classes. There are some other interesting and useful "higher" normal forms but that is out of the scope of this discussion. In our example we have removed all of the important anomalies with BCNF.
Some people will argue that it is all right to "denormalize" a schema for efficiency reasons. If we denormalize the schema, then we are either not enforcing some of the user-required constraints or we have to write addition code to enforce the constraint. If we choose to write additional code, we have to duplicate the functionality in every application, present and future. Normalization reduces programming effort; rules are enforced in one place, one way, one time.
Anomalies
If you do not normalize a table, you get data anomalies. These are situations where data integrity cannot be maintained without extra code, NULLs and convoluted programming. Let's create a simple example. The table holds information about college students, their majors and their departmental advisers.
(student_name CHAR(10) NOT NULL,
major CHAR(10) NOT NULL,
adviser_name CHAR(10) NOT NULL,
PRIMARY KEY (student_name, major, adviser_name));
VALUES ('Higgins', 'Comp Sci', 'Celko'),
('Jones', 'Comp Sci', 'Celko'),
('Smith', 'English', 'Wilson');
UPDATE Anomaly
Let's try to update Higgins to change from Computer Science to Sanskrit and suddenly Celko is advising Sanskrit!
SET major = 'Sanskrit'
WHERE student_name = 'Higgins';
The actual erroneous row is ('Higgins', 'Sanskrit', 'Celko').
DELETE Anomaly
Mr. Smith drops out of school and Professor Wilson loses his job!
WHERE student_name = 'Smith';
The row ('Smith', 'English', 'Wilson') happens to be the only row that tells us that Professor Wilson is the English department's faculty adviser. If this sample table and the English department had been bigger, this problem could have been hidden for a long time. But we would have been carrying redundant data.
INSERT Anomaly
Mr. F. L. Wright starts an Architecture department, but we cannot put him into the database without using NULLs or a dummy student name.
VALUES (NULL, 'Architecture', 'Wright')'
SQL will prevent this row from getting into the table the ways I declared this example. However, look at the way noobs actually declare tables. The ignorant will declare an IDENTITY as the key and all other columns are NULL-able. They do not know that since the table's IDENTITY property is not a column nor an attribute of any possible data model, it cannot ever be used as a key. It is the count of the physical insertion attempts. It is a way for a noob to mimic the record numbers he knows from magnetic tapes and other sequential files.
Boyce-Codd Normal Form (BCNF)
A table can be in 3NF, but still have problems. Let's declare an abbreviated table for mailing labels.
(street_address VARCHAR(35) NOT NULL,
city_name VARCHAR(25) NOT NULL,
zip_code CHAR(5) NOT NULL
CHECK (zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]'),
PRIMARY KEY (street_address, city_name)
);
The nature of US mailing addresses and the ZIP code gives us these two functional dependencies, ignoring the fact that many cities have the same name within different states.
zip_code ? city_name
The first FD can give us a key, but we are not enforcing the second one. Consider this:
VALUES ('100 Main Street', 'Springfield', '99999'),
('100 Main Street', 'Yorktown', '99999');
In this example, we have to add another table to implement the second FD. Notice the overlapping unique constraints.
(city_name VARCHAR(25) NOT NULL,
zip_code CHAR(5) NOT NULL UNIQUE
CHECK(zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]',
PRIMARY KEY (city_name, zip_code));
Now modify the original table:
(street_address VARCHAR(35) NOT NULL,
city_name VARCHAR(25) NOT NULL,
zip_code CHAR(5) NOT NULL
CHECK (zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]'),
PRIMARY KEY (street_address, city_name),
FOREIGN KEY (city_name, zip_code)
REFERENCES Zip_Cities (city_name, zip_code));
This is a bit elaborate, but it is all declarative code. Using overlapping constraints is a good technique that does not get used as often as it should. Let's go to the next normal form. BCNF removes all the transitive dependencies, not 3NF.
Fourth Normal Form
A table in BCNF can still have problems with multi-valued dependencies. Here is another overly simple table.
(item_nbr CHAR(13) NOT NULL,
dept_name CHAR(15) NOT NULL,
supplier_name CHAR(20) NOT NULL,
PRIMARY KEY (item_nbr, dept_name, supplier_name));
It is all key, so we have a BCNF table. Now let's enforce the following two multi-valued dependencies:
item_nbr ?? dept_name
item_nbr ?? supplier_name
The MVDs say that an item can be in one or more departments --- every department has trash cans. The second multi-valued dependencies say that an item can have different suppliers –-- trash can come from many suppliers. But there is a problem with this table. Imagine that I get a new shiny red trash can. I cannot do anything with it until I find some department that wants a shiny red trash can. I need to recognize that MVD constraints on the departments and suppliers are independent.
(item_nbr CHAR(13) NOT NULL,
dept_name CHAR(15) NOT NULL,
PRIMARY KEY (item_nbr, dept_name));
CREATE TABLE Suppliers_Inventory
(item_nbr CHAR(13) NOT NULL,
supplier_name CHAR(7) NOT NULL,
PRIMARY KEY (item_nbr, supplier_name));
Fifth Normal Form
You can have a n-ary relationship that cannot be decomposed into lower degree relations. If you ever have had a mortgage or made any major purchase that involved a lender, then this Shelton will look familiar.
(buyer_name CHAR(15) NOT NULL,
seller_name CHAR(15) NOT NULL,
lender_name CHAR(15) NOT NULL,
PRIMARY KEY (buyer_name, seller_name, lender_name));
INSERT INTO Escrows(buyer_name, seller_name, lender_name)
VALUES ('Smith', 'Jones ', 'National Bank'),
('Smith ', 'Wilson', 'Home Bank'),
('Nelson', 'Jones ', 'Home Bank');
Let's try to split this into three binary relations.
(buyer_name CHAR(15) NOT NULL,
seller_name CHAR(15) NOT NULL
PRIMARY KEY (buyer_name, seller_name));
(buyer_name CHAR(15) NOT NULL,
lender_name CHAR(15) NOT NULL,
PRIMARY KEY (buyer_name, lender_name));
(seller_name CHAR(15) NOT NULL,
lender_name CHAR(15) NOT NULL,
PRIMARY KEY (seller_name, lender_name));
The trouble is that when you try to assemble the original information by joining pairs of these three tables together, thus:
FROM BuyerLender AS BL, SellerLender AS SL, BuyerSeller AS BS
WHERE BL.buyer_name = BS.buyer_name
AND BL.lender_name = SL.lender_name
AND SL.seller_name = BS.seller_name;
This will add false rows, such as ('Smith', 'Jones', 'Home Bank'). These three tables should have been implemented as VIEWs and not as base tables.
Other Normal Forms and non-NF Redundancies
There are more normalizations, but they involve temporal consideration and other things. There are also redundancies that involve more than one table. The most common non-normal form redundancy is a summary column in a referenced table. In English, imagine the usual skeleton of an Orders table and its referencing order details table:
(order_nbr CHAR(10) NOT NULL PRIMARY KEY,
..,
order_amt_tot DECIMAL (12,2) NOT NULL,
CHECK(order_amt_tot > 0.00),
..);
CREATE TABLE Order_Details
(order_nbr CHAR(10) NOT NULL
REFERENCES Orders(order_nbr)
ON UPDATE CASCADE
ON DELETE CASCADE,
sku CHAR(15) NOT NULL,
PRIMARY KEY (order_nbr, sku),
unit_price DECMAL (8,2) NOT NULL
CHECK (unit_price > 0.00),
order_qty INTEGER NOT NULL
CHECK (order_qty > 0),
..);
...where order_amt_tot is the total amount of the order as computed by the query:
FROM Order_Details
GROUP BY order_nbr;
People who write an Orders table like this will be updating it constantly, often with procedural code in a TRIGGER or stored procedure.
So, do we have to learn a lot of math just to get a correct schema? Well, yes. But you can use some basic heuristics and feel that you have gotten it right 95% of the time in the real world.
Mother Celko's Heuristics
Heuristics are exact and often sound a little funny when you say them. One of my favorites is “if you cannot think of it as a percentage, think of it as a ratio!”
From a mathematical view point, a ratio and percentage are the same. But percentages have a base while the two things in a ratio are seen as equal in some sense.
The prime heuristic is that a table must model either a set of one and only one kind of entity or one and only one relationship. This is what I call disallowing a “Automobiles, Squids and Lady GaGa” table. Following this rule will prevent MVD problems and it is the basis for the other neuritics. Let's do the rest of the heuristics as bullet points with questions and cute phrases to help you remember the principle involved.
- Does the entity table make sense? Can you express the idea of the table in a simple collective or plural noun? To be is to be something in particular; to be everything in general or nothing in particular is to be nothing at all (this is known as the Law of Identity in Greek logic). This is why EAV does not work – it is everything and anything.
- Do you have all the attributes that describe the thing in the table? In each row? The most important leg on a three-legged stool is the leg that is missing.
- Are all the columns scalar? Or is a column serving more than one purpose? Did you actually put hat size and shoe size in one column? Or store a CSV list in it?
- Do not store computed values, such as (unit_price * order_qty). You can compute these things in VIEWs or computed columns.
- Does the relationship table make sense? Can you express the idea of the table in a simple sentence, or even better, a name for the relationship? The relationship is “marriage” and not “man_woman_legal_thing”.
- Did you check to see if the relationship is 1:1, 1:m or n:m? Does the relationship have attributes of its own? A marriage has a date and a license number that does not belong to aether the husband or the wife. This is why we don't mind tables that model 1:1 relationships.
- Does the entity or relationship have a natural key? If it does, then you absolutely have to model it as the PRIMARY KEY or a UNIQUE constraint. Is there a standard industry identifier for it? Let someone else do all that work for you.
- If you have a lot of NULL-able columns, the table is probably not normalized.
- The NULLs could be non-related entities or relationships.
- Do the NULLs have one and only one meaning in each column?
- If you have to change more than one row to update, insert or delete a simple fact, then the table is not normalized.
- Did you confuse attributes, entities and values? Would you split the Personnel table into “Male_Personnel” and ”Female_Personnel” by splitting out the sex code? No, sex is an attribute and not an entity. Would you have a column for each shoe size? No, a shoe size is a value and not an attribute.
Many years ago, there was a magazine named Database Programming & Design which published a poster on Normalization by Marc Rettig as a subscription renewal premium. It was so popular that it is still around after all this time. It follows one example, the Daisy Hill Puppy Farm database (this is the birthplace of Snoopy in the Peanuts comic strip).
You really ought to have a copy hanging in your cube. It is that good.
http://www.informationqualitysolutions.com/FreeStuff/rettigNormalizationPoster.pdf