Third Normal Form (3NF) is the last of the three usually remembered normal forms defined back in 1971. Like second normal form (2NF), its purpose is to ensure that the database schema definition prevents certain errors from occurring.
The idea of third normal form is that the value of a non-key attribute should not be deducible from the values of other non-key attributes. A non-key attribute is an attribute that isn't in any candidate key. A candidate key is a set of attributes that could be used as the primary key.
Like 2NF, 3NF is about real world properties of the objects being modelled by data in the database. It is an example of a class of real world business rules being encoded in the schema definition so that the database can enforce them directly.
Definition of 3NF
The easiest definition to understand is
A relation R is in 3NF if and only if (i) it is in 2NF and (ii) whenever X is a set of attributes of R and A is a non-key attribute of R that is not in X, the value of A can be determined from the values of X only if X includes a candidate key.
Actually, the first clause of this could refer to 1NF instead of 2NF, because given 1NF it's clear that (ii) implies (i); but it's traditional to say for each normal form that it requires the next lower normal form, to avoid messy proofs.
That was not the original definition, although it defines exactly the same property. The original definition said that the relation must be in 2NF and must not have any non-trivial transitive functional dependencies. Using that terminology would require too much definition for an introductory note like this.
Why go to 3NF?
To see what sort of problems can be prevented by using 3NF, we need to define some business rules and show how they can be violated in a table that doesn't conform to 3NF but not when the schema is changed so that all tables conform to 3NF.
We'll use the same Pest Control Product company as in the 2NF article, but look at part of the sales operation rather than development this time. The Sales Division operates several subdivisions, determined by the type of pest that a product is used for. Three of these are the Insects group, the Weeds group, and the Rodents group. The groups don't share storage depots because some years back some insecticide accidentally got contaminated by weed killer and the resulting compensation costs were rather large when the roses died along with the greenfly. The company has decided that only one group will have a storage depot in any given city, and that each group will have a single telephone ordering system with a single toll-free number (which doesn't call a fixed location: it calls the depot belonging to the group that is nearest to the calling phone's area code). The CFO's Personal Assistant has designed thrown together a set of tables to support this, part of which is the Sales_Depot table, which at some point in time holds the following data (amongst others: neither all columns nor all rows are shown).
City | Group | Address | Order_phone |
---|---|---|---|
Toytown | Insects | 40 Tree Rd | 800765 |
Bigtown | Rodents | 3 Main St | 800210 |
Oldtown | Weeds | The Orchard | 800284 |
Newtown | Insects | 1 New Way | 800765 |
Anytown | Rodents | 4 Old Rd | 800210 |
Erehwon | Weeds | 2 South St | 800284 |
That all looks OK, until one day the phone system is upgraded to use 7 digit numbers instead of 6 digits, and everyone has to change. The company headquarters "carefully" checks that all the new numbers work before updating the database and telling the depots to advise their customers of the new numbers. But "carefully" was't carefully enough. When someone is calling out the new numbers from a list for someone else to type them in, the Rodents group number for Bigtown gets misheard, with a "01" instead of a "10" somewhere in it. The testing is done only using the data for the first city (in alphabetical order) in each group, so for the Rodents group Anytown is tested but Bigtown isn't and the error isn't spotted. Here is how that part of the table ends up, with the incorrect data highlighted in yellow:
City | Group | Address | Order_Phone |
---|---|---|---|
Toytown | Insects | 40 Tree Rd | 8007650 |
Bigtown | Rodents | 3 Main St | 8002010 |
Oldtown | Weeds | The Orchard | 8002840 |
Newtown | Insects | 1 New Way | 8007650 |
Anytown | Rodents | 4 Old Rd | 8002100 |
Erehwon | Weeds | 2 South St | 8002840 |
Therefore, 8002010 is the number sent out to customers of the Rodents group in Bigtown in a notice advising that the new number is available, and that some time after that, the old number will stop working. Of course all the customers carry on using the old number, until it stops working; and then they have a new number that doesn't work as expected (it is the number of a pet shop specialising in white mice). This is far from good for sales, for customer relations, or for the company's reputation as a careful and competent supplier.
The business rule states that a group has only one order phone, so the order phone number can be deduced from the Group attribute of the row, but here we have two rows with the same group but different order phones - the business rule has been broken. This can only happen because the order phone depends directly on the group according to the business rule, but the group attribute is not part of any candidate key so that means that the table is not in 3rd normal form. The two rows for the Rodents group have different phone numbers, which violates the business rule but does not violate any key constraints. The same order phone information has to be recorded more than once, in however many rows refer to the group that uses that phone. This failure to achieve 3NF means that the datum can end up being different in two places, despite the business rule saying it must not be. This is unwanted and dangerous redundancy, and permits inconsistencies that can have undesirable consequences.
The 3NF Version
To get this part of the schema into 3NF, it is necessary to move the data that is dependent on something other than the key into a separate table where what it is dependent on is the key. Removing that column from the Sales-Depot table leaves this (before the phone numbers update)
City | Group | Address |
---|---|---|
Toytown | Insects | 40 Tree Rd |
Bigtown | Rodents | 3 Main St |
Oldtown | Weeds | The Orchard |
Newtown | Insects | 1 New Way |
Anytown | Rodents | 4 Old Rd |
Erehwon | Weeds | 2 South St |
A new table, the Sales-Group table is needed to hold the Order_phone information; this looks like this
Group | Order_phone |
---|---|
Insects | 800765 |
Rodents | 800210 |
Weeds | 800284 |
Group | Order_phone |
---|---|
Insects | 8007650 |
Rodents | 8002100 |
Weeds | 8002840 |
Now when the phone number update takes place, the order phone number for the Rodents sales group is only in one place in the database, so inconsistencies like the one described above can no longer occur. It is either right or wrong, and because it is in only one place it can't be right in one place but wrong in another.
Of course, errors can be avoided by insisting that things are updated only using stored procedures that update this column for every row containing the affected group but that means winning the fight to ban ad-hoc updates, which is often unwinnable. Perhaps they can be avoided by having a trigger that updates all the remaining rows that contain this number whenever one of them is updated, but that requires you to make sure RECURSIVE TRIGGERS is OFF, which may not be acceptable. Of course these approaches only work when there not too many rows. If a large number of rows are affected, there may be a performance issue as well. Besides, it is usually a bad idea to write extra code to fix problems that wouldn't exist if you chose the right table structures.
Another thing that isn't made clear by a small scale example like this one is that going to 3NF often makes a significant reduction in the size of the data, because it reduces data redundancy. With real world data, enforcing 3NF can sometimes deliver extra performance because the IO volume is vastly reduced and, when store is in short supply, it gives even more extra performance because working sets are reduced.