Second normal form (2NF) requires first normal form and one extra condition: if X is a set of prime attributes and A is a non-prime attribute and the value of A is determined by the values of X then X contains the whole of at least one candidate key. An attribute is prime if it is part of any candidate key, and non-prime otherwise. A Candidate Key is a set of columns that could reasonably be used as the Primary Key. The Primary Key is a set of columns which don't permit NULLs and which is guaranteed not to have the same set of values in two different rows, and for which no subset has these properties. This ensures that certain kinds of incorrectness can't creep into the database through erroneous inserts or updates though it cannot, of course, prevent all erroneous inserts and updates.
The extra condition for 2NF is sometimes informally expressed as saying that the value of every non-key column depends on the whole of the key. Saying that each column depends on the whole key means that the real business entities which the database models are such that the non-key attributes depend on all the key attributes, not on just some of them. This is an example of a real-world business rule being modelled in the structure of the database in order to ensure that certain errors can't occur, which is what all of 2nd, 3rd, 4th and 5th normal forms are intended to do.
A table not in 2NF
To see the kind of errors that are prevented by the use of 2NF we have to look at what can go wrong when a table is not in 2NF, which means we have to look at what kind of business rule can be violated if such a table is used. We will state a business rule and show how an update can cause a table not in 2NF to violate this rule, and then show how to change the table structure so that the rule cannot be violated.
We will use a Current_Assignment table, a table which shows for each employee working on a project what proportion of his time he spends on that project. This allows that time to be charged to the project. As charges are made in money terms, we need to know what the employee costs (salary and overhead). In this company salary is determined by the Pay Grade of the employee and his time in that grade (measured in years), and overheads are charged at 100% of salary. It is of course an absolute rule of the business that an employee can have, at any given time, only one Pay Grade and only one time in grade, and that every project that uses an employee in a given pay period sees the same charging rate for that employee.
The DBA at this company (actually the CFO's Personal Assistant, whose qualification for the DBA role is that once, a quarter of a century ago, she wrote a dbase II query) decides to put all the data needed into the Current_Assignment table, as this seems the simplest solution. She scripts the table like this:
CREATE TABLE Current_Assignment (
EmployeeID int NOT Null,
ProjectID int NOT Null,
TimeProportion NOT Null, -- percent of employee's time allocated to project
PayGrade int NOT Null,
YearsInGrade int NOT Null,
CONSTRAINT PK_Current_Assignment Primary Key Clustered
(ProjectID, EmployeeID)
)
Since an employee can be working on several projects, and each project can be using several employees, (EmployeeID ,ProjectID) is the only candidate key; so it has been assigned as the primary key.
Here the PayGrade and TimeInGrade columns are dependent on only part of the key (the EmployeeID); they are not dependent on all of it, because they have to be the same for a given employee whichever project he works on: so the table is not in 2NF. Let's see how a problem can arise as a result.
At some point in time, part of the contents of the table are as follows:
EmployeeID | ProjectID | TimeProportion | PayGrade | YearsInGrade |
---|---|---|---|---|
Bill Hacker | New MouseTrap | 50 | 15 | 3 |
Buck Bossman | New MouseTrap | 40 | 21 | 1 |
Penny Drudge | New MouseTrap | 100 | 12 | 2 |
Bill Hacker | Better Flypaper | 50 | 15 | 3 |
Buck Bossman | Better Flypaper | 50 | 21 | 1 |
David Drudge | Better Flypaper | 100 | 13 | 1 |
Charlie Fixit | Better Flypaper | 10 | 12 | 3 |
Buck Bossman | Roach Eliminator | 10 | 21 | 1 |
Charlie Fixit | Roach Eliminator | 90 | 12 | 3 |
Joe Bottom | Roach Eliminator | 100 | 9 | 12 |
Employee Buck Bossman (the group manager in charge of the projects and people shown) decides to give employee Charlie Fixit a pay raise, and updates the table so that the row with primary key (Roach Eliminator,Charlie Fixit) has PayGrade 13 and YearsInGrade 0. This happens because Buck associates Charlie with the Roach Eliminator Project, where he spends 90% of his time, and forgets that he also does some work on the Better Flypaper project. So now Charlie has different Pay Grade and Years in Grade in different rows of the table.
When a query is made of the updated table to discover Charlie Fixit's PayGrade and YearsInGrade, there are two rows with two different values, and it is a toss-up which pair of values will be delivered as the result. That's rather unfortunate if this is to be used to look up an actual salary in a table mapping grade and years in grade to money (Charlie will not be pleased if he gets paid at his old, lower, rate this month). When Charlie's time is charged to projects, Better Flypaper and Roach Eliminator will see different charge rates for the same person. The update has introduced a serious error into the database, violating a simple business rule.
Fixing the problem
If the schema is modified so that the table is in 2NF, this error can no longer occur. To effect this, the information which is dependent on only part of the key must be split out into a separate table, where that part key is the whole key. The new table definition scripts are
CREATE TABLE PayGrade_Detail (
EmployeeID int NOT Null Primary Key Clustered,
PayGrade int NOT Null,
TimeInGrade int NOT Null,
)
CREATE TABLE Current_Assignment (
EmployeeID int NOT Null REFERENCES PayGrade_Details (EmployeeID),
ProjectID int NOT Null,
TimeProportion NOT Null, -- percent of employee's time allocated to project
CONSTRAINT PK_Current_Assignment Primary Key Clustered
(ProjectID, EmployeeID)
)
The content corresponding to that shown above will now be
EmployeeID | ProjectID | TimeProportion |
---|---|---|
Bill Hacker | New MouseTrap | 50 |
Buck Bossman | New MouseTrap | 40 |
Penny Drudge | New MouseTrap | 100 |
Bill Hacker | Better Flypaper | 50 |
Buck Bossman | Better Flypaper | 50 |
David Drudge | Better Flypaper | 100 |
Charlie Fixit | Better Flypaper | 20 |
Buck Bossman | Roach Eliminator | 10 |
Charlie Fixit | Roach Eliminator | 80 |
Joe Bottom | Roach Eliminator | 100 |
EmployeeID | PayGrade | YearsInGrade |
---|---|---|
Bill Hacker | 15 | 3 |
Buck Bossman | 21 | 1 |
Penny Drudge | 12 | 2 |
David Drudge | 13 | 1 |
Charlie Fixit | 12 | 3 |
Joe Bottom | 9 | 12 |
PayGrade and YearsInGrade are now in a table where the only candidate key is (EmployeeID); and both are dependent on that key, not just on some part of it (in this case that's obvious: the key doesn't have multiple parts). So with this pair of tables instead of the original single table, all tables are in 2NF.
As a result, the update to change the PayGrade for employee Charlie now affects only one row, so it can't introduce contradictory information; the information is in one place only, so it can't have different values in different places. Changing to 2NF has eliminated some error-prone redundancy, and it is no longer possible to make the data fail to conform to the business rule that could be broken before.
Of course in this example there may well be an employee table into which the PayGrade information can be moved, instead of creating a new table; and people often think that 2NF is a lot of fuss about nothing when given examples like this, often because "no experienced DBA would make an error like that" and just as often because "you can fix it trivially by writing a single update statement which updates all the rows that should be affected"; the first is just plain wrong: people do make this mistake, and the second is not so easy if the app is to display a view on a screen and update is by overtype. If the view is a not-2NF relational view built by joining the two 2NF tables, an update through the view does indeed solve the problem, but that requires the base tables to be in 2NF.
Multiple candidate keys
The example above is simple partly because there is only one Candidate Key in the table. When checking for violations of 2NF it is essential to look at all Candidate Keys, not just the Primary Key, because the same error-prone redundancy can arise there too. If there is a UNIQUE constraint or index on a set of columns that don't permit NULL values that indicates that there is a Candidate Key other than the Primary Key. Of course the DBA may forget to specify a UNIQUE constraint when some column set is actually unique (and free of NULLs) according to the business rules (this is bad schema design - such constraints should always be specified to let the database prevent what errors it can, and having appropriate constraints can ensure that database operations can't cause this column set to contain duplicates) but that column set is still a Candidate Key because the business rules say it is, even though the constraint wasn't specified in the schema.