A Normalization Primer
For most DBAs, normalization is an understood concept, a bread and butter bit
of knowledge. However, it is not at all unusual to review a database
design by a development group for an OLTP (OnLine Transaction Processing)
environment and find that the schema chosen is anything but
properly normalized. In some cases we'll see denormalized structures that
will indeed make sense from a performance standpoint, however, these are few and
far between. Usually when we see a denormalized structure, it wasn't done
on purpose.
Then it's time to go back to the development group and start talking with
some of the personnel about normalizing the schema for integrity and performance
reasons. In some cases we'll get developers that know exactly what we're
talking about and after a short meeting, everyone will be on the same page.
But then there are those times when as DBAs we'll say "Let's talk about
normalization," and we find that while we're dealing with decent
developers, they've not learned this foundation for good database design. A lot of the time
is spent working through normalization basics in order to get this foundation down. Without
the foundation, any improvements we suggest,
even if met with acceptance, won't be met with an understanding of why. As
a result, the next time around the developers will commit the same mistakes,
meaning we'll go through the process all over again. And so we come to the
purpose of this article: a short primer on normalization concepts to ensure that
necessary foundation for good database design.
What is Normalization and Why Do We Need It?
Normalization solves several issues for us. First and foremost, when we
normalize a database, we attempt to do the following:
- Eliminate Unnecessary Redundancy of Data
- Preserve Data Integrity
- Protect from Data Loss
Let's look at a fictional case study which can demonstrate all of these
things. I and a couple of colleagues have formed the DBA Sports Talent
Agency, a newly christened company representing pro athletes. We've managed
to accumulate a small stable of top talent. Chad, one of the agents, has
put together an initial database table (primary keys are underlined throughout
the article) with all of our talent and some basic
information about each one:
Table 1 (ClientList)
Client | Sport | Team | Salary | Agent |
Smith, John | Football | Tennessee Titans | 2500000 | Kelley, Brian |
Brown, Reggie | Baseball | Kansas City Royals | 750000 | Silva, Chad |
Johnson, Kyle | Football | Arizona Cardinals | 500000 | Kelley, Brian |
Scott, Richard | Hockey | Dallas Stars | 350000 | Ruelas, Anthony |
Ferguson, Martin | Baseball | New York Mets | 125000 | Silva, Chad |
Elsworth, Robert | Football | Tennessee Titans | 450000 | Kelley, Brian |
I need to make a clarifying point on redundancy and I'll do it with this
example: If we take a quick look at teams, we see that the Tennessee Titans are listed
twice. This is a text field, and it could have easily have gotten mis-keyed.
I know quite a few people who were voting for the Tennessee Tuxedos (after a
cartoon penguin) instead of the Titans and a moment's flashback could have
caused a serious issue. After all, if we were to look at all actual names
of teams out there, the Tennessee Tuxedos is not a valid name. As a result, a
typo could cause a domain integrity issue. Large repeated text fields can
be a big headache (although a single occurrence could result in a domain
violation). But while this repetition can represent a problem, this isn't the redundancy we're talking about.
Let's say that our client, Mr. Reggie Brown, was a top notch running back
coming out of college. Spurning the NFL, he signs a baseball
contract. A little into his first year, he decides he wants to play both
sports, believing he can excel at both. Our sports agency sees the money
and we quickly get him signed with the Oakland Raiders, and he'll pick up a
million and a quarter for half a season. But based on
Chad's original design, we have an issue. Let's assume that the Client is
the primary key for the table. We can't insert another Client entry
with Brown, Reggie in it. We've got one of two options if we stick with
this single table design:
- Double up within the fields (Table 2)
- Create additional fields (Table 3)
Here's an example of the first case. Notice that we've separated the
information in the fields using a semi-colon. Ok, this doesn't seem so
bad, right? As we start looking to manipulate the data we find out
differently. First and foremost, querying for all we
have signed who play baseball isn't straightforward. I can't simply issue
a SELECT Client FROM ClientList WHERE Sport = 'Baseball' any
longer.
Table 2 (ClientList version 2)
Client | Sport | Team | Salary | Agent |
Smith, John | Football | Tennessee Titans | 2500000 | Kelley, Brian |
Brown, Reggie | Baseball; Football | Kansas City Royals; Oakland Raiders | 750000; 1250000 | Silva, Chad; Kelley, Brian |
Johnson, Kyle | Football | Arizona Cardinals | 500000 | Kelley, Brian |
Scott, Richard | Hockey | Dallas Stars | 350000 | Ruelas, Anthony |
Ferguson, Martin | Baseball | New York Mets | 125000 | Silva, Chad |
Elsworth, Robert | Football | Tennessee Titans | 450000 | Kelley, Brian |
A second issue is I can no longer calculate total salary numbers
easily either. If I want to calculate Reggie's, I have to parse the field
to get the two salary amounts. Chad realizes these two problems and redesigns
the table again. Chad now tries the second option,
which ends up looking like this:
Table 3 (ClientList version 3)
Client | Sport | Team1 | Salary1 | Agent1 | Sport2 | Team2 | Salary2 | Agent2 |
Smith, John | Football | Tennessee Titans | 2500000 | Kelley, Brian | ||||
Brown, Reggie | Baseball | Kansas City Royals | 750000 | Silva, Chad | Football | Oakland Raiders | 12500000 | Kelley, Brian |
Johnson, Kyle | Football | Arizona Cardinals | 500000 | Kelley, Brian | ||||
Scott, Richard | Hockey | Dallas Stars | 350000 | Ruelas, Anthony | ||||
Ferguson, Martin | Baseball | New York Mets | 125000 | Silva, Chad | ||||
Elsworth, Robert | Football | Tennessee Titans | 450000 | Kelley, Brian |
This works in that we're not having to parse fields, but now
we're looking at a second set of columns just for one athlete.
There's a lot of wasted space. We can
now find out who plays baseball, but we have to do an extra comparison (we have to
check Sport2 as well). This is the redundancy we're worried about.
Another point I made is that good normalization prevents data
loss. Let's say Richard Scott gets a career-ending injury. He and
the Stars (with our help) agree to a lump-sum payment but that takes him off our books. Notice that Richard is the only athlete
whom Anthony manages. If we drop Richard off the list, we drop any
reference to Anthony as well. This raises a crucial point: without a
client, an agent isn't listed. As a result, if we go and hire Anthony's
brother, Henry, to manage our basketball stable, he doesn't show
up on the table until he signs the first client. With this current design,
we can run into the case where we're not able to represent everyone on our
staff. Our denormalized database schema, though it is only one table,
is causing us some serious issues. Now let's look at the first three
normal forms and what that does for us.
First Normal Form (1NF)
To bring a database schema into first normal form, we ensure
that there are no arrays, no redundant columns, no repeated structures
representing the exact same data. The first two attempts Chad made at
updating our ClientList table violates first normal form. ClientList
version 2 has a repeating group in several of the fields for Mr. Brown. We've
talked about how difficult it would be to query our baseball athletes in this
table. Consider also the difficulty we face preserving integrity. We
can't do simple comparisons against a field, but are faced with parsing the
fields to verify integrity. This was the other point of where
normalization can help us. By ensuring fields are atomic, we simplify the
procedures for integrity checks within our database.
ClientList version 3 has a duplicate set of columns which represent the same
type of information as the first set. To be first normal form, we need to
eliminate this redundancy. One way to do this is to separate the client
specific information from the contract information. Chad decides to form two
tables. The first is Client, and he adds a column named ClientID. It's a simple integer assigned for each client.
This'll be a surrogate key instead of making the text field Client (which he renames ClientName) the primary key. It's easier for us to use, and
there's always the possibility of getting two John Smiths, meaning ClientName in
its current form can't be guaranteed to be unique.
Client Table
ClientID | ClientName |
1 | Smith, John |
2 | Brown, Reggie |
3 | Johnson, Kyle |
4 | Scott, Richard |
5 | Ferguson, Martin |
6 | Elsworth, Robert |
Contract Table
ClientID | Team | Sport | Salary | Agent |
1 | Tennessee Titans | Football | 2500000 | Kelley, Brian |
2 | Kansas City Royals | Baseball | 750000 | Silva, Chad |
2 | Oakland Raiders | Football | 1250000 | Kelley, Brian |
3 | Arizona Cardinals | Football | 500000 | Kelley, Brian |
4 | Dallas Stars | Hockey | 350000 | Ruelas, Anthony |
5 | New York Mets | Baseball | 125000 | Silva, Chad |
6 | Tennessee Titans | Football | 450000 | Kelley, Brian |
Chad has handled the repetition of field information by simply
creating a row for Reggie's second sport in the Contract table. The
primary key is ClientID and Team (because of the complex way salaries work, it's
entirely possible for a player to be playing for one team, yet still be owed
salary by another team, so we need to key on team and not sport), and we see that this does make all the rows
unique. Our database schema is now in first normal form, but we still have some
issues.
Second Normal Form (2NF)
Our database design is in second normal form if every attribute describes the entire primary key. Now,
a given attribute could describe a combination of the primary
key plus other fields, but the whole primary key is the key (pun intended) to
second normal form. It's
obvious that Sport is not dependent on ClientID. It describes Team.
If we look, we see that each Team in our list has only one agent. Each
client does not. According to our business rules, agents describe teams
(actually sports, but we'll get into that in 3rd normal form), not
clients. Along those lines, Reggie has two agents. We can see that
each team has only one agent (as demonstrated by the Titans), but clients can
have more. Chad takes this into consideration and separates Contracts further:
Contract Table
ClientID | Team | Salary |
1 | Tennessee Titans | 2500000 |
2 | Kansas City Royals | 750000 |
2 | Oakland Raiders | 1250000 |
3 | Arizona Cardinals | 500000 |
4 | Dallas Stars | 350000 |
5 | New York Mets | 125000 |
6 | Tennessee Titans | 450000 |
Team Table
Team | Sport | Agent |
Tennessee Titans | Football | Kelley, Brian |
Kansas City Royals | Baseball | Silva, Chad |
Oakland Raiders | Football | Kelley, Brian |
Arizona Cardinals | Football | Kelley, Brian |
Dallas Stars | Hockey | Ruelas, Anthony |
New York Mets | Baseball | Silva, Chad |
Third Normal Form (3NF):
Third normal form states that all attributes describe only the
primary key (no non-key fields). We just discussed above that while Agent
describes Team, it also describes Sport. Each agent concentrates on one
sport. So we should split the Team table even further, in order to bring
it to third normal form. By the way, this is the perfect opportunity to
add Henry to represent our basketball interests.
Team Table
Team | Sport |
Tennessee Titans | Football |
Kansas City Royals | Baseball |
Oakland Raiders | Football |
Arizona Cardinals | Football |
Dallas Stars | Hockey |
New York Mets | Baseball |
Sport Table
Sport | Agent |
Football | Kelley, Brian |
Baseball | Silva, Chad |
Hockey | Ruelas, Anthony |
Basketball | Ruelas, Henry |
Finally our database schema is in third normal form. First we've eliminated
the repeated fields and the repeated values in the fields. That brought us
to first normal form. Then we pulled out any fields that didn't describe
the entire primary key (attributes that might have described a combination of
the primary key and additional fields are valid because it's the entire primary
key that is the requirement). That got us to second normal form. Finally,
we pulled out anything that depended on the primary key + something else.
That got us to third normal form. If we want to see the fruits of our
labor, here's the final normalized design:
Client Table
ClientID | ClientName |
1 | Smith, John |
2 | Brown, Reggie |
3 | Johnson, Kyle |
4 | Scott, Richard |
5 | Ferguson, Martin |
6 | Elsworth, Robert |
Contract Table
ClientID | Team | Salary |
1 | Tennessee Titans | 2500000 |
2 | Kansas City Royals | 750000 |
2 | Oakland Raiders | 1250000 |
3 | Arizona Cardinals | 500000 |
4 | Dallas Stars | 350000 |
5 | New York Mets | 125000 |
6 | Tennessee Titans | 450000 |
Team Table
Team | Sport |
Tennessee Titans | Football |
Kansas City Royals | Baseball |
Oakland Raiders | Football |
Arizona Cardinals | Football |
Dallas Stars | Hockey |
New York Mets | Baseball |
Sport Table
Sport | Agent |
Football | Kelley, Brian |
Baseball | Silva, Chad |
Hockey | Ruelas, Anthony |
Basketball | Ruelas, Henry |
Final Thoughts
Normalizing databases is not very difficult, especially when compared with
potential issues in the future. It assists us with data integrity, the elimination
of redundancy, and the prevention of data loss. We've covered the first
three normal forms here, because usually when a database is described as
normalized it is at least 3NF. Bringing a database to third normal form
may take a little work, but it'll save us some headaches in the end. It is
important that any database developer or any developer writing code against a
database understand the first three normal forms.
For performance
reasons (too many joins, for instance) we may end up denormalizing data.
However, that is a decision that has to be weighed carefully. For design,
it's best to bring a database to at least 3NF and then denormalize from there (with
great caution).
Also, one thing I should point out with respect to teams is that I'm making
the assumption that location + name makes the team unique, and I've left them as
one field. A better design would have been to split location and team name. After all, the Los Angeles Lakers were once the Minneapolis
Lakers. Along those lines, it is entirely possible for two teams with the
exact same location and name to exist for two different sports. For
instance: St. Louis Cardinals and New York Giants. At this point in time,
I can't think of any cases where that is true. So to keep things
simple, I left Team as a combination of Location and Name, and left it as a
satisfactory primary key. If I had been really starting a sports agency, I would make
split up Team into location and name and I was also add the sport to ensure
uniqueness.
But for demonstration purposes, I chose to keep it simple.
Finally, as I stated in the introduction, normalization is important in OLTP environments. However, when dealing with OLAP (OnLine
Analytical Processing) and warehousing solutions, there are other structures
that may be more efficient for the tasks that will be performed (star schema,
for instance). Normalization is important in an OLTP environment where
changes are occurring on a frequent basis. A warehousing solution usually
involves static data that is denormalized for speed in reporting. It's important
to remember what environment we're speaking of with respect to applying
normalization. A simple case of the right tool for the right
job.