First Normal Form (1NF) is the most basic normal form of relational database theory. Its purpose is to ensure that the database system has data that it can manipulate in a straightforward manner, so that querying, inserting, and updating data is not an unreasonably complex task for the database system. Essentially, it is a requirement that the individual units of data that the database is required to work with can always be treated by a database system as being free of any structure other than that defined by the system of data types understood by the system. An application may choose to store data that encodes extra structure in the database. Because such extra structure is not inherent in the data types supported by the database system, the database must not be required to do anything that needs knowledge of that structure.
1NF is the only one of the commonly recognised normal forms that is about the structure of a table per se, rather than about how that structure should relate to the rules of the world being modelled by the tables. 1NF doesn't care about the external world. It cares only about database internals. In effect, 1NF imposes two rules, a rule that the database doesn't need to know about anything other than database internals and another rule that each table must have a primary key.
It is sometimes claimed that it is impossible to violate 1NF in SQL, because SQL enforces 1NF. This is only true with a very narrow technical interpretation of 1NF under which many structures which most people regard as violating 1NF would be regarded as not violating it - a case of following the letter but not the spirit, some might say (I would say that neither the letter nor the spirit is being followed, since SQL does not force you to have a primary key).
The Definition of First Normal Form
Here is the definition of First Normal Form
A table is in 1NF when every column has a fixed simple data-type that is understood by the database system, and every row has a primary key.
Sometimes people use different definitions. This isn't because they are thinking of different concepts, but because they are emphasising different aspects of this definition and/or because they make different assumptions about what is already required in the data model for it to be relational (or for it to be a valid SQL table definition) . For example, people may leave off the primary key requirement because they know that the base relations of relational database theory are required to have a primary key; but that's a circular argument, as a relation in relational database theory is required to conform to 1NF. People may add a requirement that all rows have the same columns; but that really is part of the definition of a relation (even a mathematical relation, let alone a relational database relation). The "single simple value" is sometimes called an "atomic value", and sometimes the concept is expressed in terms of repeating groups. A repeating group is a value that is non-simple because it is a set of values - effectively the column contains more than one value in this row.
A Table which is not in 1NF
Tables that are not in 1NF ought to be rare, but they are a regular feature of badly designed SQL databases. First, we have tables without a primary key. These won't be further discussed here, except to say that every base table should have a primary key (temporary tables are maybe less of a concern). Then we have tables where the database system has to understand some attribute that has a complex type not understood by the system - a developer has chosen to encode some complex structure in a string, perhaps, and then decided that he needs to use parts of that structure in database operations.
So let's look at a table that might or might not violate the simple data- type rule of 1NF. Here are two rows from that table
Name | Telephones | Marital Status |
---|---|---|
Bill | 121 643 4321, 121 643 4497, | Married. Mary. 5th June 1985. |
Frank | 121 643 4324, 121 643 4497, 119 861 8183 | Single |
Is this or isn't it in 1NF? That depends on what the database is expected to do with it.
With the telephones column there are two possibilities.
Perhaps all the database has to do with the telephones column is deliver the values it holds to an app and allow an app to overwrite the value in that column with a new value. Then it is up to the application and not the database to decode that value into three numbers and know what they are (office phone, office fax, home phone). It is also the app's job to encode the three numbers into a value to put into the column - the database doesn't know anything about that encoding. The column is just a simple string value so far as the database is concerned. In that case there is no violation of 1NF by the telephones column - it's a repeating group as far as the application is concerned, but the database system doesn't need to know that.
The other possibility is that you want decoding and encoding done in the database instead of in the application. There will have to be code in queries to decode that complex string and extract the different numbers; and even then you won't get an efficient search on any of the numbers, as there's no way that decoding operation will permit use of an index to aid the search. The table violates 1NF, and the result is unpleasantly complicated query code and a heavy performance penalty.
With the Marital Status column, there are again two possibilities. Maybe the encoding and decoding of that string is always in the app, and the database just sees a single varchar value, in which case there is no violation of 1NF. Or perhaps the database will be asked to solve questions like "list all people who are married to someone less that 21 years old", in which case it has to decode that column and 1NF is violated again - and again there are code complexity and performance penalties.
Changing the table to be 1NF
Let's look at what to do in the case where both these columns are breaking the "single simple value" rule of 1NF because the database is expected to handle the decoding and encoding.
First the telephones column. If everybody has an office phone, an office fax number, and a home phone we can just turn the single telephones column into three columns: OfficePhone, OfficeFax, and HomePhone. What if some people don't have an office fax number? Some people would still use the three columns, using a silly number like 000 000 0000 to indicate "no such number". That's better than using NULL to indicate "no such number". Usually a still better solution (better by far) is to introduce a new table OfficeFax that has a name column and a FaxNumber column - someone who doesn't have an office fax number doesn't show up in this table, so the use of NULLs or silly values is avoided.
Next the Marital Status column. This column could be split into SpouseName and SpouseDoB but that would probably end up using NULL (or a zero-length string, which is marginally better than NULL) in the SpouseName column to indicate "unmarried". It's no good trying a silly string like 'zqxrdlsprrgj' because inevitably some spouse will legally change his or her name to 'zqxrdlsprrgj' just to mess with the system. This NULL or zero length string will also tell you to ignore the SpouseDoB column, because it's irrelevant. Another possibility is to keep the Marital Status column as well as the SpouseName and SpouseDoB column, and 'Single' in that column tells you to ignore the other two. Again, having an extra table MaritalStatus with columns Name, SpouseName, SpouseDoB will usually be the best bet. Names of unmarried people won't occur in that table, so names and DoB of nonexistent spouses don't have to be recorded as NULL or as anything else.
Nulls and 1NF
Don't let anyone tell you that table which has a nullable column does not, for that reason alone, conform to 1NF; this is a claim made only by people who don't understand domain theory. Equally, don't let anyone tell you that you should use NULLs thoughtlessly to remove repeating groups (or other non-simple types) in cases like those described above. Use of NULL in cases like these is really an attempt to invent a table where different rows have different numbers of columns, and should never be done without very careful evaluation of the alternatives. When such evaluation takes place, it will almost always indicate that using NULL is not the best solution.