There are several types of tables, each with its own requirements for rules and integrity constraints. Whatever the requirement, table-level constraints will ensure that the rules are enforced and data integrity is maintained.
In level one, we named data elements for what they are and classified them. In level two, we modeled the data elements with data types and constraints in SQL to give us rows. In level three, we are going to put these rows into tables. A table is more than just a bunch of rows collected together under one name.
A column can appear only once in a table. That only makes sense; if you recorded someone's shoe size twice, it would be redundant at best and ambiguous at worst, when the two columns disagree. Now we can have table-level CHECK() constraints among the columns of each row. They are not really that much different from the one-column CHECK() constraints we had before. They can be named and will appear in the list of column declarations in the CREATE TABLE statement, not attached to any one column. For example:
CHECK (emp_birth_date < emp_hire_date)
It is often a good idea not to combine constraints into one huge CHECK() clause. The error messages will include the constraint name, so separate constraints will give you a better idea of what went wrong than a single monster named “Bad_Things_Happened” constraint.
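As a sketch of this advice (the table, column, and constraint names here are illustrative, and DATEDIFF is a SQL Server function), two small named constraints beat one anonymous blob:

```sql
CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY,
 emp_birth_date DATE NOT NULL,
 emp_hire_date DATE NOT NULL,
 -- each rule gets its own name, so the error message tells you which one failed
 CONSTRAINT Born_Before_Hired
  CHECK (emp_birth_date < emp_hire_date),
 CONSTRAINT Hired_As_Adult
  CHECK (DATEDIFF(YEAR, emp_birth_date, emp_hire_date) >= 18));
```

A violation of either rule now reports `Born_Before_Hired` or `Hired_As_Adult` by name instead of a generic constraint failure.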
Continuing our hatred of redundancy, at the table level we want each row to be unique for the same reasons. This can be done with table constraints. The two table level constraints are UNIQUE and PRIMARY KEY, which come in both single and multiple column versions.
The UNIQUE constraint says that the column or combination of columns is unique within the table. But if there is a NULL in one or more of the columns, we will allow it as if it were a unique value. The PRIMARY KEY declaration has the same effect as a NOT NULL and UNIQUE for all the columns in it. But for historical reasons, a table can have only one PRIMARY KEY declaration. These columns are used as defaults for other constraints between tables, but don't worry about that for now.
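A minimal sketch of the difference (the badge and parking-space columns are hypothetical; note that SQL products disagree on how many NULLs a UNIQUE column may hold — the standard allows many, SQL Server allows one):

```sql
CREATE TABLE Badges
(badge_nbr INTEGER NOT NULL PRIMARY KEY, -- PRIMARY KEY implies NOT NULL and UNIQUE
 parking_space_nbr INTEGER,              -- NULL-able: not everyone gets a space
 UNIQUE (parking_space_nbr));            -- but no two badges share one space
```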
How uniqueness constraints are used depends on the type of table involved. Broadly speaking, we can classify a table as one of three kinds:
- Entity
- Relationship
- Auxiliary
An entity table is a set of things of the same kind which are defined by the attributes modeled by the columns. Each row is an instance of that kind of thing. Each row has the same columns. If you can see it, feel it, or touch it, then it is an entity. The name of an entity table should not be singular (unless there really is only one member of this set) because it models a set. The name needs to be plural or, if possible, collective. For example, “Employee” is bad, “Employees” is better and “Personnel” is best. “Tree” is bad, “Trees” is better and “Forest” is best. You can add your own examples.
Entities are also classified as weak or strong. A strong entity exists on its own merit, while a weak entity exists because of one or more strong entities. You need to have a purchase before you can have a discount.
A relationship table references one or more entities and establishes a relationship among them. A relationship can have attributes of its own in addition to references to entities. The marriage license number belongs to the marriage, not the husband, wife or minister.
The degree of a relationship is the number of entities in the relationship. Binary relationships have two entities and we like them in the real world because they are simple. A recursive binary relationship relates an entity to itself. The general n-ary relationship involves n entities, such as a home mortgage with a buyer, seller and lender. It is not always possible to decompose n-ary relationships into binary relationships. Membership in the relationship can be optional or mandatory. The optional membership means we can have zero entities of one kind – a purchase does not always get a discount.
The cardinality of a relationship is the actual number of related occurrences for each of the two entities. The basic types of connectivity for relationships are: one-to-one, one-to-many, and many-to-many. These terms are usually qualified with optional (0 or more) or mandatory (1 or more) memberships.
A one-to-one (1:1) relationship is when at most one instance of an entity A is associated with one instance of entity B. For example, take the relationship between a traditional husband and wife. Each husband has one and only one wife; each wife has one and only one husband. Both are mandatory in this example.
A one-to-many (1:n) relationship is when for one instance of entity A, there are zero, one or many instances of entity B but for one instance of entity B, there is only one instance of entity A. An example might be that a department has many employees; each employee is assigned to one department. Depending on your business rules, you might allow an unassigned employee or an empty department.
A many-to-many (m:n) relationship, sometimes called non-specific, is when for one instance of entity A, there are zero, one, or many instances of entity B and for one instance of entity B there are zero, one, or many instances of entity A. An example might be pizzas and customers.
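The usual way to model an m:n relationship is a table that references both entities, with neither foreign key unique by itself. A hypothetical sketch for the pizzas-and-customers example (the Pizzas table is assumed; Customers uses the DUNS key declared later in this article):

```sql
CREATE TABLE Pizza_Orders
(customer_duns CHAR(9) NOT NULL
  REFERENCES Customers(customer_duns),
 pizza_nbr INTEGER NOT NULL
  REFERENCES Pizzas(pizza_nbr),
 -- only the pair is unique: a customer can order many pizzas,
 -- and a pizza style can be ordered by many customers
 PRIMARY KEY (customer_duns, pizza_nbr));
```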
An auxiliary table is neither an entity nor a relationship; it provides information. Auxiliary tables are things like calendars or other look-up tables that replace computations in SQL. They are often misunderstood and get treated like entity or relationship tables.
Let's make this more concrete. A sales order is a relationship between a customer (entity) and our inventory (entity). The order details are a weak entity that exists because we have an order. The relationship has an order number that is not part of the inventory or customer. The shipping costs are obtained from an auxiliary table. Here are some skeleton tables for this example. I am using the GTIN (Global Trade Item Number) for order items and the DUNS (Data Universal Numbering System) for the customers. Always look for industry standards when you design a database.
CREATE TABLE Sales_Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY
CHECK (order_nbr > 0),
customer_duns CHAR(9) NOT NULL,
order_shipping_amt DECIMAL (5,2) NOT NULL
CHECK (order_shipping_amt >= 0.00),
etc);
CREATE TABLE Sales_Order_Details
(order_nbr INTEGER NOT NULL,
gtin CHAR(15) NOT NULL,
PRIMARY KEY (order_nbr, gtin),
item_qty INTEGER NOT NULL
CHECK (item_qty > 0),
item_unit_price DECIMAL (8,2) NOT NULL
CHECK (item_unit_price >=0.00));
CREATE TABLE Customers
(customer_duns CHAR(9) NOT NULL PRIMARY KEY
CHECK (customer_duns LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'),
etc);
CREATE TABLE Inventory
(gtin CHAR(15) NOT NULL PRIMARY KEY
CHECK (gtin LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'),
onhand_qty INTEGER NOT NULL
CHECK (onhand_qty >= 0),
etc);
We can see that the Sales Order is a relationship between the Customers and Inventory. Orders have their own key (order_nbr), but there is nothing to force us to use only valid customer DUNS numbers or product GTIN codes for things we actually have in inventory. In fact, I can insert obviously invalid DUNS and GTIN codes into the Sales_Orders table the way things are declared now.
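To make that failure concrete, here is a sketch of a nonsense insert that the skeleton as declared will happily accept (the values are deliberately invalid):

```sql
-- With no REFERENCES clause, nothing checks that this customer exists;
-- the CHAR(9) column accepts any nine characters
INSERT INTO Sales_Orders (order_nbr, customer_duns, order_shipping_amt)
VALUES (1, 'XXXXXXXXX', 4.95);
```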
This is where the REFERENCES clause comes in. It is what lets us enforce all that cardinality and degree stuff from the data model. A reference is not a link or a pointer. Those were physical concepts; a reference is a logical concept, and we have no idea how it is implemented. What it enforces is a rule that the referencing table columns match a single row in the referenced table. That means the row in the referenced table has to be unique; by default, the PRIMARY KEY in the referenced table is the target, but it does not have to be. The values in the referencing table are called Foreign Keys – they are not keys in their table, but somewhere else in the schema.
Here is the skeleton schema with more flesh on it:
CREATE TABLE Sales_Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY
CHECK (order_nbr > 0),
customer_duns CHAR(9) NOT NULL
REFERENCES Customers(customer_duns),
order_shipping_amt DECIMAL (5,2) DEFAULT 0.00 NOT NULL
CHECK (order_shipping_amt >= 0.00),
etc);
CREATE TABLE Sales_Order_Details
(order_nbr INTEGER NOT NULL
REFERENCES Sales_Orders(order_nbr),
gtin CHAR(15) NOT NULL
REFERENCES Inventory(gtin),
PRIMARY KEY (order_nbr, gtin), -- two-column key
item_qty INTEGER NOT NULL
CHECK (item_qty > 0),
item_unit_price DECIMAL (8,2) NOT NULL
CHECK (item_unit_price >= 0.00));
CREATE TABLE Customers
(customer_duns CHAR(9) NOT NULL PRIMARY KEY
CHECK (customer_duns LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'),
etc);
CREATE TABLE Inventory
(gtin CHAR(15) NOT NULL PRIMARY KEY
CHECK (gtin LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'),
onhand_qty INTEGER NOT NULL
CHECK (onhand_qty >= 0),
etc);
Notice that we only had to have CHECK() constraints on the places where DUNS and GTIN were keys, not where they appear in the referencing tables. The entity tables, Customers and Inventory, are referenced; the relationship table, Sales_Orders, references other tables. This is a general pattern, but it is not set in concrete.
The multiple column form of this clause looks like this:
REFERENCES Sales_Order_Details(order_nbr, gtin)
The columns in the FOREIGN KEY sub-clause of the referencing table have to match the referenced key, column for column, but they can have different names. I can get the 1:1, 1:n and n:m relationships by placing uniqueness constraints in the right places. As an example of an auxiliary table, we might compute the shipping costs based on the total value of the order. The table could look like this:
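A hypothetical sketch of the full multi-column form (the Shipment_Items table and its renamed columns are invented for illustration; only the count, order, and data types of the columns have to match Sales_Order_Details' key):

```sql
CREATE TABLE Shipment_Items
(shipment_nbr INTEGER NOT NULL,
 ordered_order_nbr INTEGER NOT NULL, -- different name, same type as order_nbr
 ordered_gtin CHAR(15) NOT NULL,     -- different name, same type as gtin
 FOREIGN KEY (ordered_order_nbr, ordered_gtin)
  REFERENCES Sales_Order_Details(order_nbr, gtin));
```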
CREATE TABLE Shipping_Costs
(start_order_amt_tot DECIMAL (10,2) NOT NULL,
end_order_amt_tot DECIMAL (10,2) NOT NULL,
CONSTRAINT Valid_Shipping_Range
CHECK (start_order_amt_tot < end_order_amt_tot),
PRIMARY KEY (start_order_amt_tot, end_order_amt_tot),
shipping_amt DECIMAL (5,2) NOT NULL
CHECK (shipping_amt > 0.00));
While we have declared a PRIMARY KEY on the auxiliary shipping costs table, it is not like the keys for entities – there is no validation or verification, it is not an identifier. To use this table, we will query with something like:
SELECT shipping_amt
FROM Shipping_Costs
WHERE <order amount total> BETWEEN start_order_amt_tot AND end_order_amt_tot;
As an exercise, try to write a constraint that will prevent the start and end ranges from overlapping and from having gaps. You can re-design the table if you need to.
In the revised skeleton schema, when you try to take an order for a product that is not in inventory, you will get an error that says, in effect, “it is out of stock!” and you can try something else. But if you try to delete a product from Inventory, you will also get an error that says, in effect, “Hey, someone ordered this junk”, so you have to go to each order and replace the item with something else or make it NULL (if allowed) before you can delete it from Inventory.
This is where Declarative Referential Integrity (DRI) actions are used. The syntax is:
ON DELETE [NO ACTION | SET DEFAULT | SET NULL | CASCADE]
ON UPDATE [NO ACTION | SET DEFAULT | SET NULL | CASCADE]
The delete and update are called “database events”; when they happen to the table, then the DRI action occurs.
- NO ACTION = the transaction is rolled back and you get a message. This is the default when you just have a simple REFERENCES clause.
- SET DEFAULT = the referenced column(s) are changed by the event, but the referencing columns are changed to their default values. Obviously, the referencing columns need to have defaults declared on them, and those default values have to appear in the referenced table or the constraint will fail.
- SET NULL = the referenced column(s) are changed by the event, but the referencing columns are changed to NULLs. Obviously, the referencing columns need to be NULL-able. This is where the “benefit of the doubt” for NULLs comes in.
- CASCADE = the referenced column(s) are changed by the event, and those same values are cascaded over to the referencing columns. This is the most important option in practice. For example, if we want to discontinue a product, we can delete it from Inventory and the ON DELETE CASCADE will make the SQL engine automatically delete the matching rows in Sales_Order_Details. Likewise, if you update an item in Inventory, ON UPDATE CASCADE will automatically replace the old value with the new one wherever it is referenced.
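For contrast with CASCADE, here is a sketch of how SET NULL might be declared on the customer reference (a hypothetical variation, not part of the final skeleton below): keeping the order row but blanking out the customer when that customer is deleted. Note the column must be NULL-able for this to work.

```sql
customer_duns CHAR(9) -- no NOT NULL here, so SET NULL has somewhere to go
 REFERENCES Customers(customer_duns)
 ON DELETE SET NULL
 ON UPDATE CASCADE,
```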
After any of these actions are performed, the referential integrity constraints are still valid. Here is the final skeleton:
CREATE TABLE Sales_Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY
CHECK (order_nbr > 0),
customer_duns CHAR(9) NOT NULL
REFERENCES Customers(customer_duns)
ON UPDATE CASCADE
ON DELETE CASCADE,
order_shipping_amt DECIMAL (5,2) DEFAULT 0.00 NOT NULL
CHECK (order_shipping_amt >= 0.00),
etc);
CREATE TABLE Sales_Order_Details
(order_nbr INTEGER NOT NULL
REFERENCES Sales_Orders(order_nbr)
ON UPDATE CASCADE
ON DELETE CASCADE,
gtin CHAR(15) NOT NULL
REFERENCES Inventory(gtin)
ON UPDATE CASCADE
ON DELETE CASCADE,
PRIMARY KEY (order_nbr, gtin), -- two-column key
item_qty INTEGER NOT NULL
CHECK (item_qty > 0),
item_unit_price DECIMAL (8,2) NOT NULL
CHECK (item_unit_price >= 0.00));
See if you can figure out what happens when:
- A customer dies and we delete him.
- We change the Lawn Gnome statue to the more tasteful Pink Flamingo.
- We discontinue the Pink Flamingo.
- Someone tries to order a Lawn Gnome after steps 1 to 3.
Obviously, I am leaving out restocking issues and other things, but we will get to those.
The Series
Read about the rest of The Stairway to Database Design Series and view the other articles.