Before you start to think about your database schema or tables, you need to consider your data: The type of data it is, the scale you use for values. It needs to be unique, precise and unambiguous. Then you need to name it in such a way that it can be generally understood. Joe Celko explains.
Not too surprisingly the first step in database design is the data. But the truth is that too many programmers do not spend any time on designing their data and begin designing the schema first. It does not work. What SQL handles is structured data as opposed to unstructured data like text or pictures. One of the basic concepts in RDBMS is what Dr. Codd called the Information Principle. This rule states that all data in an RDBMS is modeled as scalar values in columns in rows in tables.
This means that all the rows in a table have exactly the same structure. That means a properly designed table cannot have one row that models automobiles, another that models squids and a third that represents Lady Gaga.
Likewise, this means that each column in a row is a value for the same data element. A column does not change from shoe size, to temperature to something else in the table. It also means that the scale does not change – if you measured temperature in Celsius, then all values in the column use the Celsius scale.
Picking the scales and data types is important. It makes sense to do math and comparisons with some scales and not with others. This also explains why SQL is a strongly typed language. Some languages are weakly typed -- that is, a variable can change data types during program execution -- and some are strongly typed; that is, a variable keeps the same data type and changes only values unless you explicitly cast it to a new data type. Not all castings are legal. The classic example of the problems with weak typing was PL/I which would convert entire arrays from FLOAT to COMPLEX if you assigned a single array element the square root of a negative numbers. There was no warning message when this happened. There is an old joke about weak data types:
Teacher: "Billy, what is 6 times 9?"
Billy: "Ahhh, red?"
Teacher: "NO! , Sally, what is 6 times 9?"
Sally: "Thursday?"
Teacher: "NO! Tommy, what is 6 times 9?"
Tommy: "54."
Teacher: "Right! Now tell the class how you got the answer."
Tommy: "I divided red by Thursday!"
People who write in weakly typed languages often put a prefix or postfix on data element names to tell them what the original data type was in case they change type in the program. This is a violation of the ISO-11179 metadata rules. This standard comes down to the idea that you name a data element for what it is by its nature and not where it is located in storage or how it is used in one table.
The ISO_11179 format is “[<role>_]<attribute>_<property>” for data elements. A data element has one and only one name in the entire schema. Better, it should have one and only one name in the enterprise. Best of all, a data element has one and only one name in the universe. Let me tell another old joke:
"When I was a kid, we had three cats."
"What were their names?"
"Cat, cat and cat."
"That sounds screwed up; how did you tell them apart?"
"Who cares? Cats don't come when you called them anyway!”
We want the data to come with we call it. Logic has the Law of Identity, which is often abbreviated as “A is A” in the literature. More formally, we say:
- To be is to be something in particular; an entity has an identity.
- To be nothing in particular is to be nothing at all; an entity is unique and precise.
- To be anything in general is to be nothing at all; an entity is not ambiguous.
For example, there is no such thing as “id” – it is too vague and generic (is it an identifier for automobiles, squids or Britney Spears?). It is a property without an attribute. A better name would be “vehicle_id” if that is the attribute property we have. The best data element name is “vin”, which is universally used and defined by the ISO 4030 “Vehicle Identification Number” Standard. VIN is precise and well-understood.
Likewise, you will see attributes without properties. My personal favorite is “sex” which could mean “sex_code” (defined by ISO 5218, which allows it to be referred to by the designator "SEX", tho this is not a good idea), “sex_type” (biological options for animals and plants) or “sex_frequency” (a bit of wishful thinking on my part).
Perhaps the silliest error is a chain of properties. Think about what a “type_code_id” means. If it is an identifier, then it is unique and belongs to one entity in the data model (think “emp_id”). If it is a code, then there is an external authority for it (think “postal code””). If it is a type, then there is a test for it (think “blood type”). This is like a chain of adjectives without a noun to modify. Adding an attribute to the chain will not cure the ambiguity problem.
Again, the basic rule is that a data element name tells us what it is. The name does not tell us about:
- Its locations in the schema with a table name
- How it is used in a particular table (i.e. no pk- or fk- or vw- affixes in the names).
- How its data is physically stored (i.e. no "i-", "str-" etc. affixes. for integers, strings, etc).
- Do not tell us what the data element is not. Be affirmative and precise.
The property component is picked from a standardized list which you can add to as needed. The list becomes part of your data dictionary and needs to be enforced.
The <role> is used when the same data element appears in two or more roles in the same table. For example,a report on an organization might have “supervisor_emp_id” and “subordinate_emp_id” both drawn from the employee identification “emp_id” data element in the Personnel table.
Watch the length of names. Data element names that are too short and too long are both bad. An overly short name can be hard to understand unless it is a standardized abbreviation. They can also be ambiguous; my favorite example was a system that had several ways to abbreviate “student” in data element names, one of which was “std” - often mistaken for “standard” and “sexually transmitted disease” in various places.
Avoid all special characters in names. Stick to letters in the Latin-1 alphabet, digits and underscores. They will port to other programming languages. Yes, data element names are not just for SQL. Likewise, avoid quoted identifiers with either the Microsoft-only square brackets or the ANSI/ISO double quote marks. This is just sloppy programming done for display formatting in the data base instead of the front end. The one possible exception might be a language translation problem where the Latin-1 alphabet will not work.
Enforce capitalization rules to avoid case sensitivity problems. My rules are that SQL keywords are uppercase, scalar data elements are lowercase and schema objects are capitalized. I give the research for this set of conventions in my book SQL PROGRAMMING STYLE, but I will skip the details here.
This list of standardized property postfixes for data element names is built on Teradata's internal standards and common usage in the trade press (CMP, MKP, and other trade publishers). These postfixes have exact meanings. If you need to invent your own, then they need to be exact, too.
"_id" = identifier. It is unique in the schema and refer to one entity anywhere it appears in the schema. Uniqueness is not the same thing as being an identifier. The number pi is unique, but it does not identify an entity. Never use "<table name>_id"; that is a name based on location and tell you this is probably not a real key at all. Just plain "id" is too vague and screws up your data dictionary when you have a zillion of them. Obviously, auto-incrementing values are not identifiers. I will get into that fallacy later.
"_date" or “_dt" = date temporal dimension. It is the date of something -- employment, birth, termination, and so forth. >There is no such column name as just a date by itself that would be really awful since DATE is a reserved word in SQL. That would be really awful since DATE is a reserved word in SQL
"_nbr" or “_num" = tag number; this is a string of digits that names something. Do not use "_no" since it looks like the Boolean yes/no value. I prefer "nbr" to "num" since it is used as a common abbreviation in several European languages and the combination of similar shaped letters in “_num” are visually confusing.
"_name" or “_nm" = this is an alphabetic name and it explains itself. It is also called a nominal scale.
"_code" or "_cd"= A code is a standard maintained by a trusted source, usually outside of the enterprise. For example, ZIP code is maintained by the United States Postal Service. A code is well understood in its context, so you might not have to translate it for humans.
"_size" = an industry standard or company scale for a commodity, such as clothing, shoes, envelopes or machine screws.
"_seq" = sequence, ordinal numbering. This is not the same thing as a tag number, since it cannot have gaps.
"_tot" = a sum, an aggregated dimension which is logically different from its parts.
"_tally" = a count of values, an aggregated dimension. Also called an absolute scale.
"_status" = an internal encoding that reflects a state of being that changes in a known pattern. Consider martial status. You have to be born, then change status to legal age. You cannot be married to several people at the same time. You can change from married to divorced, and from divorced to married. You cannot be married if you are dead.
"_cat" = category, an encoding from an external source that that requires an official judgment. For example, a Category Five hurricane. This is not like a code, which requires no such official judgment.
"_class" = an internal encoding that does not have an external source. A class is a set of things that have some commonality; you have rules for classifying an animal as a mammal or a reptile. You may have some cases where it is harder to apply the rules, such as the egg laying mammal in Australia, but the exceptions tend to become their own classification -- monotremes in this example.
"_type" = an encoding that has a common meaning both internally and externally. Blood types have a new-defined testing procedure. Types are usually less formal than a class and might overlap. A type is the weakest of the three, and it might call for a judgment. In some states a three-wheeled motorcycle is licensed as a motorcycle. In other states, as an automobile. And in some states, as an automobile only if it has a reverse gear.
The three terms are often mixed in actual usage. so stick with the industry standard, even if violates the definitions given above.
"_addr" or "_loc" = An address or location for an entity. There can be a subtle difference between an address and location. An address can refer to a street address or some external geographical system. Location might refer to an internal scheme such as a warehouse bin number. The bin number can stay the same while its physical location is changed.
"_img" = An image data type, such as .jpg, .gif and so forth. It might be important to use the particular format as the property.
Again, feel free to add to this list as needed. But track and standardize what you do.
The Series
Read about the rest of The Stairway to Database Design Series and view the other articles.