Introduction
Creating databases can be scary. The Internet is awash with rules, Best Practices, and advice from "experts." But the truth is, anyone with a little patience and the proper tools can create a database. In this series of articles, we will cover database creation from a practical (and more real-world) viewpoint. These articles assume some basic knowledge of SQL Server and T-SQL.
Database Ho!
Rather than rehashing AdventureWorks, we'll start with a simple passion project of mine: the comic book database. I have a lot of comics, new and old. This occasionally leads to buying duplicates when I am filling gaps in my collection or, worse, behind on my reading. To avoid buying duplicates, it makes sense to start a list. This is a simple project. Comic books have three basic attributes. That being the case, we won't bother with a database model. Who needs it? I'm not tracking every comic book ever written, just the ones I've purchased. Despite the physical size of my collection, the database probably won't be very big, so we'll start small and size up from there if we really do need the space.
It's a good practice to start off with a DROP statement before the CREATE statement. Unless there is a table or database where the data must be kept, I avoid ALTER. SQL Server doesn't track when objects were last changed; it tracks create dates. Using DROP / CREATE statements lets us see the last time someone updated the object and has the added benefit of keeping the server clean. Not to mention that trying to CREATE an object that already exists raises an error.
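As a side note, if you're on SQL Server 2016 or later, the check-then-drop pattern can be collapsed into a single statement with DROP ... IF EXISTS. A minimal sketch:

```sql
-- SQL Server 2016+ shorthand for the check-then-drop pattern
DROP DATABASE IF EXISTS ComicBooks;
GO

-- The same shorthand works for tables
DROP TABLE IF EXISTS dbo.Comic;
GO
```

On older versions, stick with the explicit existence check shown below.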
IF (SELECT Name FROM sys.databases WHERE Name = 'ComicBooks') IS NOT NULL
    DROP DATABASE ComicBooks;
GO
Now it's time to create the database. I give everything user-friendly file names. This method helps when surfing File Explorer or glancing at the contents of sys.database_files. At a glance, we can see which files go with which database and whether the files are primary data (.MDF), secondary data (.NDF), or log (.LDF) files. SQL Server doesn't really care about the extensions. We could name a file ABCD.Myx if we wanted to. But if we change the extensions from .MDF, .NDF, or .LDF, we risk a lot of confusion when we forget the meaning of the non-standard extensions. Plus it adds to the support headaches when our successors come along to solve the problems we created.
CREATE DATABASE ComicBooks
ON PRIMARY
    (NAME = N'ComicBooks_Data',
     FILENAME = N'D:\ComicBooks_Data.MDF',
     SIZE = 10MB,
     MAXSIZE = UNLIMITED,
     FILEGROWTH = 100KB)
LOG ON
    (NAME = N'ComicBooks_Log',
     FILENAME = N'D:\ComicBooks_Log.LDF',
     SIZE = 5MB,
     MAXSIZE = UNLIMITED,
     FILEGROWTH = 100KB);
GO
Setting the Table(s)
After creating the database, we need to determine what we're going to populate it with. The obvious answer is "comic books." But let's think about this for a moment. I said earlier that there are three essential attributes (pieces of information) for each comic book:
- who publishes it
- what the title is
- what the issue number is
These are the attributes used by comic shops and collectors to identify and discuss comics. Unfortunately, those attributes aren't unique enough to be a true key. Comic series get canceled and later restarted all the time (DC Comics' Wonder Woman series being a good example). Because of this, we're going to cheap out and create a surrogate key with an IDENTITY field. It's fast, it's easy, and I don't have to think too hard about it.
Why yes, Virginia. Developers and DBAs do actually create databases thinking "what is easy and fast?". Is that a good thing? It depends.
In addition to knowing what columns to use, we should also know what data types to use. I refer to BOL's handy-dandy data types article for thoughts on data types. I don't need Unicode for this project, because my comic book collection is mostly English publications. Because we don't want to over-inflate our database files, we'll apply the "smallest data type possible" principle to our table designs. This means adding up the data type sizes as we design the table.
NOTE: There is a lot of available literature about pages, extents, and the effects they have on database design. Because of this, I won't go into the issue here. Suffice it to say we want to keep our row width below 8,060 bytes when possible.
There are two types of tables: wide and narrow. A wide table has many columns; a narrow table has only a few. Tables can also be long (a lot of rows) or short (very few rows). So a short, narrow table has few columns and few rows, while a wide, long table has a lot of both. A narrow table can also be long, and a wide table can also be short. The best performing tables tend to be narrow or short, and narrow and short together are the best to read from. But in a real-world database, it is not always possible to have the best performing tables. Just remember to do what you can while serving the needs of the business.
Size Does Matter
The one performance fear every DBA has is the dreaded page split. Page splits happen when a row no longer fits on its 8 KB page and SQL Server has to move part of the page's data elsewhere to make room. The consequence of split pages is a performance hit. No one ever notices the issue on short tables, but everyone notices on wide, long tables, or when too many copies of the same query hit even a short table. The system slows down, the customers complain, the boss starts yelling, and someone is suddenly out of a job. This is why keeping data types and their sizes in mind, even for a small project like this, is essential. Small projects have a tendency to become huge projects when the business adds scope changes.
Let's start with our surrogate key. Some people use UNIQUEIDENTIFIERs (aka GUIDs) as their surrogate keys. That's 16 bytes for a unique identifier, where an INT is only 4 bytes. For a small data set, 12 bytes doesn't make much of a difference. But my table will probably be long, even if it's going to be narrow. Twelve bytes per row, for thousands of rows, adds up fast. So we'll opt for the regular INT data type for our IDENTITY column. Even if we have to upsize to BIGINT later, we'll still save 8 bytes per row over a GUID. Plus an INT is easier to search and match on than a GUID (in my opinion).
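If you want to see those byte counts for yourself rather than take my word for it, DATALENGTH reports the storage size of a value. A quick sketch comparing the candidate key types:

```sql
-- Compare storage sizes of candidate surrogate key types
SELECT
    DATALENGTH(NEWID())           AS GuidBytes,    -- 16
    DATALENGTH(CAST(1 AS INT))    AS IntBytes,     -- 4
    DATALENGTH(CAST(1 AS BIGINT)) AS BigIntBytes;  -- 8
```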
For the publisher and title columns, we'll use VARCHAR data types. Variable-length data types add two extra bytes of overhead per column so SQL Server can expand for bigger values when needed. This is not always a good thing; lots of overhead means performance issues later on. We could use CHAR, which doesn't have the overhead. But CHAR pads values with extra spaces on the right. This padding can cause other issues, like when matching values to other tables or pulling values for reports. Copy-n-pasting CHAR data is a major pain, too.
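The padding behavior is easy to demonstrate. This sketch stores the same short string in a CHAR and a VARCHAR and compares the stored byte counts:

```sql
-- CHAR pads to its declared length; VARCHAR stores only what's needed
DECLARE @Fixed CHAR(20)    = 'Marvel';
DECLARE @Var   VARCHAR(20) = 'Marvel';

SELECT
    DATALENGTH(@Fixed) AS FixedBytes,  -- 20: padded with trailing spaces
    DATALENGTH(@Var)   AS VarBytes;    -- 6: just the characters
```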
Short version: I prefer dealing with the extra overhead to using TRIM functions. So VARCHAR it is.
The ideal situation is one where we know, to the letter, how long my longest comic book title is and how long the longest publisher name is. The problem is, I have way too many comics to remember them all. Given the length of time it would take to go through my entire collection just to locate those items, it's faster and cheaper (costing out our time) to W.A.G. (wild a** guess) our VARCHAR lengths. We can always upsize them later if we guess too short. We can safely assume publisher names are shorter than titles, so we'll pick a nice round 50 for the publisher name and 100 for the title.
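Once the data is loaded, we can sanity-check those guesses against reality. A quick sketch (this assumes the dbo.Comic table we create later in this article, with its VARCHAR(50) publisher and VARCHAR(100) title):

```sql
-- How close were our W.A.G. column lengths? Compare actual to declared.
SELECT
    MAX(LEN(Publisher)) AS LongestPublisher,  -- declared VARCHAR(50)
    MAX(LEN(Title))     AS LongestTitle       -- declared VARCHAR(100)
FROM dbo.Comic;
```

If the actual maximums sit far below the declared lengths, the guesses were generous enough; if they're bumping the ceiling, it's time to upsize.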
When it comes to the issue number, it would be easy just to stick with TINYINT. While there are plenty of zero issues for comics, there are never any negative numbers. This makes TINYINT a wonderful choice, as it has the smallest byte size of all the integer types. After thinking about my collection, though, I realize that TINYINT won't work because it only goes up to 255. One of my comics, Action Comics, started over with issue 1 a couple of years ago. The last run, however, ended at issue 904. That's too big for TINYINT. So our issue number column needs to be either SMALLINT or INT. Since SMALLINT's largest value is 32,767 and no comic I've ever seen has an issue number larger than three digits, we'll stick with SMALLINT. SMALLINT uses 2 fewer bytes than the INT data type.
Adding up our data type sizes, we have 4 bytes for our ID, 52 bytes for Publisher, 102 bytes for Title, and 2 bytes for Issue Number. That's a total of 160 bytes, well below the row-size limit of 8,060 bytes. That leaves plenty of expansion room if we need it later on.
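SQL Server can do this bookkeeping for us through the catalog views. A sketch to run once the table exists (note that max_length reports the declared byte size of each column; the 2 bytes of variable-length overhead per VARCHAR column are not included):

```sql
-- List the declared byte size of each column in dbo.Comic
SELECT
    c.name       AS ColumnName,
    t.name       AS TypeName,
    c.max_length AS MaxBytes
FROM sys.columns AS c
JOIN sys.types   AS t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.Comic');
```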
Our last consideration is whether we'll use the surrogate key (ComicID) as a search term. The answer? ComicID doesn't really mean anything beyond an attempt to uniquely identify each row, so probably not. That being the case, we'll create ComicID as a NONCLUSTERED PRIMARY KEY and save the clustered index for the meat of the table (publisher and title).
--Open our new database
USE ComicBooks;
GO

--Drop the table if it already exists
IF (SELECT Table_Name FROM INFORMATION_SCHEMA.Tables WHERE Table_Name = 'Comic') IS NOT NULL
    DROP TABLE dbo.Comic;

--Create the table
CREATE TABLE dbo.Comic
    (ComicID INT NOT NULL IDENTITY(1,1)
        CONSTRAINT pk_Comic_ComicID PRIMARY KEY NONCLUSTERED,
     Publisher VARCHAR(50) NOT NULL,
     Title VARCHAR(100) NOT NULL,
     IssueNo SMALLINT NOT NULL);

--Create the clustered index
CREATE CLUSTERED INDEX CIDX_Comic_PublisherTitle ON dbo.Comic (Publisher, Title);
And now we have our comic book table into which all our comic collections can go. WHOO HOO! Time to start typing up that very, very, very long list of comics. Hopefully we'll manage to do that without getting distracted by all the pretty colors and cool fight scenes.
Data In...
The insert at this point is relatively simple. Either a VALUES or a SELECT statement can be used. Going by the theory "the less typing the better," I choose to use a single INSERT statement with a VALUES list. That way I don't have to put each record in its own INSERT statement or add UNION / UNION ALL between SELECTs.
INSERT INTO dbo.Comic (Publisher, Title, IssueNo)
VALUES ('DC Comics', 'Prez', 1),
       ('DC Comics', 'Prez', 2),
       ('DC Comics', 'Prez', 3),
       ('DC Comics', 'Prez', 4),
       ('IDW', 'Doctor Who', 1),
       ('IDW', 'Doctor Who', 2),
       ('IDW', 'Transformers More than Meets the Eye', 1),
       ('IDW', 'Transformers More than Meets the Eye', 2),
       ('IDW', 'Doctor Who - The Four Doctors', 1),
       ('IDW', 'Doctor Who - The Four Doctors', 2),
       ('IDW', 'Doctor Who - The Four Doctors', 3),
       ('IDW', 'Doctor Who - The Four Doctors', 4),
       ('Marvel Comics', 'Starlord and Kitty Pryde', 1),
       ('Marvel Comics', 'Starlord and Kitty Pryde', 2),
       ('IDW', 'Transformers', 1),
       ('IDW', 'Transformers', 2),
       ('IDW', 'Doctor Who', 1),
       ('IDW', 'Doctor Who', 2),
       ('DC Comics / Marvel Comics', 'Uncanny X-Men and the New Teen Titans', 1),
       ('Vertigo', 'The Sandman', 1);

SELECT ComicID, Publisher, Title, IssueNo
FROM dbo.Comic
ORDER BY Publisher;
This is when I realize there's a problem. Not only do I have a title that was co-published by two different houses, but I have multiple titles for the same comic. IDW used to publish only one Doctor Who and one Transformers title. Now they've got four Doctor Who titles (each with a different Doctor) and scads of Transformers titles, each with a different subtitle. Plus, The Sandman was not published under DC's house brand, but under DC's offshoot imprint, Vertigo.
To fix this problem, we need a Subtitle column and an Imprint column. Since we might also need to search on IssueNo, we'll add a non-clustered index to the mix.
--The non-clustered index
CREATE NONCLUSTERED INDEX IDX_Comic_TitleIssue ON dbo.Comic (Title, IssueNo)
    INCLUDE (ComicID, Publisher);

--The new columns
ALTER TABLE dbo.Comic
    ADD Subtitle VARCHAR(50) NULL,
        Imprint VARCHAR(50) NULL;
GO

--The data updates
UPDATE dbo.Comic
SET Imprint = 'Vertigo'
WHERE Title = 'The Sandman';

UPDATE dbo.Comic
SET Subtitle = CASE Title
                   WHEN 'Doctor Who - The Four Doctors' THEN 'The Four Doctors'
                   ELSE 'More Than Meets the Eye'
               END,
    Title = CASE Title
                WHEN 'Doctor Who - The Four Doctors' THEN 'Doctor Who'
                ELSE 'Transformers'
            END
WHERE Title IN ('Doctor Who - The Four Doctors', 'Transformers More than Meets the Eye');

--The data check
SELECT ComicID, Publisher, Imprint, Title, Subtitle, IssueNo
FROM dbo.Comic
ORDER BY Publisher;
Cool! That's solved. Now all we have to do is...
Wait a second. I have multiple copies of Doctor Who 1 and 2. They don't really have subtitles, because at least one of those runs came before IDW branched out to cover the multiple Doctors. The clustered index, since it wasn't created as unique, also didn't prevent us from adding duplicate issues to the table. We could add a Character column to this table, but now we're running into Flat Table Syndrome. Most comics have more than one character in them, especially the crossover (guest star) issues. How many character columns do we want to add to this table? Wide, long tables are horrendous for performance. Maybe we'd better rethink this "designing the database on the fly" plan.
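Before redesigning anything, it's worth seeing exactly which rows collide. A minimal duplicate check against the table as it stands:

```sql
-- Find (Publisher, Title, IssueNo) combinations stored more than once
SELECT Publisher, Title, IssueNo, COUNT(*) AS Copies
FROM dbo.Comic
GROUP BY Publisher, Title, IssueNo
HAVING COUNT(*) > 1;
```

A unique clustered index would have rejected these rows at insert time, which is exactly the kind of decision a data model would have forced us to make up front.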
Conclusion
While actual database creation isn't that hard, there is value in taking our time to determine what we actually need. As we just discovered, I missed some items in my on-the-fly database design. If we'd modeled our data, we might have discovered the imprint and subtitles before we created the table. We'd have also discovered other comic book attributes that could be spun off into other tables.
So it's time to go back to the drawing board. If we're going to add a character column, we'd better first check what other comic book attributes we might need. Next article: The ABCs of Database Modeling.