/* Even the simplest of databases can illustrate some of the problems of database design. We are going to start with a simple refactoring, and lead on to show how it is possible to do guilt-free 'apparent denormalization' within a properly normalized design.
The Person.Person table in AdventureWorks shows a poorly-considered check constraint. As well as giving details of name, email promotion, and an unnormalized mess of 'AdditionalContactInfo' and 'demographics', It associates a businessEntityID surrogate key with a type of person. This is enforced by a constraint.
*/
ALTER TABLE [Person].[Person]
WITH CHECK ADD CONSTRAINT [CK_Person_PersonType] CHECK
(([PersonType] IS NULL
OR (upper([PersonType])='GC'
OR upper([PersonType])='SP'
OR upper([PersonType])='EM'
OR upper([PersonType])='IN'
OR upper([PersonType])='VC'
OR upper([PersonType])='SC')))
GO
/*
I've gone on record listing this type of constraint as a code smell.
'Using constraints to restrict values in a column'
You can use a constraint to restrict the values permitted in a column to one of an distinct ‘enumeration’ list of values, just like what has done here, but it is usually better to define the values in a separate ‘lookup’ table and enforce the data restrictions with a foreign key constraint. This makes it much easier to maintain and will also avoid a code change every time a new value is added to the permitted range, as would be the case with constraints. This domain table approach allows for extensibility of such structures quite easily as well. (Anyone on the lookout for code smells should also look at 'AdditionalContactInfo' which shows code smell No.1 'Packing lists, complex data, or other multivariate attributes into a table column').
We need to avoid any DDL changes so we won't trigger a database drift alert. We don’t want to trigger a new version number or, saints preserve us, a new build of the application. We don't to have to save the modified DDL code of the table into version control! We just don't want the hassle.
Actually, there are real benefits of doing it differently. We should, of course separate this out into a table, and then refer to it via a foreign key.
'Not another table, please' I hear you gasp, 'I'm cursed with ten joins just to get a customer record. Any more and there will be unrest'. The way we are going to do it will not require any joins. Once we've done it, I'll show you how to extend the technique so as to help you design better-normalised databases using natural keys, and that require fewer joins.
Let's do the simple refactoring. (use a VM or copy your Adventureworks2012 first)
*/
USE AdventureWorksClone --you'll have done a quick copy of your Adventureworks before executing this
CREATE TABLE Person.PersonType
(Type NCHAR(2) NOT NULL PRIMARY KEY
-- , ...and any details of a person type including descriptions!
);
INSERT INTO Person.PersonType (Type)
VALUES ('GC'),('SP'),('EM'),('IN'),('VC'),('SC'); --the currently defined codes
/*
Now we alter the Persontype column */
ALTER TABLE [Person].[Person] DROP CONSTRAINT [CK_Person_PersonType]
GO
ALTER TABLE Person.Person ADD CONSTRAINT
a_Valid_PersonType FOREIGN KEY
(
PersonType
) REFERENCES Person.PersonType
(
Type
) ON UPDATE CASCADE --here there be magic. I'll show you in this article
ON DELETE NO ACTION;
GO
--So let's check that this constraint is working
UPDATE Person.Person SET PersonType='ST' WHERE businessEntityID=3;
/*
Bang! It thankfully didn't let us insert an invalid code. The constraint worked fine and stopped any bad data getting in.
Msg 547, Level 16, State 0, Line 1
The UPDATE statement conflicted with the FOREIGN KEY constraint "a_Valid_PersonType" (etc)
The statement has been terminated.
Please note that I've given the constraint a name that is meaningful in an error message. The reason is that it appears in error messages and can give anyone a clue as to why there was an error The name of a constraint is important and should never be wasted with silly affectations such as reverse hungarian notation or random numbers. However, you can only use the name once within a schema so you will need a system on larger databases to make them unique.
What have we gained?
Firstly, If we want to allow another PersonType, we just add it in.*/
INSERT INTO PersonType (Type) SELECT 'ST';
--Now, try the same insertion
UPDATE Person.Person SET PersonType='ST' WHERE businessEntityID=3;
--and this time, it allows it. Look no DDL changes! No version change! No source control check-in required!
--Secondly, as well as adding codes, we can change our mind about the codes we've used. We could of course delete people who have a particular code by deleting the code in the PersonType table but that would be a bit drastic (we would just alter the ON DELETE NO ACTION to ON DELETE CASCADE!) Here is 'before'.
SELECT persontype, BusinessEntityID FROM Person.Person WHERE BusinessEntityID IN (1,274,291,1491,1699,2091);
/*
persontype BusinessEntityID
---------- ----------------
EM 1
SP 274
SC 291
VC 1491
IN 1699
GC 2091 */
---now we alter the codes
UPDATE Persontype
SET type='AR' WHERE type='EM';
UPDATE Persontype
SET type='BP' WHERE type='SP';
UPDATE Persontype
SET type='CA' WHERE type='SC';
--and those codes are changed in the Person.Person table
SELECT persontype, BusinessEntityID FROM Person.Person WHERE BusinessEntityID IN (1,274,291,1491,1699,2091);
/*
persontype BusinessEntityID
---------- ----------------
AR 1
BP 274
CA 291
VC 1491
IN 1699
GC 2091 */
/*This type of cascading change is fine, but bear in mind that in a large database, a lot of work is being done under the covers. There is a hidden cost: However, we are dealing here with data that is relatively static rather than ephemeral, so can cope with this cost but must remember that it happens.
I have no idea what the personType in Adventureworks was supposed to do, since I can't see that it is referenced anywhere - Perhaps in the imaginary front-end only. Hmm. could there be another code-smell there, a column that isn't ever used within the database? */
/* This table that we have created is odd because it is only referenced via a foreign key constraint, and there is no obvious need I can see to reference it in a join, unless there were other columns in the table that occasionally needed referencing. Like many enumerations, it would be referenced by a front-end to populate a listbox or combo.
It leads on to a wider use for a 'natural' key. In a well-normalised table, you can easily have pretty narrow tables that never need to be included in joins, yet do their work of keeping data integrity nonetheless */
-- To demonstrate, we need some data, and what better than to start a Shoe website.
--let's define a simple set of colours.
USE ObedienceTraining --we create a new database to try out the next code
--start with a conditional teardown so we can do it over an over
IF EXISTS (SELECT 1 FROM information_schema.Tables
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'CurrentPrice' ) DROP TABLE dbo.CurrentPrice;
IF EXISTS (SELECT 1 FROM information_schema.Tables
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'product' ) DROP TABLE dbo.product;
IF EXISTS (SELECT 1 FROM information_schema.Tables
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'productCategory' ) DROP TABLE dbo.productCategory;
IF EXISTS ( SELECT 1 FROM information_schema.Tables
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'colour' ) DROP TABLE dbo.colour;
IF EXISTS (SELECT 1 FROM information_schema.Tables
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'size' ) DROP TABLE dbo.size;
CREATE TABLE dbo.colour(colour VARCHAR(30) NOT NULL CONSTRAINT one_of_each_colour PRIMARY KEY);
INSERT INTO dbo.colour(colour) VALUES ('White'),('Silver'),('Gray'),
('Black'),('Red'),('Maroon'),('Yellow'),('Olive'),('Lime'),
('Green'),('Aqua'),('Teal'),('Blue'),('Navy'),('Fuchsia'),('Purple');
--let's also define a simple set of size-descriptions. Maybe we need ordinal information.
CREATE TABLE dbo.size(size VARCHAR(30) CONSTRAINT one_of_each_size PRIMARY KEY, TheOrder INT);
INSERT INTO dbo.size(size, TheOrder) VALUES ('minute',1),('small',2),('medium',3),('large',4),('huge',5);
/* just so we've got some data to play with, we'll have some product categories, Shoes in this case */
CREATE TABLE dbo.productCategory (
name VARCHAR(30) CONSTRAINT one_of_each_productcategory PRIMARY KEY
);
INSERT INTO dbo.productCategory (name)
VALUES ('ankle boot'),
('army boots'),('gym shoes'),('babooshes'),('ballet shoes'),('bar shoes'),('basketball shoes'),('beach shoes'),
('boat shoes'),('boots'),('bowling shoes'),('brogues'),('calcagnetti'),('Cambridge loafers'),('chopines'),
('ciabattines'),('cleats'),('climbing shoes'),('clogs'),('club shoes'),('court shoes'),('cowboy boots'),
('cycling shoes'),('deck shoes'),('dress shoes'),('elevator shoes'),('espadrilles'),('figure skates'),
('flip-flops'),('footgear'),('galoshes'),('getas'),('golf shoes'),('gumboots'),('heels'),('high heels'),
('hightop sneakers'),('hightops'),('hiking boots'),('huaraches'),('ice skates'),('inline skates'),('jackboots'),
('jump boots'),('kabkabs'),('kamiks'),('loafers'),('lotus slippers'),('louis heels'),('Mary Janes'),('moccasins'),
('monk shoes'),('mukluks'),('mules'),('open-toes shoes'),('Oxfords'),('pantofoles'),('penny loafers'),('platform shoes'),
('pointe shoes'),('poulaines'),('pumps'),('rainboots'),('riding boots'),('rollerblades'),('rollerskates'),
('running shoes'),('saddle shoes'),('sandals'),('shoes'),('skates'),('skate shoes'),('ski boots'),('slides'),
('sling-backs'),('slippers'),('sneakers'),('solees'),('steel-toe boots'),('stiletto heels'),('swim fins'),
('talarias'),('tap shoes'),('tapins'),('tennis shoes'),('thongs'),('toe shoes'),('track shoes'),('valenki'),
('veldtschoens'),('waders'),('wedge shoes'),('Wellington boots'),('wingtip shoes'),('zories')
/* Now we'll have a product table that lists lots of different product lines for our shoe shop. Yes, it is actually an sku of a product but we are now in database design fairyland, so we can pretend… */
CREATE TABLE dbo.product (
name VARCHAR(30) NOT NULL CONSTRAINT valid_Category REFERENCES productCategory ON UPDATE CASCADE,
colour VARCHAR(30) NOT NULL DEFAULT 'Black' CONSTRAINT valid_Colour REFERENCES colour ON UPDATE CASCADE,
size VARCHAR(30) NOT NULL DEFAULT 'Medium' CONSTRAINT valid_Size REFERENCES size ON UPDATE CASCADE,
CONSTRAINT unique_Product PRIMARY KEY (name, colour, size)
);
/*You’ll notice that we’ve deliberately chosen a wide primary key instead of going for a surrogate. There is a reason for this, which is far better to demonstrate than to explain. We’re soon going to demonstrate why we’ve taken this rather strange decision.
Firstly, we need to stock this with our possible range*/
INSERT INTO dbo.product (name,colour,size)
SELECT name, colour,size FROM
size
CROSS JOIN dbo.colour
CROSS JOIN dbo.productCategory;
/* Now if we are right so far ...
we can delete products from the list we don't stock */
DELETE FROM dbo.product WHERE name LIKE 'tapins'; --don't stock them
--but we can't insert a product that isn't on our list
INSERT INTO dbo.product (name,colour,size)
VALUES ('bedroom slippers','black','Medium');
/*
Msg 547, Level 16, State 0, Line 3
The INSERT statement conflicted with the FOREIGN KEY constraint "valid_Category" table "dbo.productCategory", column 'name'.
*/
--we can insert a valid product once
INSERT INTO dbo.product (name,colour,size)
VALUES ('tapins','black','Medium');
---but not twice!
INSERT INTO dbo.product (name,colour,size)
VALUES ('tapins','black','Medium');
/*
Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'unique_Product'. Cannot insert duplicate key in object 'dbo.product'.The duplicate key value is (tapins, black, Medium).
The statement has been terminated.
*/
/* what if we want to change the term 'minute' to 'petite'? */
UPDATE dbo.size SET size='petite' WHERE size='minute';
SELECT TOP 5 * FROM dbo.product WHERE name LIKE 'Zories' AND size='petite';
/*
name colour size
------------------------------ ------------------------------ ---------
zories Aqua petite
zories Black petite
zories Blue petite
zories Fuchsia petite
zories Gray petite
*/
/*
Well we never touched the product table to do this. we altered a single value in the ‘size’ table.
Hang on, what about if we alter the product category table? Zories are actually called 'Beach Sandals'.
*/
UPDATE dbo.ProductCategory SET name='Beach Sandals' WHERE name='Zories';
SELECT TOP 5 * FROM dbo.product WHERE name LIKE 'Beach Sandals' AND size='petite';
/*
name colour size
------------------------------ ------------------------------ ------------------------------
Beach Sandals Aqua petite
Beach Sandals Black petite
Beach Sandals Blue petite
Beach Sandals Fuchsia petite
Beach Sandals Gray petite
*/
/*
So what's happening here? The cascading updates we've put in are keeping everything shipshape. We are using 'natural keys' here so we have what is effectively denormalization without any guilt.
You might need to think a bit about how to extend this to a multilingual approach but you lose some of the denormalization*/
/* As a finale, We'll extend the model one last time to show you that you can extend this throughout the database. (note that this isn’t cost-free, since data is being changed automatically under the covers, but they are still happening. If you were going to use a system like this in production on a large table, you would need to do it in the maintenance window. We are, after all, talking of a change that isn’t the sort of thing you do every day.
Even if you can’t use this rather cool feature, you can still us a domain table because you can always disable the constraint, alter the columns conventionally, and then add it back, rather than cascade */
CREATE TABLE CurrentPrice (
name VARCHAR(30) NOT NULL,
colour VARCHAR(30) NOT NULL,
size VARCHAR(30) NOT NULL,
Price NUMERIC(6,2) NOT NULL CHECK (Price BETWEEN 0 AND 200.00),
Discount INTEGER NOT NULL DEFAULT 0 CHECK (Discount BETWEEN 0 AND 100),
CONSTRAINT ValidProduct FOREIGN KEY (name, colour, size) REFERENCES product(name, colour, size) ON UPDATE CASCADE,
CONSTRAINT OnlyOnePrice PRIMARY KEY CLUSTERED (name, colour, size)
);
--and now we put in test prices and discounts for every product sku
INSERT INTO dbo.currentPrice (name,colour,size,Price,Discount)
SELECT name,colour,size, convert(NUMERIC(6,2),rand(checksum(newid()))*125), CASE WHEN rand(checksum(newid()))*10>5 THEN 20 ELSE 0 END
FROM dbo.Product;
--Oh dear, management has spoken. we want the english spelling of the colour grey.
UPDATE dbo.colour SET colour='grey' WHERE colour='gray';
--(1 row(s) affected)
/*
Well, actually, the change has cascaded down to both the product table and the currentPrice table
*/
SELECT Colour, count(*)AS The_count FROM dbo.CurrentPrice WHERE colour LIKE 'gr_y' GROUP BY colour;
/*
Colour The_count
------------------------------ -----------
grey 470
*/
SELECT Colour, count(*) AS The_count FROM dbo.product WHERE colour LIKE 'gr_y' GROUP BY colour;
/*
Colour The_count
------------------------------ -----------
grey 470
*/
/*
...and it is going to cascade to any table that references the product table so we have achieved with a properly normalized schema an apparent denormalization that would get the goons of the Relational Protection Agency reaching for their Relvars. It isn’t actually denormalization at all. Every column in a normalized table must be part of a key, or specifically referencing a key. So in our example, while a product price doesn’t have a color, it takes color to identify a productPrice, because that is made up of product and price. OK, it is going to take more space but that isn't the problem we suffer from so much nowadays. If you're not sure that we have reduced the complexity of accessing this database hugely, why not try replacing all my trick code with the current habit of using identity fields everywhere and joining all the tables with explicit joins. Oh, and don't forget to keep all those constraints in there.
*/