I guess more or less every reader has heard of denormalization and how it is
used to improve performance. It is said that a query that joins several tables
will perform bad, and that to remedy this you should denormalize your database
design. This 'advice' is however wrong, and it is wrong in several ways. In this
article I will show you why it is wrong and what you can do instead of following
this advice.
What is denormalization?
The fact that there is no single and clear answer to this question should in
itself be a warning flag that this is something you should be wary of. Compared
to normalization, there is no definition of what a denormalized table is. Note
that I don't use the term relation, because an R-table, i.e. a relation in table
form as stated in the relational model, is in 1NF by definition and therefore
always normalized. Instead, what most proponents of denormalization mean when
they say that you should denormalize a database design is that you should 'lower
the level of normalization', i.e. not have your tables fully normalized. In
other words, you should have more than one entity in a table. The reason for
doing this, they say, is that queries will otherwise have to join several tables
to return complete information, and this is supposedly bad for performance. Some
like to do this when the performance of some specific query or operation is not
as good as they want it, but I have even seen projects where there was an entire
phase in the project cycle dedicated to denormalizing the system!
Lets view a simple example of denormalization. Say we have an application that
use the pubs database and execute queries like the one below:
SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name FROM dbo.titles t INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id
This example use only a single join and might therefore not be one that would be
targeted for denormalization, but it will do as an example. Besides, just as
there is no definition for denormalization, there is no answer to the question
"how many joins are too many?". In order to avoid the join between the tables
titles and publishers, a new table, tp, is created, by running the following
statement:
SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name INTO dbo.tp FROM dbo.titles t INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id
You can now retrieve the same result as the initial query does by running a
SELECT-query without the join(s), just using the new table tp. But, as you
hopefully will see, this also means that you have created a situation where you
have redundant data that exists in both the original tables titles and
publishers, as well as in the denormalized table tp. Sure, one could argue that
you don't need to keep the original tables (at least not if tp would contain all
columns in both titles and publishers), but that still means you have redundant
data in the table tp. That is what normalization does, it removes these
redundancies by making sure you have only one entity in each table. In either
case, you have just made the process of keeping the integrity of your data much
more complicated than it is in a fully normalized design, and that is the most
important thing that proponents of denormalization forget about. The following
quote is from an article called
"The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1"
by Fabian Pascal:
"The only reason you may sometimes achieve better performance when you
denormalize your database is because you ignore the integrity implications.
If and when those are taken into account, you will lose the performance gain, if
any, from denormalization and may even end up with a net performance loss
relative to the fully normalized alternative."
Even though the integrity implications are the main reason you should never
denormalize, this article will not discuss those issues because Pascal and
others already does this in the above mentioned article and elsewhere (see
references below). I should however emphasize the parts "may sometimes achieve
better performance" and "performance gain, if any". Because a database by
definition should not be biased towards a specific application but should
instead be designed to perform well for all applications, denormalizing a table
is not even always a good thing for performance. Even if one application might
benefit (performance wise) by the denormalization, others, including future
applications that have not even been conceived yet, will not benefit. Note that
I still say "might benefit", even when talking about the specific application
that is the reason you want to denormalize a table, because other factors (like
extra I/O) might negate the performance gain proponents expect from removing a
join.
To complete this little debunk of denormalization I must also state that it
should not be possible to improve performance by denormalizing a database
design, since normalization (and denormalization, whatever the term actually
means) is a purely logical operation, while it is the physical layer that
determines performance. The reason you might sometimes achieve better
performance from a logical design choice (as denormalization) is that the DBMS
we use does not adequately separate the logical layer from the physical layer,
and because of this a logical design choice might still affect the physical
layer in a way that could sometimes give you better performance.
Indexed
views in SQL Server 2000
What the DBMS should be doing is completely separating the logical and physical
layers, thereby opening up an infinite number of possibilities to do physical
performance enhancing work 'under the hood'. This way you as a database designer
would only need to consider the logical design and thereby help the DBMS control
the integrity of the database, and your logical design decisions would not
affect the performance of the database. The DBMS could handle logical joins with
any physical structures it wants, as long as it makes sure the integrity is
maintained the way the logical design specifies it. SQL Server actually has a
way for us, the DBAs, to 'help' it do this for us, more or less (see summary as
well). Using indexed views we can keep our fully normalized logical design with
all the constraints intact, but still have a physical structure that lets us
execute our logical join without reducing performance.
An indexed view is created like a standard SQL Server view, i.e. a query
definition that is stored and then materialized when you use the view as a table
source in a query, by using the CREATE VIEW ddl statement. You then create a
unique clustered index on this view. Because the leaf level pages of clustered
indexes contain the actual data rows (see my earlier articles about clustered
indexes), this will physically store the result set of the query that the view
is defined by on disk in the same way as a clustered index on a table is stored.
There are several conditions that a view must meet to allow you to create a
clustered index on it. Some of them are SET options that must be set in a
specific way, others are rules for what the view definition can contain, but I
will leave it to you the reader to check for yourself in Books Online exactly
what they are.
Now you might be thinking that this sounds like the 'solution' I described
above, with the denormalized table tp. But even though the data is duplicated,
there are almost no similarities between these solutions. Especially, there are
two important differences:
- First of all, SQL Server will automatically transfer modifications made
to the data in the base tables to the indexed view, the same way it does for
an index created on a base table. This means that you don't have to create
any kind of complex constraints to maintain the integrity of the data as you
would need to do for a denormalized table like tp.
- The second reason is that SQL Server will automatically use the indexed
view even in queries that don't reference it, if it decides that the query
will benefit from it. This means that you don't need to rewrite existing
applications and queries to benefit from the indexed view, as you would need
to do with the denormalized table tp.
The version of Books Online that comes with SQL Server (no service pack)
incorrectly state that "You can create indexed views only if you install
Microsoft SQL Server 2000 Enterprise Edition or Microsoft SQL Server 2000
Developer Edition.", but this is a documentation error. What it should say is
that only SQL Server 2000 Enterprise Edition (and Developer Ed) will use the
indexed view as a stored table. Other editions of SQL Server will expand the
view just as it does for a standard view, even if it is directly referenced in
the query, unless you specify the query hint NOEXPAND. The updated version of
BOL (SP3) says exactly this "Indexed views can be created in any edition of SQL
Server 2000. In SQL Server 2000 Enterprise Edition, the query optimizer will
automatically consider the indexed view. To use an indexed view in all other
editions, the NOEXPAND hint must be used."
Testing indexed views
The small script below show you how to use indexed views instead of
denormalization. You must of course be using Enterprise Edition or Developer
Edition to run it.
USE pubs GO SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name FROM dbo.titles t INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id GO IF EXISTS(SELECT * FROM sysobjects WHERE name = 'v_tp') DROP VIEW v_tp GO CREATE VIEW dbo.v_tp WITH SCHEMABINDING AS SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name FROM dbo.titles t INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id GO CREATE UNIQUE CLUSTERED INDEX ixcv_tp ON dbo.v_tp (title_id, pub_id) GO -- Note in execution plan that even though the query doesn't specify -- the view, SQL Server will automatically use it instead of the base -- tables that are specified in the query. SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name FROM dbo.titles t INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id GO
Summary
Now that you've seen how you can use indexed views instead of denormalizing your
tables I hope I won't need to hear anyone proposing denormalization as an advice
to improve performance. Well, seriously, I never accept anyone advising a user
to denormalize because of the integrity implications, but I hope this article
might help in some situations. However, you should still remember that an
indexed view is just as any other index, it will have a negative effect on
modifications and you must of course consider this to the improvement you will
get for some queries. As you might notice, this is not an optimal solution as it
can still bias the database towards some applications. It also mix the logical
and physical layers, but this is unavoidable as an indexed view 'by definition'
is a mix of logical and physical.
References
SQL Server Books Online - Creating an Indexed View
The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1
The Dangerous Illusion: Denormalization, Performance and Integrity, Part 2
On normalization, Performance and Integrity
On The Relational Model and Physical Implementations
I guess more or less every reader has heard of denormalization and how it is
used to improve performance. It is said that a query that joins several tables
will perform bad, and that to remedy this you should denormalize your database
design. This 'advice' is however wrong, and it is wrong in several ways. In this
article I will show you why it is wrong and what you can do instead of following
this advice.