SQLServerCentral Article

What is Denormalization?

,

I guess more or less every reader has heard of denormalization and how it is

used to improve performance. It is said that a query that joins several tables

will perform bad, and that to remedy this you should denormalize your database

design. This 'advice' is however wrong, and it is wrong in several ways. In this

article I will show you why it is wrong and what you can do instead of following

this advice.

What is denormalization?

The fact that there is no single and clear answer to this question should in

itself be a warning flag that this is something you should be wary of. Compared

to normalization, there is no definition of what a denormalized table is. Note

that I don't use the term relation, because an R-table, i.e. a relation in table

form as stated in the relational model, is in 1NF by definition and therefore

always normalized. Instead, what most proponents of denormalization mean when

they say that you should denormalize a database design is that you should 'lower

the level of normalization', i.e. not have your tables fully normalized. In

other words, you should have more than one entity in a table. The reason for

doing this, they say, is that queries will otherwise have to join several tables

to return complete information, and this is supposedly bad for performance. Some

like to do this when the performance of some specific query or operation is not

as good as they want it, but I have even seen projects where there was an entire

phase in the project cycle dedicated to denormalizing the system!

 

Lets view a simple example of denormalization. Say we have an application that

use the pubs database and execute queries like the one below:

SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name
FROM dbo.titles t
INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id

This example use only a single join and might therefore not be one that would be

targeted for denormalization, but it will do as an example. Besides, just as

there is no definition for denormalization, there is no answer to the question

"how many joins are too many?". In order to avoid the join between the tables

titles and publishers, a new table, tp, is created, by running the following

statement:

SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name
INTO dbo.tp
FROM dbo.titles t
INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id

You can now retrieve the same result as the initial query does by running a

SELECT-query without the join(s), just using the new table tp. But, as you

hopefully will see, this also means that you have created a situation where you

have redundant data that exists in both the original tables titles and

publishers, as well as in the denormalized table tp. Sure, one could argue that

you don't need to keep the original tables (at least not if tp would contain all

columns in both titles and publishers), but that still means you have redundant

data in the table tp. That is what normalization does, it removes these

redundancies by making sure you have only one entity in each table. In either

case, you have just made the process of keeping the integrity of your data much

more complicated than it is in a fully normalized design, and that is the most

important thing that proponents of denormalization forget about. The following

quote is from an article called

"The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1"

by Fabian Pascal:

 

"The only reason you may sometimes achieve better performance when you

denormalize your database is because you ignore the integrity implications.

If and when those are taken into account, you will lose the performance gain, if

any, from denormalization and may even end up with a net performance loss

relative to the fully normalized alternative."

 

Even though the integrity implications are the main reason you should never

denormalize, this article will not discuss those issues because Pascal and

others already does this in the above mentioned article and elsewhere (see

references below). I should however emphasize the parts "may sometimes achieve

better performance" and "performance gain, if any". Because a database by

definition should not be biased towards a specific application but should

instead be designed to perform well for all applications, denormalizing a table

is not even always a good thing for performance. Even if one application might

benefit (performance wise) by the denormalization, others, including future

applications that have not even been conceived yet, will not benefit. Note that

I still say "might benefit", even when talking about the specific application

that is the reason you want to denormalize a table, because other factors (like

extra I/O) might negate the performance gain proponents expect from removing a

join.

 

To complete this little debunk of denormalization I must also state that it

should not be possible to improve performance by denormalizing a database

design, since normalization (and denormalization, whatever the term actually

means) is a purely logical operation, while it is the physical layer that

determines performance. The reason you might sometimes achieve better

performance from a logical design choice (as denormalization) is that the DBMS

we use does not adequately separate the logical layer from the physical layer,

and because of this a logical design choice might still affect the physical

layer in a way that could sometimes give you better performance.

Indexed

views in SQL Server 2000

What the DBMS should be doing is completely separating the logical and physical

layers, thereby opening up an infinite number of possibilities to do physical

performance enhancing work 'under the hood'. This way you as a database designer

would only need to consider the logical design and thereby help the DBMS control

the integrity of the database, and your logical design decisions would not

affect the performance of the database. The DBMS could handle logical joins with

any physical structures it wants, as long as it makes sure the integrity is

maintained the way the logical design specifies it. SQL Server actually has a

way for us, the DBAs, to 'help' it do this for us, more or less (see summary as

well). Using indexed views we can keep our fully normalized logical design with

all the constraints intact, but still have a physical structure that lets us

execute our logical join without reducing performance.

 

An indexed view is created like a standard SQL Server view, i.e. a query

definition that is stored and then materialized when you use the view as a table

source in a query, by using the CREATE VIEW ddl statement. You then create a

unique clustered index on this view. Because the leaf level pages of clustered

indexes contain the actual data rows (see my earlier articles about clustered

indexes), this will physically store the result set of the query that the view

is defined by on disk in the same way as a clustered index on a table is stored.

There are several conditions that a view must meet to allow you to create a

clustered index on it. Some of them are SET options that must be set in a

specific way, others are rules for what the view definition can contain, but I

will leave it to you the reader to check for yourself in Books Online exactly

what they are.

 

Now you might be thinking that this sounds like the 'solution' I described

above, with the denormalized table tp. But even though the data is duplicated,

there are almost no similarities between these solutions. Especially, there are

two important differences:

  • First of all, SQL Server will automatically transfer modifications made

    to the data in the base tables to the indexed view, the same way it does for

    an index created on a base table. This means that you don't have to create

    any kind of complex constraints to maintain the integrity of the data as you

    would need to do for a denormalized table like tp.

  • The second reason is that SQL Server will automatically use the indexed

    view even in queries that don't reference it, if it decides that the query

    will benefit from it. This means that you don't need to rewrite existing

    applications and queries to benefit from the indexed view, as you would need

    to do with the denormalized table tp.

The version of Books Online that comes with SQL Server (no service pack)

incorrectly state that "You can create indexed views only if you install

Microsoft SQL Server 2000 Enterprise Edition or Microsoft SQL Server 2000

Developer Edition.", but this is a documentation error. What it should say is

that only SQL Server 2000 Enterprise Edition (and Developer Ed) will use the

indexed view as a stored table. Other editions of SQL Server will expand the

view just as it does for a standard view, even if it is directly referenced in

the query, unless you specify the query hint NOEXPAND. The updated version of

BOL (SP3) says exactly this "Indexed views can be created in any edition of SQL

Server 2000. In SQL Server 2000 Enterprise Edition, the query optimizer will

automatically consider the indexed view. To use an indexed view in all other

editions, the NOEXPAND hint must be used."

Testing indexed views

The small script below show you how to use indexed views instead of

denormalization. You must of course be using Enterprise Edition or Developer

Edition to run it.

USE pubs
GO
SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name
FROM dbo.titles t
INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id
GO
IF EXISTS(SELECT * FROM sysobjects WHERE name = 'v_tp')
  DROP VIEW v_tp
GO
CREATE VIEW dbo.v_tp
WITH SCHEMABINDING
AS
SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name
FROM dbo.titles t
INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id
GO
CREATE UNIQUE CLUSTERED INDEX ixcv_tp ON dbo.v_tp (title_id, pub_id)
GO
-- Note in execution plan that even though the query doesn't specify
-- the view, SQL Server will automatically use it instead of the base
-- tables that are specified in the query.
SELECT t.title_id, t.title, t.price, p.pub_id, p.pub_name
FROM dbo.titles t
INNER JOIN dbo.publishers p ON t.pub_id = p.pub_id
GO

Summary

Now that you've seen how you can use indexed views instead of denormalizing your

tables I hope I won't need to hear anyone proposing denormalization as an advice

to improve performance. Well, seriously, I never accept anyone advising a user

to denormalize because of the integrity implications, but I hope this article

might help in some situations. However, you should still remember that an

indexed view is just as any other index, it will have a negative effect on

modifications and you must of course consider this to the improvement you will

get for some queries. As you might notice, this is not an optimal solution as it

can still bias the database towards some applications. It also mix the logical

and physical layers, but this is unavoidable as an indexed view 'by definition'

is a mix of logical and physical.

References

SQL Server Books Online - Creating an Indexed View

The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1

The Dangerous Illusion: Denormalization, Performance and Integrity, Part 2

On normalization, Performance and Integrity

On The Relational Model and Physical Implementations

I guess more or less every reader has heard of denormalization and how it is

used to improve performance. It is said that a query that joins several tables

will perform bad, and that to remedy this you should denormalize your database

design. This 'advice' is however wrong, and it is wrong in several ways. In this

article I will show you why it is wrong and what you can do instead of following

this advice.

Rate

4.5 (2)

You rated this post out of 5. Change rating

Share

Share

Rate

4.5 (2)

You rated this post out of 5. Change rating