To those who don’t know me, “Natural Born Killers (The Series)” sounds like a very strange choice for a SQL Server blog title, and to be fair it is.
Those who do know me will know that I have been presenting a session around the UK this year titled “Natural Born Killers, performance issues to avoid”. The “Series” element of the title is there because I have decided to serialise the session for the benefit of those who have been unable to see it so far.
The plan is to release one topic from the presentation each day of SQL Relay 2013. Today is the second day of SQL Relay held in Leeds. If you have registered for the Birmingham, Norwich or Cardiff events you will be able to see this presentation live, so you might want to read this after the event as a reminder of everything that was covered.
Failing to prepare is preparing to fail
If you were to think of creating a database solution as a book, then database design, and specifically table design, has to be the prologue. You know you should read it, but many people just skip it and crack on with the rest of the book.
The problem with not reading the prologue is that you could find yourself in a horror story, because let’s not forget – you’re not supposed to judge a book by its cover.
The horror element of table design is that most people don’t think about the overhead of using the wrong data types, or aren’t even aware of what data types are available to them and the intricacies of those choices.
A developer may think that the only overhead is one of storage. We’re often told that storage is cheap, so that’s OK, isn’t it? Well, the main point of raising this topic is to tell you that it’s not!
The reason for my stance is quite simple: data that ends up on disk had, at some point, to reside in memory. The more data that has to be written to disk, the longer the process will take, and this process is overseen by worker threads which are managed by the processor. If you’re using SQL Server 2012, your processor time is now a lot more expensive than it was in previous versions due to the change to per-core licensing. Just by using better table design you can increase the efficiency of your processors, and the price of adding an extra core pack is pretty steep now. Make your FD’s day by cranking up the efficiency of what you already have.
What’s more, if this data is used frequently it will sit in the buffer cache and be returned to one or many clients, which means extra data being sent across the network. Of course, if you have slow clients it’s going to take them longer to retrieve and display this data. If you can reduce this overhead you’ll have a lot more happy end users.
The larger your base data, the larger your indices will be too. This means more space required to rebuild them and a larger log file, which in turn translates to larger log backups and larger full and differential backups. That means even more disk space and possibly more tapes, which in turn could lead to either larger fireproof safes on site or an extra cost for offsite storage.
The impact on SLAs
If that sounds bad, let’s now talk about your SLAs:
If you have a shop that uses Database Mirroring in high safety mode, synchronous AlwaysOn availability group replicas or replication, you’ll be keen to keep latency down to a minimum. Just think of the performance penalty you are going to pay hardening the extra data to disk in the two-phase commit process of high safety Database Mirroring or a synchronous availability group replica. You could end up with unintentionally long response times for your users.
Larger log files can also mean a much longer time to recover during the crash recovery process. Do yourself a favour and speak to your developers about keeping their transactions as short as possible, or crash recovery (and I’m including cluster failovers here) will take a lot longer.
As an example, let’s take the common decision of using the varchar data type instead of char. As I’m sure you’re aware, varchar removes the need to pad the field out with the trailing spaces that the char data type requires when the value does not fill the declared length.
As I mentioned before, this means that you could potentially save a lot of space and therefore valuable resources. There’s a famous quote penned by Stan Lee in Spider-Man: “With great power comes great responsibility”.
You do need to be responsible when using varchar, as there is an overhead to doing so: specifically, 2 bytes per variable-length attribute in the record.
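As a quick illustration (this isn’t part of the demo scripts at the end of the post), you can see the padding behaviour with DATALENGTH; the 2 bytes of variable-length overhead are a cost of the row format itself, so they won’t show up here:

DECLARE @FixedLength    char(10)    = 'abc',
        @VariableLength varchar(10) = 'abc';

SELECT DATALENGTH(@FixedLength)    AS CharBytes,     -- 10, padded out with trailing spaces
       DATALENGTH(@VariableLength) AS VarcharBytes;  -- 3, only what was actually supplied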
In my time as a DBA I have seen such strange design choices as a VARCHAR(1). That choice actually resulted in the attribute having a real length of three bytes, making it three times LESS efficient than a char(1). What really annoyed me about this attribute wasn’t that it should have been a char(1), but that it was being used as a Boolean flag, which meant it should actually have been a bit.
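To put that in code, here is a minimal sketch of the three choices side by side (the table and column names are made up purely for illustration):

-- Hypothetical example: three ways of storing the same yes/no flag
CREATE TABLE dbo.FlagExample
(
    FlagAsVarchar varchar(1), -- 1 byte of data + 2 bytes of variable-length overhead = 3 bytes
    FlagAsChar    char(1),    -- 1 byte
    FlagAsBit     bit         -- up to 8 bit columns in a row share a single byte
);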
Another common decision is to use a GUID as a clustered index key. I’m not going to talk about all the reasons why this can be a bad idea in this section, as I want to talk more about it when I discuss the subject of indices (see day 4).
What I will say is that at 16 bytes it creates a very wide clustered key, which might not seem a problem until you consider that every nonclustered index has to include the clustered key. This is why many data architects will choose a surrogate key such as an integer-based identity attribute: at only 4 bytes it is some 12 bytes smaller per row in every index on that table. Every index, unless it is a filtered index, will have one row in the index per row in the physical table, even if the data is not populated.
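A quick back-of-the-envelope calculation (illustrative figures only, not taken from the demo tables below) shows how those 12 bytes add up across 10 million rows:

-- 12 bytes saved per row, per nonclustered index, on a 10 million row table
SELECT CAST(10000000 AS bigint) * (16 - 4)             AS BytesSavedPerIndex,
       CAST(10000000 AS bigint) * (16 - 4) / 1048576.0 AS MBSavedPerIndex; -- roughly 114 MB per index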
Dates are a popular way to waste space; many applications either do not need the granularity of time that datetime provides, or do not need the time element at all.
DateTime is 8 bytes
SmallDatetime is 4 bytes
Date is 3 bytes
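You can confirm those sizes for yourself with DATALENGTH (the date type needs SQL Server 2008 or later):

DECLARE @dt  datetime      = GETDATE(),
        @sdt smalldatetime = GETDATE(),
        @d   date          = GETDATE();

SELECT DATALENGTH(@dt)  AS DateTimeBytes,      -- 8
       DATALENGTH(@sdt) AS SmallDateTimeBytes, -- 4
       DATALENGTH(@d)   AS DateBytes;          -- 3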
Let’s create a fictional table that exhibits some of the problems I have outlined, over a not unreasonable 10 million records.
I’ll also do the same, but with a slightly more thoughtfully designed table.
Let’s take a look at the size of these two tables:
As you can see, the second table is smaller, which will benefit us in a number of ways, just from making some informed choices about the data types we use.
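If you want to check the numbers for yourself, once you have run the scripts at the end of this post a couple of calls to sp_spaceused will give you the same comparison:

EXEC sp_spaceused 'dbo.RubbishDataTypes';
EXEC sp_spaceused 'dbo.BetterDataTypes';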
The next screenshot shows what this means for SQL Server. Yes, I know using “SELECT *” is bad, but this is to illustrate a point, and you can bet it is happening in production systems right across the world as you read this.
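If you would rather see the effect on your own instance than take my word for it, turning on the IO and time statistics before running the same SELECT * against each table will show the difference in reads and elapsed time (be prepared for 10 million rows to come back from each query):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM dbo.RubbishDataTypes;
SELECT * FROM dbo.BetterDataTypes;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;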
To end this section on a positive note, I’m going to spin what we’ve covered on its head. Optimising your table design before you start allows you to:
· Minimise the CPU overhead.
· Increase the data that can be stored in the buffer cache.
· Reduce the amount of physical disk IO.
· Reduce the amount of network traffic.
· Reduce the data file size.
· Reduce the working size of the transaction log.
· Reduce the backup file sizes.
· Keep transaction time to a minimum.
If you’re a developer or architect, hopefully this has been of some use to you and you’ll now be able to appreciate the knock-on effect that table design has on the resource usage of the entire SQL Server instance. If you’re a DBA, use resources like this to help educate your development team and make your life easier!
I’ve included the create table and population code below so you can recreate this for yourself. Just don’t do it on a production server, and make sure you have plenty of space (around 5GB). You’ll need to run these scripts if you want to follow the demo in tomorrow’s post on scalar UDFs.
Code
Pre-requisites
I created a database called TableDesign_Bad; you can put this anywhere as long as you have the space.
I also created a Numbers table in a database called DBAControl; again, this can go anywhere as long as you update the three-part name in the code.
Creating a numbers table
USE DBAControl;
GO

CREATE TABLE dbo.Numbers (n int);
GO

INSERT INTO dbo.Numbers (n)
SELECT TOP (1000000) ROW_NUMBER() OVER(ORDER BY SC1.COLUMN_ID)
FROM SYS.COLUMNS SC1
CROSS JOIN SYS.COLUMNS SC2
CROSS JOIN SYS.COLUMNS SC3;

SELECT COUNT(*) FROM dbo.Numbers;
Creating RubbishDataTypes table
USE TableDesign_Bad;
GO

CREATE TABLE dbo.RubbishDataTypes(
    RubbishID      UNIQUEIDENTIFIER PRIMARY KEY CLUSTERED,
    RubbishDate    DATETIME,
    RubbishBoolean VARCHAR(1),
    RandomAttr1    CHAR(10),
    RandomAttr2    CHAR(10));
GO

CREATE NONCLUSTERED INDEX ix_Bad_Date ON dbo.RubbishDataTypes(RubbishDate);
GO

DECLARE @Loop TINYINT;
SET @Loop = 1;

WHILE @Loop <= 100
BEGIN
    CHECKPOINT;
    BEGIN TRANSACTION;

    INSERT INTO dbo.RubbishDataTypes
        (RubbishID
        ,RubbishDate
        ,RubbishBoolean
        ,RandomAttr1
        ,RandomAttr2)
    SELECT TOP (100000)
        NEWID(),
        GETDATE(),
        1,
        REPLICATE('A',10),
        REPLICATE('B',10)
    FROM DBAControl.dbo.Numbers N1;

    COMMIT TRANSACTION;
    SET @Loop = @Loop + 1;
END

SELECT COUNT(*) FROM dbo.RubbishDataTypes;
EXEC sp_spaceused 'dbo.RubbishDataTypes';
Creating BetterDataTypes table
USE TableDesign_Bad;
GO

CREATE TABLE dbo.BetterDataTypes(
    SurrogateID   INT NOT NULL IDENTITY(1,1),
    BetterID      UNIQUEIDENTIFIER NOT NULL,
    BetterDate    DATE,
    BetterBoolean BIT,
    RandomAttr1   CHAR(10),
    RandomAttr2   CHAR(10)
) ON [DEFAULT];
GO

CREATE UNIQUE CLUSTERED INDEX IX_BetterDataTypes
    ON dbo.BetterDataTypes (SurrogateID)
    WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
          ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
    ON [DEFAULT];
GO

ALTER TABLE dbo.BetterDataTypes
    ADD CONSTRAINT PK_BetterDataTypes PRIMARY KEY NONCLUSTERED (BetterID)
    WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
          ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
    ON [DEFAULT];
GO

CREATE NONCLUSTERED INDEX ix_BetterDate
    ON dbo.BetterDataTypes(BetterDate)
    WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
          ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
    ON [DEFAULT];
GO

DECLARE @Loop TINYINT;
SET @Loop = 1;

WHILE @Loop <= 100
BEGIN
    CHECKPOINT;
    BEGIN TRANSACTION;

    INSERT INTO dbo.BetterDataTypes
        (BetterID
        ,BetterDate
        ,BetterBoolean
        ,RandomAttr1
        ,RandomAttr2)
    SELECT TOP (100000)
        NEWID(),
        GETDATE(),
        1,
        REPLICATE('A',10),
        REPLICATE('B',10)
    FROM DBAControl.dbo.Numbers N1;

    COMMIT TRANSACTION;
    SET @Loop = @Loop + 1;
END

SELECT COUNT(*) FROM dbo.BetterDataTypes;
EXEC sp_spaceused 'dbo.BetterDataTypes';
Creating SmallRubbishDataTypes table
USE [TableDesign_Bad];
GO

SET ANSI_NULLS ON;
GO
SET QUOTED_IDENTIFIER ON;
GO
SET ANSI_PADDING ON;
GO

SELECT TOP (10000) *
INTO [dbo].[SmallRubbishDataTypes]
FROM [dbo].[RubbishDataTypes];

SET ANSI_PADDING OFF;
GO

SELECT COUNT(*) FROM [dbo].[SmallRubbishDataTypes];
GO