Data Compression

Data compression is often misunderstood as a simple trade-off: you spend CPU in exchange for a smaller size on disk.  That’s somewhat true, but that simple explanation ignores other savings that often result in a net drop in CPU utilization.

Full disclosure: This is an Enterprise-ONLY feature introduced in SQL Server 2008.  It is ingrained in the structure of your data, so it also means you can’t take a backup of a database that has a compressed index and restore it to anything other than Enterprise or Developer Editions.

Here are the simple facts we’ll play with:

  • Two levels – row-level and page-level
  • Page-level is row-level plus extra compression
  • Compression ratios vary by column types, index width, and data
  • Data is compressed both on disk and in memory
  • CPU goes both ways, and it needs to be tested
    • Uses CPU to compress on writes and index maintenance
    • Uses CPU to decompress when used by a query
    • Saves CPU with fewer physical reads
    • Saves CPU with fewer logical reads
    • And more…

Abstract Thought

This post is at a level of abstraction that doesn’t get into what happens in the background.  My goal is to encourage you to test it out, understand why it helps, and be able to explain that in your change control process.

For those of you who aren’t satisfied with my “Gas pedal make car go fast” explanation, Jes Borland (b|t) wrote A Look Inside SQL Server Row and Page Compression, and Brad McGehee (b|t) wrote A Quick Introduction to Data Compression in SQL Server 2008.

You can even dive into more details such as using different levels of compression on each partition of an index, or even talking to Joey D’Antoni (b|t) about the archival levels of compression on columnstore indexes.

There’s a lot of detail out there on how compression can cost CPU, but the ways it saves CPU are typically only mentioned in passing without a deep dive into the topic.  Data Compression: Strategy, Capacity Planning and Best Practices mentions that less logical I/O means less CPU consumed.  SQL Server Database Compression indirectly mentions the benefit of a smaller B+Tree structure.

The purpose of this post isn’t to earn anyone a doctorate (or to claim that I’m at that level); it’s aimed at the practitioner level.

What’s it do?

Each page is still the same 8 KB size but contains more data per page, as per Captain Obvious.  That means less space on disk and in backups.  Those are nice, but I don’t care too much about that.

Then you read the data into memory so queries can use it.  That’s a physical I/O from disk that goes through the CPU (using extra CPU to decrypt it if you use TDE) to make it into memory.  The data stays compressed in memory, so all of your indexes (not just this one) have more room to hang around and avoid more physical I/Os and the costs I just mentioned.

Finally, a query needs to use the data, and that has positives (+) and negatives (-).  The data is more likely to be in cache (+) because it’s smaller and a page holding more data is more likely to be referenced.  It’s easier to get into cache if it wasn’t there already (+).  Then it’s easier to get to the data because the smaller data may have fewer levels in the B+Tree (+).  Along the way it has to decompress the root and intermediate-level pages (-), which are always row-level compressed when you use any level of compression, then decompress the leaf-level pages (-), which are compressed at the level you picked.  However, there are fewer pages, which results in less logical I/O (+).
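
If you want to see the B+Tree effect for yourself, sys.dm_db_index_physical_stats reports the index depth and leaf page count, and comparing those numbers before and after compressing a copy in non-prod tells you whether you actually dropped a level or just shed pages.  This is only a sketch; dbo.YourTable is a placeholder name.

-- Index depth and leaf page count; run before and after compressing (non-prod).
SELECT i.name AS index_name,
       ips.index_depth,   -- levels in the B+Tree, including the leaf level
       ips.page_count     -- leaf-level pages; fewer pages means less logical I/O
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id;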

You’re not going to accurately figure out the positives and negatives of that last paragraph.  The important part is that you know there are positives AND negatives, which means you put away the calculus and just run some tests.

My experience is that if the data compresses by 25% or more, it helps more than it hurts.  Find out how much you’ll save by running sp_estimate_data_compression_savings for both row-level and page-level compression.  If you don’t get much extra compression with page-level then don’t even test it; it’s an added expense that needs to be justified.
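
Here’s roughly what that looks like; dbo.YourTable is a placeholder, and passing NULL for @index_id and @partition_number estimates every index and partition on the table.

-- Estimate row-level savings for every index on the table.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'YourTable',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'ROW';

-- Run it again with 'PAGE' and compare the two result sets.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'YourTable',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';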

What Compresses Well?

The hard way is to analyze each column, its data type, the data in that column, the width of the index, and so on.  You can read the links in the Abstract Thought section to see what Brad and Jes have to say about it if you want.  This is very important if you’re designing tables with compressibility in mind, but less so if you’re compressing indexes that already exist.

The easy way (my personal favorite) is to run the sp_estimate_data_compression_savings procedure I just mentioned and then actually compress the indexes on a non-prod server.  Computers are good at math; let them do it.
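
When you do compress the copy in non-prod, a query along these lines (again a sketch, with dbo.YourTable as a placeholder) gives you the real size per index so you can compare it to prod.

-- Actual size per index in pages and MB; run before and after compressing.
SELECT i.name AS index_name,
       SUM(ps.used_page_count)               AS used_pages,
       SUM(ps.used_page_count) * 8 / 1024.0  AS used_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.YourTable')
GROUP BY i.name;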

How to Test

I’m not diving deep into testing here, but there are three things that stand out.

  • How much memory are you saving?
  • How do your queries perform?
  • How much faster is the data pulled from disk?

For how much memory you would save, look at my Cleaning Up the Buffer Pool post to see how much memory that index is using.  Since you’re only changing how much space the data takes, not the columns in the index, you can just multiply that by the new compression ratio.  Use the actual ratio, comparing the index size in prod to the compressed size where you’re testing in non-prod, so the estimate stays accurate.  Yes, if you have a 10 GB index that tends to sit 100% in cache and you just compressed it by 80%, it will be like you added 8 GB of memory in many ways.
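
It’s not the exact query from that post, but a sketch along these lines shows how much of the buffer pool each index of a table is holding (dbo.YourTable is a placeholder, and the hobt_id join only covers in-row and row-overflow data, which is usually what matters here).

-- Buffer pool pages held per index for one table in the current database.
SELECT i.name AS index_name,
       COUNT(*)               AS pages_in_cache,
       COUNT(*) * 8 / 1024.0  AS mb_in_cache
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
  ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
  ON p.hobt_id = au.container_id   -- in-row and row-overflow allocation units
JOIN sys.indexes AS i
  ON i.object_id = p.object_id
 AND i.index_id  = p.index_id
WHERE bd.database_id = DB_ID()
  AND p.object_id    = OBJECT_ID('dbo.YourTable')
GROUP BY i.name;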

I test query performance and how much faster the data is pulled from disk together, because that’s how it works in the real world.  Pick the queries that hit that index, typically by looking in the plan cache or an XEvent session.  Then run them both with and without DBCC DROPCLEANBUFFERS, and again, only on a non-prod server.
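
One simple pattern for that test looks something like this; the query itself is whatever you pulled from the plan cache, and it belongs on a non-prod server because DBCC DROPCLEANBUFFERS empties the buffer pool for everything.

-- Non-prod only.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC DROPCLEANBUFFERS;         -- cold cache: the next run pays the physical reads
-- run your query here

-- run your query here again   -- warm cache: data is already in memory

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;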

You can remove compression from any index, down to the partition level, by running ALTER INDEX … REBUILD WITH (DATA_COMPRESSION = NONE).  Adding compression is the same statement with ROW or PAGE instead of NONE.
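
As a sketch with placeholder names (IX_YourIndex on dbo.YourTable, partition 3), that looks like this.

-- Compress a whole index at the page level.
ALTER INDEX IX_YourIndex ON dbo.YourTable
REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Or target a single partition of a partitioned index.
ALTER INDEX IX_YourIndex ON dbo.YourTable
REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = ROW);

-- Back it out by rebuilding with NONE.
ALTER INDEX IX_YourIndex ON dbo.YourTable
REBUILD WITH (DATA_COMPRESSION = NONE);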

Sum It Up

Do this all in non-prod.

  1. See what compresses well
  2. Test it
  3. Test it again

The End

Let’s hear from you.  If you needed more data to make an informed choice, throw it in the comments where others can benefit from your experience, and I may even edit the post to add it in.  Also, if this is making a big difference for a lot of people, I’ll do what I can to tweak the SEO and help more people find this through search engines.

The best compliment is a question.  It means you value my expertise enough to want my thoughts and opinions.

Filed under: Buffer Pool, General Chat, Indexes, SQL Server, Tuning
