Continuation from the previous 49 parts, starting from http://www.nikoport.com/2013/07/05/clustered-columnstore-indexes-part-1-intro/
This blog post is focused on the reading part of the IO subsystem of the Columnstore Indexes. I will try to explain the improvements in the reading process, as well as some of the differences from the traditional Row Storage.
The key to the reading performance of the Columnstore Indexes is the read-ahead mechanism of SQL Server. When SQL Server (Enterprise Edition) is performing a sequential scan on a table with a significant amount of data, the storage engine tries to improve performance by issuing read-ahead calls, which ensure that once the CPU is ready to process the information, it has already been pre-fetched into memory and is ready to be worked on.
The read-ahead will try to acquire up to 500 database pages ahead of the scan being performed, while combining up to 64 contiguous pages into a single read. In practice this means that SQL Server is capable of acquiring from 8KB up to 512KB of information in a single read while performing a read-ahead operation.
This information is correct for the traditional Row Store, and the limitation makes a lot of sense, but since the introduction of Columnstore Indexes in SQL Server 2012 there has been a very significant change to this picture.
As you should know, Columnstore Indexes are compressed with a number of different algorithms, with the final layer being compressed with xVelocity (also known as VertiPaq), and the compressed binary data being stored on disk contiguously.
Additionally, since compressed data is never updated in place (all new rows are inserted into new Delta-Stores, which might or might not eventually become compressed segments), there is a guarantee that the compressed Segments will never be modified.
This architecture gives an enormous advantage to any read-ahead operations on Columnstore Indexes, and it is naturally put to good use inside SQLOS – the read-ahead limit for Columnstore Indexes is actually 16 times bigger than for the RowStore structures: a single read-ahead for Columnstore Indexes can acquire up to 8MB of data.
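Just to put those two limits side by side, here is a trivial arithmetic check based purely on the numbers above:

-- Rowstore read-ahead: up to 64 contiguous pages of 8 KB = 512 KB per read
-- Columnstore read-ahead: up to 8 MB per read
SELECT 64 * 8 AS RowstoreMaxReadKB,
	8 * 1024 AS ColumnstoreMaxReadKB,
	(8 * 1024) / (64 * 8) AS ReadAheadRatio;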
This is an improvement of a very significant scale – 16 times more data can be prepared for the CPU, which at the same time can process the Columnstore Index information at a higher speed, provided we are able to get the Batch Processing Mode for our Columnstore Index Scan operation.
For the demo part, this time I decided to go in a different direction (because this is post number 50 in the series, haha :)) and take some of the Stack Exchange data, the Stack Exchange Data Dump to be more precise.
For the details on the download and the data import please visit Brent Ozar’s blogpost on how to query StackOverflow.com.
Basically after downloading the dumps, you will need to use the StackOverflow Data Dump Importer, developed by Jeremiah Peschka.
For the test, I have chosen to query the PostHistory.Comments table, which in my case contains over 36 Million rows – 36.585.420 rows to be more precise. To test the reading capacity of Columnstore Indexes, I have decided to create a slightly modified copy of the original table with just 5 columns, calling it PostHistory.Comments_CCI_Modified, using the following script. It loads the data directly into a table with a Clustered Columnstore Index, this way lowering the total loading time, since less information is written to disk (writes happen only after compression, and my TempDB is on a quite fast SSD drive).
CREATE TABLE [PostHistory].[Comments_CCI_Modified](
	[Id] [INT] NOT NULL,
	[CreationDate] [DATETIME] NOT NULL,
	[PostId] [INT] NOT NULL,
	[Score] [INT] NULL,
	[UserId] [INT] NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_Comments_Modified
	ON [PostHistory].[Comments_CCI_Modified];

INSERT INTO [PostHistory].[Comments_CCI_Modified]
	SELECT Id, CreationDate, PostId, Score, UserId
		FROM [PostHistory].[Comments];
Before advancing any further, I have also created a slightly modified Rowstore version of the Comments table, which is compressed with PAGE compression:
CREATE TABLE [PostHistory].[Comments_Modified](
	[Id] [INT] NOT NULL PRIMARY KEY,
	[CreationDate] [DATETIME] NOT NULL,
	[PostId] [INT] NOT NULL,
	[Score] [INT] NULL,
	[UserId] [INT] NULL
) WITH (DATA_COMPRESSION = PAGE);

INSERT INTO [PostHistory].[Comments_Modified]
	SELECT Id, CreationDate, PostId, Score, UserId
		FROM [PostHistory].[Comments_CCI_Modified];
To make sure that I am not inventing anything, I have also checked the fragmentation level of my Rowstore table, which in my case was ~4% – a number that should not hinder the prefetch process on the Rowstore table to any major extent.
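If you want to verify the fragmentation on your own copy of the data, a simple check along these lines will do (just a sketch, using the table names from this demo):

-- Check the fragmentation of the Rowstore test table (LIMITED mode is enough here)
SELECT index_id,
	index_type_desc,
	avg_fragmentation_in_percent,
	page_count
	FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('PostHistory.Comments_Modified'), NULL, NULL, 'LIMITED');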
You already know the significant difference that we are expecting from the Columnstore compression – let's check on it:
EXEC sp_spaceused 'PostHistory.Comments_CCI_Modified';
EXEC sp_spaceused 'PostHistory.Comments_Modified';
Do you see the difference? If not, then consider the image on the right side – I have created a bar graph for better understanding and visualisation. What? They look way too similar? Indeed they are! This happens because some of the data inside them is not getting compressed much by the current implementation of the Columnstore technology in SQL Server. To be honest, for this exercise this is the very best situation possible – because both the Columnstore & Rowstore Indexes are of a similar size, their scans should perform in a similar, comparable way, producing similar execution times, should their prefetching capabilities be similar.
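If you are curious which columns do (and do not) compress well, you can break the compressed size down per column with something along these lines (a sketch only: it joins through sys.partitions on hobt_id and assumes that the column_id of the Columnstore segments maps directly onto the table columns):

-- Compressed size per column of the Clustered Columnstore table
SELECT col.name AS ColumnName,
	SUM(seg.on_disk_size) / 1024. / 1024 AS TotalSegmentSizeMB
	FROM sys.column_store_segments seg
		INNER JOIN sys.partitions part
			ON part.hobt_id = seg.hobt_id
		INNER JOIN sys.columns col
			ON col.object_id = part.object_id AND col.column_id = seg.column_id
	WHERE part.object_id = OBJECT_ID('PostHistory.Comments_CCI_Modified')
	GROUP BY col.name
	ORDER BY TotalSegmentSizeMB DESC;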
A couple of words about the drive I have chosen – this is a slow drive. How slow it is you can see on the accompanying image on the right side – I have used CrystalDiskMark 3.0.3 to get the results for the sequential reads (90 MB/s) and for the 4K reads (1.120 MB/s).
I bet that if you run this on your local drive you will get better results (haha), but in my case I deliberately wanted to take something slow and measure the difference between RowStore & Columnstore reads, because in the real world you will be reading many Gigabytes or Terabytes of data and not just a couple of Gigabytes like I do in this demo.
For monitoring the behaviour of the read-ahead, I have started the Performance Monitor, adding 2 of the key metrics from the SQLServer:Buffer Manager counter object – Page reads/sec and Readahead pages/sec.
Observing and comparing both of the above specified metrics should lead me to a better understanding of the read-ahead behaviour for both Columnstore & Rowstore Indexes.
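The very same counters are also exposed inside SQL Server through a DMV, so you can sample them with T-SQL if you prefer (a small sketch; note that the "per second" counters in this DMV are cumulative values, so you need to sample them twice and calculate the delta yourself):

-- Buffer Manager counters as seen from inside SQL Server
SELECT RTRIM(object_name) AS CounterObject,
	RTRIM(counter_name) AS CounterName,
	cntr_value AS CumulativeValue
	FROM sys.dm_os_performance_counters
	WHERE object_name LIKE '%Buffer Manager%'
		AND counter_name IN ('Page reads/sec', 'Readahead pages/sec');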
Now all I need to do is to reset my Buffer Pool (with the DBCC DROPCLEANBUFFERS command) and run the queries which will effectively scan both tables in full. Notice that I have included the information on the statistics IO and time, plus I am definitely taking a look at the actual execution plans for both of my queries, making sure that they did scan the whole table with ~36.5 Million rows:
SET STATISTICS IO, TIME ON;

DECLARE @id AS INT,
	@CreationDate AS DATETIME,
	@PostId AS INT,
	@Score AS INT,
	@UserId AS INT;

DBCC DROPCLEANBUFFERS;

SELECT @id = id, @CreationDate = [CreationDate], @PostId = PostId, @Score = Score, @UserId = UserID
	FROM PostHistory.Comments_CCI_Modified;

DBCC DROPCLEANBUFFERS;

SELECT @id = id, @CreationDate = [CreationDate], @PostId = PostId, @Score = Score, @UserId = UserID
	FROM PostHistory.Comments_Modified;
On the left, you will see a picture that is worth a thousand words – the reading (red) and read-ahead (green) performance of the queries, compared.
I don't even care much about the exact read-ahead numbers at this point, because reading 2 similarly sized structures took quite different amounts of time.
Let's check the actual execution plans – maybe something went wrong down there: as you can see in the image on the right, the actual execution plans are similar, we are simply scanning the whole table and reading the last rows into our variables. That's it. And if you are laughing at these lines while pointing at the Batch Execution Mode, then you are badly wrong, since in all my tests the actual execution mode had no influence on the reading IO or the read-ahead prefetching.
The real difference in the execution times or IO is not explained by those 2% that the execution plans are showing.
What about the STATISTICS IO & STATISTICS TIME output that I have activated for the execution:
SQL Server Execution Times:
   CPU time = 8687 ms,  elapsed time = 12900 ms.

SQL Server Execution Times:
   CPU time = 19235 ms,  elapsed time = 31899 ms.
Table 'Comments_CCI_Modified'. Scan count 1, logical reads 165684, physical reads 32, read-ahead reads 580533, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Comments_Modified'. Scan count 1, logical reads 86944, physical reads 2, read-ahead reads 86938, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
If we trust the displayed information :), then we naturally start asking serious questions about why it would make any sense to prefetch ~580K data pages for one data structure, while another data structure of a similar size prefetches less than 87K data pages.
The only explanation that I can come up with is that whenever we are reading the Columnstore Index data pages from the hard drive, we are fetching them into the Columnstore Object Pool, which in turn decompresses the real data into the Buffer Pool, and somewhere between all those operations the reads are being accumulated, plus possibly a couple of bugs …
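By the way, if you want to see how much memory that Columnstore Object Pool is actually holding, you can take a quick look at the memory clerks (a sketch – on my SQL Server 2014 instance the clerk is reported under the type used below, and it will only show up once some Columnstore data has actually been read):

-- Memory currently held by the Columnstore Object Pool
SELECT [type],
	[name],
	pages_kb / 1024. AS SizeInMB
	FROM sys.dm_os_memory_clerks
	WHERE [type] = 'CACHESTORE_COLUMNSTOREOBJECTPOOL';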
To prove the point, I will use one of the queries from my good friend Glenn Berry from SQLSkills:
-- Check sequential throughput results (MB/sec)
SELECT 8 * (32 + 580533) / 12900 AS [MB/sec Sequential Read Throughput];
Should the read-ahead information be correct, we would be talking about 360 MB/sec reads from this external drive, connected to the USB port of my computer. Nah …
The Rowstore information appears to be more correct – 21 MB/second:
-- Check sequential throughput results (MB/sec)
SELECT 8 * (2 + 86938) / 31899 AS [MB/sec Sequential Read Throughput];
The natural first step here would be for the Microsoft Development Team to show more detailed and properly separated information in the next version of SQL Server.
Conclusions
At this point, the best thing to do would be to write a beautiful conclusion about the perfection of Columnstore Indexes, yeah.
But wait, didn’t I mention that the tested table is a modified version of the Stack Overflow Data Dump? What about the complete one?
Sure thing – I have created and tested it as well. The real difference between the original and the modified version is just 1 field: [Text] NVARCHAR(700).
To prepare this full version, please use the following script:
CREATE TABLE [PostHistory].[Comments_CCI](
	[Id] [INT] NOT NULL,
	[CreationDate] [DATETIME] NOT NULL,
	[PostId] [INT] NOT NULL,
	[Score] [INT] NULL,
	[Text] [NVARCHAR](700) NOT NULL,
	[UserId] [INT] NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_Comments
	ON [PostHistory].[Comments_CCI];

INSERT INTO [PostHistory].[Comments_CCI]
	SELECT Id, CreationDate, PostId, Score, [Text], UserId
		FROM [PostHistory].[Comments];
For the Rowstore table, the only thing that was added is the PAGE compression for the primary key:
ALTER TABLE [PostHistory].[Comments]
	ADD CONSTRAINT [PK_Comments]
	PRIMARY KEY CLUSTERED ( [Id] ASC )
	WITH (DATA_COMPRESSION = PAGE);
To check on the respective table sizes, here are the queries:
EXEC sp_spaceused 'PostHistory.Comments_CCI';
EXEC sp_spaceused 'PostHistory.Comments';
Here we have a clear difference: 5.6GB for the Columnstore Index versus ~6.4GB for the RowStore Index. The expectations here are more than clear.
Our test queries are waiting for us:
SET STATISTICS IO, TIME ON;

DECLARE @id AS INT,
	@CreationDate AS DATETIME,
	@PostId AS INT,
	@Score AS INT,
	@Text AS NVARCHAR(700),
	@UserId AS INT;

DBCC DROPCLEANBUFFERS;

SELECT @id = id, @CreationDate = [CreationDate], @PostId = PostId, @Score = Score, @Text = [Text], @UserId = UserID
	FROM PostHistory.Comments_CCI;

DBCC DROPCLEANBUFFERS;

SELECT @id = id, @CreationDate = [CreationDate], @PostId = PostId, @Score = Score, @Text = [Text], @UserId = UserID
	FROM PostHistory.Comments;
This is an extract of the final part of the first query (Columnstore Index) and the beginning of the second query execution (Rowstore Index). In the middle, you can see that the read-aheads (red) drop down, as well as the reads (blue), and then after a couple of seconds they jump back to some similar/comparable numbers.
So for the Columnstore Table I have obtained the following results from the SET STATISTICS command:
Table 'Comments_CCI'. Scan count 1, logical reads 2072298, physical reads 359, read-ahead reads 3317778, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 69640 ms,  elapsed time = 261031 ms.
Table 'Comments'. Scan count 1, logical reads 799977, physical reads 2, read-ahead reads 799995, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 26953 ms,  elapsed time = 260157 ms.
I will ignore the reads & read-aheads information and simply dive into the time numbers: the CPU time is ~70 seconds (Columnstore) vs ~27 seconds (Rowstore), while the total execution times differ by less than 1 second out of roughly 260 seconds – meaning that the difference in elapsed time is only around 0.3%. Yes, a fraction of a percent, nowhere near 4%.
What is going on here? What is the difference? We have just added a column of up to 1400 bytes per row (NVARCHAR(700)) to the table structure, and that should have improved the situation for the Columnstore …
Well, well, well
Let's take a look at the internals of our Columnstore tables, comparing the Row Groups between the 2 tested Columnstore tables:
SELECT OBJECT_NAME(object_id) AS TableName,
	COUNT(*) AS RowGroups,
	AVG(total_rows) AS AverageRowCount
	FROM sys.column_store_row_groups
	WHERE object_id IN ( OBJECT_ID('PostHistory.Comments_CCI'),
						 OBJECT_ID('PostHistory.Comments_CCI_Modified') )
	GROUP BY object_id;
Mamma mia! The number of Row Groups differs by almost 10 times! Looking at the average number of rows per Row Group, I can see a similar difference of almost 10 times! What is going on there? Why are our Row Groups being so unmercifully cut by the engine?
What does the sys.column_store_dictionaries DMV say:
SELECT OBJECT_NAME(part.object_id) AS TableName,
	AVG(dict.on_disk_size) / 1024. / 1024 AS AvgDictionarySize
	FROM sys.column_store_dictionaries dict
		INNER JOIN sys.partitions part
			ON part.hobt_id = dict.hobt_id AND part.partition_id = dict.partition_id
	WHERE part.object_id IN ( OBJECT_ID('PostHistory.Comments_CCI'),
							  OBJECT_ID('PostHistory.Comments_CCI_Modified') )
	GROUP BY part.object_id;
Ladies & Gentlemen, in the case of the FULL table version (Comments_CCI), we have dictionary pressure, showing us the effects of wide character columns.
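To see which column is actually responsible for that pressure, the dictionaries can be broken down per column along the lines of the query above (a sketch only – the mapping of the dictionary column_id onto sys.columns assumes that the Clustered Columnstore Index covers all table columns in their original order):

-- Dictionary count and total size per column of the full Columnstore table
SELECT col.name AS ColumnName,
	COUNT(*) AS Dictionaries,
	SUM(dict.on_disk_size) / 1024. / 1024 AS TotalDictionarySizeMB
	FROM sys.column_store_dictionaries dict
		INNER JOIN sys.partitions part
			ON part.hobt_id = dict.hobt_id
		INNER JOIN sys.columns col
			ON col.object_id = part.object_id AND col.column_id = dict.column_id
	WHERE part.object_id = OBJECT_ID('PostHistory.Comments_CCI')
	GROUP BY col.name
	ORDER BY TotalDictionarySizeMB DESC;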
If we check on the Row Group sizes, we might understand better why the read-ahead cannot take good advantage of them:
SELECT OBJECT_NAME(object_id) AS TableName,
	AVG(size_in_bytes) / 1024. / 1024 AS AvgSizeInMB
	FROM sys.column_store_row_groups
	WHERE object_id IN ( OBJECT_ID('PostHistory.Comments_CCI'),
						 OBJECT_ID('PostHistory.Comments_CCI_Modified') )
	GROUP BY object_id;
The average size of a Row Group differs by almost 8 times between the two Columnstore tables, thanks to the dictionary pressure, and so the read-ahead mechanism cannot take full advantage of its potential – you can't read 8MB ahead if the Row Group is only around 2 MB in size. And I assume that since the engine still tries to read the maximum amount, it incurs an additional resource/time penalty for attempting to access something that does not exist.
To arrive at a final conclusion – yes, Columnstore Indexes indeed have an amazing read-ahead capability, but make sure that you are reading Row Groups that can take full advantage of this mechanism; and if your Columnstore table is suffering from some kind of pressure (memory or dictionary, for example), then be prepared that the reading part will not bring any significant difference compared to the Row Storage.
to be continued with Clustered Columnstore Indexes – part 51 (“SSIS, DataFlow & Max Buffer Memory”)