December 26, 2019 at 5:16 pm
They can play tricks in the engine to use less space in a way you cannot with row storage. The index itself can be smaller than a rowstore index.
December 26, 2019 at 5:54 pm
The paper I mentioned does give some insight into what happens under the covers. I could not find their previous paper, which would give even further insight into the compression algorithm.
Although it would seem to some that a "look-up table holding unique values and tokenized with a Primary Key on its Identity column" would behave the same way, it is not that simple: the engine deals with them differently, and even with integers the VertiPaq engine will, where possible, do things that help reduce the on-disk size of each key.
This link explains it in a bit more detail with particular examples: https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3
As people start looking at how this engine works, they may also decide to change their tables and how some columns are defined, at the cost of a slight code change.
Take, for example, a datetime column. VertiPaq stores the distinct values of the full year/month/day/hour/minute/second combination, which gives a considerable number of possible distinct values. Separating the column into its date and time components reduces the possible distinct values in each column, so compression is higher (i.e., a shorter dictionary).
See https://www.sqlbi.com/articles/optimizing-high-cardinality-columns-in-vertipaq/ for examples. Although targeted at SSAS Tabular/PowerPivot, the same technique can be applied to SQL Server in some cases.
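To put numbers on it, here is a rough back-of-the-envelope illustration of the dictionary sizes involved (my own sketch, assuming one row per second over 5 years, as in the test case later in this thread):
-- possible distinct values before and after splitting a datetime column
select cast(5 * 365.25 as int) * 86400 as distinct_datetimes -- 157,766,400
, cast(5 * 365.25 as int) as distinct_dates -- only 1,826
, 86400 as distinct_times -- only 86,400 (seconds in a day)
Each of the split columns needs a dictionary orders of magnitude shorter than the one a single datetime column would require.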
December 26, 2019 at 10:31 pm
They can play tricks in the engine to use less space in a way you cannot with row storage. The index itself can be smaller than a rowstore index.
Steven, where do you think that extra space saving could come from?
Do you have a test case to prove it?
_____________
Code for TallyGenerator
December 26, 2019 at 10:43 pm
Frederico, do you have any test case to support your speculations?
I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.
_____________
Code for TallyGenerator
December 26, 2019 at 11:40 pm
Steve Jones - SSC Editor wrote: Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few distinct values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.
Can you please explain to me how that is different from (and more effective than) a look-up table holding unique values and tokenized with a Primary Key on its Identity column?
Is that a serious question? We can trust that, because the SQL engine handles and maintains the dictionaries at a low level, it will be both transparent to the end users (in this case DBAs and developers) and deliver better performance.
That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?
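For contrast, the manual alternative being described would look something like this (a hypothetical sketch - the table and column names are mine, not from any post above):
create table GenderLookup
(GenderId tinyint identity primary key
,Gender varchar(10) unique
)
create table Person
(PersonId int
,GenderId tinyint references GenderLookup (GenderId)
)
-- every read now needs an explicit join to decode the token
select p.PersonId, g.Gender
from Person p
inner join GenderLookup g
on g.GenderId = p.GenderId
A columnstore index builds an equivalent dictionary internally and decodes it for you, so no join ever appears in user code.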
December 27, 2019 at 12:36 am
I should not have needed to do this - all the links I supplied have this info, one way or another, and more.
Small test case - 5 years' worth of data - all days * all seconds of the day.
-- test1: a single datetime column
create table test1
(datefull datetime
)
create clustered columnstore index cci_test1 on test1 with (data_compression = columnstore)
-- test2: the same instant split into its date and time components
create table test2
(dateonly date
,timeonly time
)
create clustered columnstore index cci_test2 on test2 with (data_compression = columnstore)
declare @startdate datetime = '1899-12-31';
declare @years int = 5;
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
, E2(N) AS (SELECT 1 FROM E1 a, E1 b)
, E4(N) AS (SELECT 1 FROM E2 a, E2 b)
, E6(N) AS (SELECT 1 FROM E4 a, E2 b)
, secs(n, sec) as (select top 86400 *
, ROW_NUMBER() over(order by n) as sec
from e6) -- number seconds in a day
, days(n, day) as (select top (convert(int, @years * 365.25)) *
, ROW_NUMBER() over(order by n) as day
from e6)
-- insert full date into test1
insert into test1
select final.newdate
from days y1
outer apply (select dateadd(day, y1.day, @startdate) as basedate) base
outer apply (select dateadd(second, secs.sec, base.basedate) as newdate
from secs) final
order by final.newdate
-- from test1, insert into test2 splitting date and time
insert into test2
select convert(date, datefull), convert(time, datefull)
from test1
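-- sanity check (my addition, not in the original post): both tables should hold
-- 1826 days * 86400 seconds = 157,766,400 rows, matching the size report below
select count_big(*) as test1_rows from test1
select count_big(*) as test2_rows from test2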
-- now report on sizes of both tables
select pa.tablename
, pa.indexname
, css.column_id
, min(min_data_id) as min_data_id
, max(max_data_id) as max_data_id
, sum(row_count) as row_count
, sum(on_disk_size) as on_disk_size
, count(*) as number_segments
, max(magnitude) as magnitude
from sys.column_store_segments css
outer apply (select top 1 ix.name as indexname
, object_name(pa.object_id) as tablename
from sys.partitions pa
inner join sys.indexes ix
on ix.index_id = pa.index_id
and ix.object_id = pa.object_id
where pa.partition_id = css.partition_id
) pa
group by pa.tablename
, pa.indexname
, css.column_id
select s.Name as schemaname
, t.Name as tablename
, p.rows as rowcounts
, cast(round((sum(a.used_pages) / 128.00), 2) as numeric(36, 2)) as used_mb
, cast(round((sum(a.total_pages) - sum(a.used_pages)) / 128.00, 2) as numeric(36, 2)) as unused_mb
, cast(round((sum(a.total_pages) / 128.00), 2) as numeric(36, 2)) as total_mb
from sys.tables t
inner join sys.indexes i
on t.object_id = i.object_id
inner join sys.partitions p
on i.object_id = p.object_id
and i.index_id = p.index_id
inner join sys.allocation_units a
on p.partition_id = a.container_id
inner join sys.schemas s
on t.schema_id = s.schema_id
group by t.Name
, s.Name
, p.Rows
order by s.Name
, t.Name
Output of the above queries:
tablename | indexname | column_id | min_data_id | max_data_id | row_count | on_disk_size | number_segments | magnitude |
---|---|---|---|---|---|---|---|---|
test2 | cci_test2 | 1 | 693595 | 695421 | 157766400 | 114032 | 151 | -1 |
test2 | cci_test2 | 2 | 0 | 863990000000 | 157766400 | 420799808 | 151 | 10000000 |
test1 | cci_test1 | 1 | 300 | 7842610282496 | 157766400 | 1262222224 | 151 | 1 |
schemaname | tablename | rowcounts | used_mb | unused_mb | total_mb |
---|---|---|---|---|---|
dbo | test1 | 157766400 | 1231.56 | 2.20 | 1233.77 |
dbo | test2 | 157766400 | 411.91 | 3.42 | 415.33 |
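Reading the segment output: the date column of test2 compresses to roughly 0.1 MB (it holds fewer than 2,000 distinct values), the time column to roughly 400 MB, while the single datetime column of test1 takes roughly 1.2 GB - about three times the size of the split version, as the table size report confirms.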
December 27, 2019 at 2:25 am
Sergiy wrote: Steve Jones - SSC Editor wrote: Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few distinct values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.
Can you please explain to me how that is different from (and more effective than) a look-up table holding unique values and tokenized with a Primary Key on its Identity column?
Is that a serious question? We can trust that, because the SQL engine handles and maintains the dictionaries at a low level, it will be both transparent to the end users (in this case DBAs and developers) and deliver better performance.
That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?
Are you trying to dupe me into some religious cult?
Sorry, I don't do cults.
_____________
Code for TallyGenerator
December 27, 2019 at 2:39 am
Small test case - 5 years' worth of data - all days * all seconds of the day.
Thanks Frederico. Cool stuff.
I only made one change to your script:
create clustered index cci_test1 on test1 (datefull)
as we're trying to compare use cases of columnstore vs. b-tree indexes.
I hope you don't mind.
Now, let's say we need to find some event(s) which happened on a Christmas night in a year - say, 1900.
For Test1 I've got no question:
select * from Test1
where datefull >= '19001224 19:00' and datefull < '19001225 07:00'
How do you write a query returning the same set for Test2?
_____________
Code for TallyGenerator
December 27, 2019 at 3:12 am
ZZartin wrote: Sergiy wrote: Steve Jones - SSC Editor wrote: Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few distinct values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.
Can you please explain to me how that is different from (and more effective than) a look-up table holding unique values and tokenized with a Primary Key on its Identity column?
Is that a serious question? We can trust that, because the SQL engine handles and maintains the dictionaries at a low level, it will be both transparent to the end users (in this case DBAs and developers) and deliver better performance.
That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?
Are you trying to dupe me into some religious cult?
Sorry, I don't do cults.
/shrug I think the fact that dictionaries are maintained transparently in the internal guts of columnstore indexes, versus having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.
December 27, 2019 at 7:52 am
/shrug I think the fact that dictionaries are maintained transparently in the internal guts of columnstore indexes, versus having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.
Is doing less typing your only concern when you choose a database design approach?
_____________
Code for TallyGenerator
December 27, 2019 at 11:02 am
Frederico, do you have any test case to support your speculations?
I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.
The best place to learn about ColumnStore is from Niko Neugebauer. Niko has examples that show how the compression works.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 27, 2019 at 12:11 pm
Again, some basics - a few options to do it, certainly not exhaustive. If anyone has a "faster" method, please supply it and I can test it on the same machine to compare performance.
Server is an Intel i7 with 4 GB allocated to SQL Server - 4 cores/8 HT - MAXDOP 4, cost threshold for parallelism 50.
Locally attached SSD drive.
I've also created two new tables - same columns and data as the original two, but these were created with a normal clustered index with page compression.
Table sizes below:
schemaname | tablename | rowcounts | used_mb | unused_mb | total_mb |
---|---|---|---|---|---|
dbo | test1 | 157766400 | 1231.56 | 2.20 | 1233.77 |
dbo | test1_rowstore | 157766400 | 1678.48 | 0.32 | 1678.80 |
dbo | test2 | 157766400 | 411.91 | 3.42 | 415.33 |
dbo | test2_rowstore | 157766400 | 1683.83 | 0.22 | 1684.05 |
set statistics io on
set statistics time on
set nocount on
print '-- test1 - datefull'
select *
from Test1
where datefull >= '19001224 19:00'
and datefull < '19001225 07:00'
-- this is the method I would use
print '-- test2 - method 1 - dateadd'
select *
from Test2
where dateonly >= '19001224'
and dateonly <= '19001225'
and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) >= '19001224 19:00')
and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) < '19001225 07:00')
-- we could also do it like this, but it's not an option I would pick unless absolutely required for performance
print '-- test2 - method 2 - union all'
select *
from Test2
where dateonly > '19001224'
and dateonly < '19001225' -- full days strictly between the endpoints (none in this example)
union all
select *
from Test2
where (dateonly = '19001224' and timeonly >= '19:00')
union all
select *
from Test2
where (dateonly = '19001225' and timeonly < '07:00')
print '-- test2 - method 3 - or condition'
-- the following method would most often be the one used by developers - not the best option, as predicates aren't pushed down to the VertiPaq engine the same way,
-- resulting in more reads
select *
from Test2
where dateonly > '19001224'
and dateonly < '19001225'
or (dateonly = '19001224' and timeonly >= '19:00')
or (dateonly = '19001225' and timeonly < '07:00')
print '-- rowstore test1'
select *
from Test1_rowstore
where datefull >= '19001224 19:00'
and datefull < '19001225 07:00'
-- this is the method I would use
print '-- rowstore test2 - method 1 - dateadd'
select *
from Test2_rowstore
where dateonly >= '19001224'
and dateonly <= '19001225'
and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) >= '19001224 19:00')
and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) < '19001225 07:00')
Timings - warm cache first:
-- test1 - datefull
Table 'test1'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6291, lob physical reads 0, lob read-ahead reads 0.
Table 'test1'. Segment reads 3, segment skipped 148.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 222 ms.
-- test2 - method 1 - dateadd
Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 2856, lob physical reads 0, lob read-ahead reads 0.
Table 'test2'. Segment reads 4, segment skipped 147.
SQL Server Execution Times: CPU time = 47 ms, elapsed time = 324 ms.
-- test2 - method 2 - union all
Table 'test2'. Scan count 12, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 8562, lob physical reads 0, lob read-ahead reads 0.
Table 'test2'. Segment reads 12, segment skipped 441.
SQL Server Execution Times: CPU time = 16 ms, elapsed time = 242 ms.
-- test2 - method 3 - or condition
Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 106995, lob physical reads 0, lob read-ahead reads 0.
Table 'test2'. Segment reads 151, segment skipped 0.
SQL Server Execution Times: CPU time = 922 ms, elapsed time = 317 ms.
-- rowstore test1
Table 'test1_rowstore'. Scan count 1, logical reads 64, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 232 ms.
-- rowstore test2 - method 1 - dateadd
Table 'test2_rowstore'. Scan count 1, logical reads 240, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 62 ms, elapsed time = 219 ms.
Now the cold timings - after an instance restart:
-- test1 - datefull
Table 'test1'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6291, lob physical reads 0, lob read-ahead reads 19317.
Table 'test1'. Segment reads 3, segment skipped 148.
SQL Server Execution Times: CPU time = 46 ms, elapsed time = 218 ms.
-- test2 - method 1 - dateadd
Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 2856, lob physical reads 0, lob read-ahead reads 10040.
Table 'test2'. Segment reads 4, segment skipped 147.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 253 ms.
-- test2 - method 2 - union all
Table 'test2'. Scan count 12, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 8565, lob physical reads 0, lob read-ahead reads 0.
Table 'test2'. Segment reads 12, segment skipped 441.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 299 ms.
-- test2 - method 3 - or condition
Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 106995, lob physical reads 0, lob read-ahead reads 367381.
Table 'test2'. Segment reads 151, segment skipped 0.
SQL Server Execution Times: CPU time = 1234 ms, elapsed time = 579 ms.
-- rowstore test1
Table 'test1_rowstore'. Scan count 1, logical reads 64, physical reads 3, read-ahead reads 60, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 32 ms, elapsed time = 191 ms.
-- rowstore test2 - method 1 - dateadd
Table 'test2_rowstore'. Scan count 1, logical reads 240, physical reads 3, read-ahead reads 236, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 47 ms, elapsed time = 269 ms.
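A quick read of those stats: on test2, method 1 (dateadd) reads 2,856 LOB pages and skips 147 of 151 segments, while method 3 (the OR form) reads 106,995 LOB pages and scans all 151 segments - exactly the predicate-pushdown difference called out in the script comments.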
December 27, 2019 at 1:57 pm
ZZartin wrote: /shrug I think the fact that dictionaries are maintained transparently in the internal guts of columnstore indexes, versus having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.
Is doing less typing your only concern when you choose a database design approach?
Not needlessly reinventing the wheel is a concern. Especially when the proposed alternative is clunkier and harder to deal with than just letting a columnstore work as intended.
December 27, 2019 at 3:23 pm
Sergiy wrote: Frederico, do you have any test case to support your speculations?
I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.
The best place to learn about ColumnStore is from Niko Neugebauer. Niko has examples that show how the compression works.
Lordy. I've obviously not read all of that (129 items currently in his top level TOC) but the man does appear to have covered everything that one could ask. Thanks for the link, Grant. Bookmarked for sure!
--Jeff Moden
Change is inevitable... Change for the better is not.
December 27, 2019 at 3:27 pm
I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.
That's very true. And, sorry to say, I've never done a Yabingooducklehoo to see if such a thing exists.
--Jeff Moden
Change is inevitable... Change for the better is not.