December 26, 2019 at 5:16 pm
They can play tricks in the engine to use less space in a way you cannot with row storage. The index itself can be smaller than a rowstore index.
December 26, 2019 at 5:54 pm
The paper I mentioned does give some insight into what happens under the covers. I could not find their previous paper, which would give even further insight into the compression algorithm.
Although it would seem to some that a "look-up table holding unique values and tokenized with a Primary Key on its Identity column" would behave the same way, it is not that simple: the engine deals with them differently, and even with integers the VertiPaq engine will, where possible, do things that help reduce the on-disk size of each key.
This link explains it in a bit more detail with particular examples: https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3
As people start looking at how this engine works, they may also decide to change their tables and how some columns are defined, at the cost of a slight code change.
Take, for example, a datetime column. VertiPaq stores the distinct values of the full year/month/day/hour/minute/second combination, which gives a considerable number of possible distinct values. Separating the column into its date and time components reduces the possible distinct values in each column, so compression is higher (i.e., a shorter dictionary).
See https://www.sqlbi.com/articles/optimizing-high-cardinality-columns-in-vertipaq/ for examples. Although targeted at SSAS Tabular/PowerPivot, the same technique can be applied to SQL Server in some cases.
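To put numbers on it, here is a rough back-of-the-envelope illustration of the dictionary sizes involved (my own sketch, assuming one row per second over 5 years, as in the test case later in this thread):
-- possible distinct values before and after splitting a datetime column
select cast(5 * 365.25 as int) * 86400 as distinct_datetimes -- 157,766,400
, cast(5 * 365.25 as int) as distinct_dates -- only 1,826
, 86400 as distinct_times -- only 86,400 (seconds in a day)
Each of the split columns needs a dictionary orders of magnitude shorter than the one a single datetime column would require.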
December 26, 2019 at 10:31 pm
They can play tricks in the engine to use less space in a way you cannot with row storage. The index itself can be smaller than a rowstore index.
Steven, where do you think that extra space saving could come from?
Do you have a test case to prove it?
_____________
Code for TallyGenerator
December 26, 2019 at 10:43 pm
Frederico, do you have any test case to support your speculations?
I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.
_____________
Code for TallyGenerator
December 26, 2019 at 11:40 pm
Steve Jones - SSC Editor wrote: Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few distinct values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.
Can you please explain to me how that is different from (and more effective than) a look-up table holding unique values and tokenized with a Primary Key on its Identity column?
Is that a serious question? We can trust that, because the SQL engine handles and maintains the dictionaries at a low level, it will be both transparent to the end users (in this case DBAs and developers) and deliver better performance.
That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?
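For contrast, the manual alternative being described would look something like this (a hypothetical sketch - the table and column names are mine, not from any post above):
create table GenderLookup
(GenderId tinyint identity primary key
,Gender varchar(10) unique
)
create table Person
(PersonId int
,GenderId tinyint references GenderLookup (GenderId)
)
-- every read now needs an explicit join to decode the token
select p.PersonId, g.Gender
from Person p
inner join GenderLookup g
on g.GenderId = p.GenderId
A columnstore index builds an equivalent dictionary internally and decodes it for you, so no join ever appears in user code.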
December 27, 2019 at 12:36 am
I should not have needed to do this - all the links I supplied have this info, one way or another, and more.
Small test case - 5 years' worth of data - all days * all seconds of the day.
-- test1: a single datetime column
create table test1
(datefull datetime
)
create clustered columnstore index cci_test1 on test1 with (data_compression = columnstore)
-- test2: the same instant split into its date and time components
create table test2
(dateonly date
,timeonly time
)
create clustered columnstore index cci_test2 on test2 with (data_compression = columnstore)
declare @startdate datetime = '1899-12-31';
declare @years int = 5;
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
, E2(N) AS (SELECT 1 FROM E1 a, E1 b)
, E4(N) AS (SELECT 1 FROM E2 a, E2 b)
, E6(N) AS (SELECT 1 FROM E4 a, E2 b)
, secs(n, sec) as (select top 86400 *
, ROW_NUMBER() over(order by n) as sec
from e6) -- number seconds in a day
, days(n, day) as (select top (convert(int, @years * 365.25)) *
, ROW_NUMBER() over(order by n) as day
from e6)
-- insert full date into test1
insert into test1
select final.newdate
from days y1
outer apply (select dateadd(day, y1.day, @startdate) as basedate) base
outer apply (select dateadd(second, secs.sec, base.basedate) as newdate
from secs) final
order by final.newdate
-- from test1, insert into test2 splitting date and time
insert into test2
select convert(date, datefull), convert(time, datefull)
from test1
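-- sanity check (my addition, not in the original post): both tables should hold
-- 1826 days * 86400 seconds = 157,766,400 rows, matching the size report below
select count_big(*) as test1_rows from test1
select count_big(*) as test2_rows from test2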
-- now report on sizes of both tables
select pa.tablename
, pa.indexname
, css.column_id
, min(min_data_id) as min_data_id
, max(max_data_id) as max_data_id
, sum(row_count) as row_count
, sum(on_disk_size) as on_disk_size
, count(*) as number_segments
, max(magnitude) as magnitude
from sys.column_store_segments css
outer apply (select top 1 ix.name as indexname
, object_name(pa.object_id) as tablename
from sys.partitions pa
inner join sys.indexes ix
on ix.index_id = pa.index_id
and ix.object_id = pa.object_id
where pa.partition_id = css.partition_id
) pa
group by pa.tablename
, pa.indexname
, css.column_id
select s.Name as schemaname
, t.Name as tablename
, p.rows as rowcounts
, cast(round((sum(a.used_pages) / 128.00), 2) as numeric(36, 2)) as used_mb
, cast(round((sum(a.total_pages) - sum(a.used_pages)) / 128.00, 2) as numeric(36, 2)) as unused_mb
, cast(round((sum(a.total_pages) / 128.00), 2) as numeric(36, 2)) as total_mb
from sys.tables t
inner join sys.indexes i
on t.object_id = i.object_id
inner join sys.partitions p
on i.object_id = p.object_id
and i.index_id = p.index_id
inner join sys.allocation_units a
on p.partition_id = a.container_id
inner join sys.schemas s
on t.schema_id = s.schema_id
group by t.Name
, s.Name
, p.Rows
order by s.Name
, t.Name
Output of the above queries:
tablename | indexname | column_id | min_data_id | max_data_id | row_count | on_disk_size | number_segments | magnitude |
---|---|---|---|---|---|---|---|---|
test2 | cci_test2 | 1 | 693595 | 695421 | 157766400 | 114032 | 151 | -1 |
test2 | cci_test2 | 2 | 0 | 863990000000 | 157766400 | 420799808 | 151 | 10000000 |
test1 | cci_test1 | 1 | 300 | 7842610282496 | 157766400 | 1262222224 | 151 | 1 |
schemaname | tablename | rowcounts | used_mb | unused_mb | total_mb |
---|---|---|---|---|---|
dbo | test1 | 157766400 | 1231.56 | 2.20 | 1233.77 |
dbo | test2 | 157766400 | 411.91 | 3.42 | 415.33 |
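Reading the segment output: the date column of test2 compresses to roughly 0.1 MB (it holds fewer than 2,000 distinct values), the time column to roughly 400 MB, while the single datetime column of test1 takes roughly 1.2 GB - about three times the size of the split version, as the table size report confirms.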
December 27, 2019 at 2:25 am
Sergiy wrote: Steve Jones - SSC Editor wrote: Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few distinct values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.
Can you please explain to me how that is different from (and more effective than) a look-up table holding unique values and tokenized with a Primary Key on its Identity column?
Is that a serious question? We can trust that, because the SQL engine handles and maintains the dictionaries at a low level, it will be both transparent to the end users (in this case DBAs and developers) and deliver better performance.
That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?
Are you trying to dupe me into some religious cult?
Sorry, I don't do cults.
_____________
Code for TallyGenerator
December 27, 2019 at 2:39 am
Small test case - 5 years' worth of data - all days * all seconds of the day.
Thanks Frederico. Cool stuff.
I only made one change to your script:
create clustered index cci_test1 on test1 (datefull)
as we're trying to compare use cases of columnstore vs. b-tree indexes.
I hope you don't mind.
Now, let's say we need to find some event(s) which happened on a Christmas night in a year - say, 1900.
For Test1 I've got no question:
select * from Test1
where datefull >= '19001224 19:00' and datefull < '19001225 07:00'
How do you write a query returning the same set for Test2?
_____________
Code for TallyGenerator
December 27, 2019 at 3:12 am
ZZartin wrote: Sergiy wrote: Steve Jones - SSC Editor wrote: Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few distinct values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.
Can you please explain to me how that is different from (and more effective than) a look-up table holding unique values and tokenized with a Primary Key on its Identity column?
Is that a serious question? We can trust that, because the SQL engine handles and maintains the dictionaries at a low level, it will be both transparent to the end users (in this case DBAs and developers) and deliver better performance.
That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?
Are you trying to dupe me into some religious cult?
Sorry, I don't do cults.
/shrug I think the fact that dictionaries are maintained transparently in the internal guts of columnstore indexes, versus having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.
December 27, 2019 at 7:52 am
/shrug I think the fact that dictionaries are maintained transparently in the internal guts of columnstore indexes, versus having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.
Is doing less typing your only concern when you choose a database design approach?
_____________
Code for TallyGenerator
December 27, 2019 at 11:02 am
Frederico, do you have any test case to support your speculations?
I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.
The best place to learn about ColumnStore is from Niko Neugebauer. Niko has examples that show how the compression works.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 27, 2019 at 12:11 pm
Again, some basics - a few options to do it, certainly not exhaustive. If anyone has a "faster" method, please supply it and I can test it on the same machine to compare performance.
Server is an Intel i7 with 4 GB allocated to SQL Server - 4 cores/8 HT - MAXDOP 4, cost threshold for parallelism 50.
Locally attached SSD drive.
I've also created two new tables - same columns and data as the original two, but these were created with a normal clustered index with page compression.
Table sizes below:
schemaname | tablename | rowcounts | used_mb | unused_mb | total_mb |
---|---|---|---|---|---|
dbo | test1 | 157766400 | 1231.56 | 2.20 | 1233.77 |
dbo | test1_rowstore | 157766400 | 1678.48 | 0.32 | 1678.80 |
dbo | test2 | 157766400 | 411.91 | 3.42 | 415.33 |
dbo | test2_rowstore | 157766400 | 1683.83 | 0.22 | 1684.05 |
set statistics io on
set statistics time on
set nocount on
print '-- test1 - datefull'
select *
from Test1
where datefull >= '19001224 19:00'
and datefull < '19001225 07:00'
-- this is the method I would use
print '-- test2 - method 1 - dateadd'
select *
from Test2
where dateonly >= '19001224'
and dateonly <= '19001225'
and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) >= '19001224 19:00')
and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) < '19001225 07:00')
-- we could also do it like this, but it's not an option I would pick unless absolutely required for performance
print '-- test2 - method 2 - union all'
select *
from Test2
where dateonly > '19001224'
and dateonly < '19001225' -- full days strictly between the endpoints (none in this example)
union all
select *
from Test2
where (dateonly = '19001224' and timeonly >= '19:00')
union all
select *
from Test2
where (dateonly = '19001225' and timeonly < '07:00')
print '-- test2 - method 3 - or condition'
-- the following method would most often be the one used by developers - not the best option, as predicates aren't pushed down to the VertiPaq engine the same way,
-- resulting in more reads
select *
from Test2
where dateonly > '19001224'
and dateonly < '19001225'
or (dateonly = '19001224' and timeonly >= '19:00')
or (dateonly = '19001225' and timeonly < '07:00')
print '-- rowstore test1'
select *
from Test1_rowstore
where datefull >= '19001224 19:00'
and datefull < '19001225 07:00'
-- this is the method I would use
print '-- rowstore test2 - method 1 - dateadd'
select *
from Test2_rowstore
where dateonly >= '19001224'
and dateonly <= '19001225'
and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) >= '19001224 19:00')
and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) < '19001225 07:00')
Timings - warm cache first:
-- test1 - datefull
Table 'test1'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6291, lob physical reads 0, lob read-ahead reads 0.
Table 'test1'. Segment reads 3, segment skipped 148.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 222 ms.
-- test2 - method 1 - dateadd
Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 2856, lob physical reads 0, lob read-ahead reads 0.
Table 'test2'. Segment reads 4, segment skipped 147.
SQL Server Execution Times: CPU time = 47 ms, elapsed time = 324 ms.
-- test2 - method 2 - union all
Table 'test2'. Scan count 12, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 8562, lob physical reads 0, lob read-ahead reads 0.
Table 'test2'. Segment reads 12, segment skipped 441.
SQL Server Execution Times: CPU time = 16 ms, elapsed time = 242 ms.
-- test2 - method 3 - or condition
Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 106995, lob physical reads 0, lob read-ahead reads 0.
Table 'test2'. Segment reads 151, segment skipped 0.
SQL Server Execution Times: CPU time = 922 ms, elapsed time = 317 ms.
-- rowstore test1
Table 'test1_rowstore'. Scan count 1, logical reads 64, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 232 ms.
-- rowstore test2 - method 1 - dateadd
Table 'test2_rowstore'. Scan count 1, logical reads 240, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 62 ms, elapsed time = 219 ms.
Now the cold timings - after an instance restart:
-- test1 - datefull
Table 'test1'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6291, lob physical reads 0, lob read-ahead reads 19317.
Table 'test1'. Segment reads 3, segment skipped 148.
SQL Server Execution Times: CPU time = 46 ms, elapsed time = 218 ms.
-- test2 - method 1 - dateadd
Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 2856, lob physical reads 0, lob read-ahead reads 10040.
Table 'test2'. Segment reads 4, segment skipped 147.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 253 ms.
-- test2 - method 2 - union all
Table 'test2'. Scan count 12, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 8565, lob physical reads 0, lob read-ahead reads 0.
Table 'test2'. Segment reads 12, segment skipped 441.
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 299 ms.
-- test2 - method 3 - or condition
Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 106995, lob physical reads 0, lob read-ahead reads 367381.
Table 'test2'. Segment reads 151, segment skipped 0.
SQL Server Execution Times: CPU time = 1234 ms, elapsed time = 579 ms.
-- rowstore test1
Table 'test1_rowstore'. Scan count 1, logical reads 64, physical reads 3, read-ahead reads 60, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 32 ms, elapsed time = 191 ms.
-- rowstore test2 - method 1 - dateadd
Table 'test2_rowstore'. Scan count 1, logical reads 240, physical reads 3, read-ahead reads 236, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 47 ms, elapsed time = 269 ms.
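A quick read of those stats: on test2, method 1 (dateadd) reads 2,856 LOB pages and skips 147 of 151 segments, while method 3 (the OR form) reads 106,995 LOB pages and scans all 151 segments - exactly the predicate-pushdown difference called out in the script comments.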
December 27, 2019 at 1:57 pm
ZZartin wrote: /shrug I think the fact that dictionaries are maintained transparently in the internal guts of columnstore indexes, versus having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.
Is doing less typing your only concern when you choose a database design approach?
Not needlessly reinventing the wheel is a concern. Especially when the proposed alternative is clunkier and harder to deal with than just letting a columnstore work as intended.
December 27, 2019 at 3:23 pm
Sergiy wrote: Frederico, do you have any test case to support your speculations?
I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.
The best place to learn about ColumnStore is from Niko Neugebauer. Niko has examples that show how the compression works.
Lordy. I've obviously not read all of that (129 items currently in his top level TOC) but the man does appear to have covered everything that one could ask. Thanks for the link, Grant. Bookmarked for sure!
--Jeff Moden
Change is inevitable... Change for the better is not.
December 27, 2019 at 3:27 pm
I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.
That's very true. And, sorry to say, I've never done a Yabingooducklehoo to see if such a thing exists.
--Jeff Moden
Change is inevitable... Change for the better is not.