December 25, 2019 at 4:54 am
A columnstore index is a workaround for denormalised data storage.
Follow basic data normalisation rules and there won't be any place for columnstore indexes in your database.
Not to mention, you'll eliminate the latency that comes with columnstore index refreshes.
_____________
Code for TallyGenerator
December 25, 2019 at 5:42 am
Grant Fritchey wrote: Second, aggregates. Analysis. These are where they shine. If you're not seeing GROUP BY in the queries, I wouldn't suggest using them.
So, not good for general history/audit tables (which are usually the largest tables in any database both for rowcounts and for pagecount)?
{EDIT} And, I meant that as a question, not a statement.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 25, 2019 at 11:32 am
I would still consider them good for history/audit tables as these aren't queried often - saving on space is what would really be important here, not speed of processing the data.
December 25, 2019 at 12:37 pm
Grant Fritchey wrote: Second, aggregates. Analysis. These are where they shine. If you're not seeing GROUP BY in the queries, I wouldn't suggest using them.
So, not good for general history/audit tables (which are usually the largest tables in any database both for rowcounts and for pagecount).
Yeah, it's about the access patterns, not the size of the storage. If you primarily see point lookups or very limited range scans, columnstore will suffer badly. Broader scans and aggregates, especially aggregates, are where it shines.
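To make that concrete, here's a minimal sketch of the two access patterns (the table, index, and column names are made up purely for illustration):

-- Hypothetical sales history table with a rowstore clustered PK.
CREATE TABLE dbo.SalesHistory
(
    SaleID   BIGINT IDENTITY PRIMARY KEY,
    SaleDate DATE          NOT NULL,
    RegionID INT           NOT NULL,
    Amount   DECIMAL(19,4) NOT NULL
);

-- Nonclustered columnstore index over the columns we aggregate.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SalesHistory
    ON dbo.SalesHistory (SaleDate, RegionID, Amount);

-- Broad scan plus aggregate: this is where columnstore shines.
SELECT RegionID, SUM(Amount) AS TotalAmount
FROM dbo.SalesHistory
GROUP BY RegionID;

-- Point lookup: the rowstore clustered index is the better path.
SELECT SaleDate, Amount
FROM dbo.SalesHistory
WHERE SaleID = 42;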
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 25, 2019 at 12:39 pm
A columnstore index is a workaround for denormalised data storage.
Follow basic data normalisation rules and there won't be any place for columnstore indexes in your database.
Not to mention, you'll eliminate the latency that comes with columnstore index refreshes.
I really can't agree here. It's a completely different kind of data storage and access. It's not the same as denormalization at all. It's certainly not a workaround for it. Instead, it's an access pattern and data storage pattern that is completely different from the row store. It lends itself to analysis and aggregation style queries. While it's certainly not the same as your average normalized data structure, neither is a good star schema; yet that's not denormalized or a workaround either. It's a different pattern solving a different problem.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 25, 2019 at 12:40 pm
I would still consider them good for history/audit tables as these aren't queried often - saving on space is what would really be important here, not speed of processing the data.
No arguments... if they're not queried often.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 25, 2019 at 3:13 pm
I would still consider them good for history/audit tables as these aren't queried often - saving on space is what would really be important here, not speed of processing the data.
Saving on space happens only if you have repeating data in the storage.
Which is a violation of normalisation rules.
Which proves my point.
Parsing inbound data and storing it in a proper relational manner, instead of dumping it into a "whatever" column, would: a) save on space, b) speed up querying.
A columnstore index is just an attempt to do that parsing (which developers have been too lazy to do) behind the scenes, trying to automatically figure out the entities and aggregate the data by those entities. That approach is prone to errors and less effective than proper inbound data processing.
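For what it's worth, the hand-rolled version of that parsing is just a lookup table with a surrogate key. A minimal sketch, with purely illustrative names:

-- Repeating values moved into a narrow lookup table...
CREATE TABLE dbo.Gender
(
    GenderID   TINYINT IDENTITY PRIMARY KEY,
    GenderName VARCHAR(10) NOT NULL UNIQUE
);

-- ...and referenced by key from the wide table, so the repeating
-- text is stored exactly once.
CREATE TABLE dbo.Person
(
    PersonID INT IDENTITY PRIMARY KEY,
    GenderID TINYINT NOT NULL
        REFERENCES dbo.Gender (GenderID)
);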
_____________
Code for TallyGenerator
December 25, 2019 at 3:58 pm
Saving on space happens only if you have repeating data in the storage.
Which is a violation of normalisation rules.
You've just described the very nature of history/audit tables. They are the epitome of highly denormalized tables. This is especially true with "column based" history tables, where only the columns that have changed are recorded, with one row per "cell" of change.
As you correctly stated a long time ago (and I tend to live by it when it comes to SQL Server), "A Developer must not guess... a Developer must KNOW", and there's only one real way to know... I've got to set up a demonstrable, repeatable test for the history/audit tables that I'm considering using Column Store on. I certainly have the data to pull such a test off.
My biggest problem (no pun intended) actually is one of those column-based history tables. It's huge. But even if Column Store were the perfectly logical choice, it's one of the worst physical choices for me because it contains two SQL_VARIANT columns, which queer the use of Column Store on the table.
I'd rather fix the tables, though that's not likely to ever happen.
On the other stuff, and having ridiculously wide denormalized legacy tables that everyone is afraid to normalize, I certainly agree that Column Store could provide some relief from the sins created by previous lazy or uninformed developers. To add insult to injury for space used, they (the original developers, not the current ones) also used indexes as a substitute for narrower, normalized tables.
As a bit of a sidebar and thinking out loud, those two columns are the "Old_Value" and "New_Value" columns (and the "Old_Value" column being present is something I also fought against and lost on, even though its presence is both outright stupid and a total waste of space). I told you that because converting them to a common allowable datatype, such as VARCHAR(), might cause the tables to become even larger. I need to test that, and then test whether Column Store would actually make up for it, and see what happens for performance when the queries that run against it occur.
Or, I could say "screw it" and leave it as a Page Compressed table, which actually did already help a ton.
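If anyone wants to reproduce that kind of size test, a starting point might be the built-in estimator. A sketch with a hypothetical table name (and note the COLUMNSTORE options require SQL Server 2019):

-- Estimate what columnstore compression would buy on the history table.
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'AuditHistory',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'COLUMNSTORE';

-- Compare against what PAGE compression already delivers.
EXEC sys.sp_estimate_data_compression_savings
    'dbo', 'AuditHistory', NULL, NULL, 'PAGE';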
Hey, Grant... that might be a part of why people don't adopt Column Store... once they dig into the limits and usage (even with the recent needed improvements), they might just flat out think it's not worth it. Do you know of any good articles that contain a good comparison for history/audit or other tables?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 25, 2019 at 5:40 pm
A columnstore index is a workaround for denormalised data storage.
Follow basic data normalisation rules and there won't be any place for columnstore indexes in your database.
Not to mention, you'll eliminate the latency that comes with columnstore index refreshes.
Not quite. Look at a DW and a fact table: it will be "normalized", e.g. one fact to many dimensions plus a few "value" columns. Not only will the dimension keys be highly repetitive, but the "values" will also tend to repeat themselves, making it a very good target for columnstore, since its usage will also be based mainly on aggregate operations.
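Something like this hypothetical fact table, for example (names are illustrative only):

-- Illustrative star-schema fact table: repetitive dimension keys
-- plus a few measures, queried almost entirely by aggregates.
CREATE TABLE dbo.FactSales
(
    DateKey     INT           NOT NULL,
    ProductKey  INT           NOT NULL,
    StoreKey    INT           NOT NULL,
    Quantity    INT           NOT NULL,
    SalesAmount DECIMAL(19,4) NOT NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;

-- Typical usage: scan and aggregate over the repetitive keys.
SELECT DateKey, ProductKey, SUM(SalesAmount) AS SalesAmount
FROM dbo.FactSales
GROUP BY DateKey, ProductKey;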
December 25, 2019 at 6:42 pm
I just had a sudden flashback to the two threads on "bankers rounding". 😀 I'll get the popcorn. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
December 25, 2019 at 10:02 pm
Frederico, can you please describe the mechanism a columnstore index uses to compress the data?
_____________
Code for TallyGenerator
December 26, 2019 at 10:50 am
Frederico, can you please describe the mechanism a columnstore index uses to compress the data?
For archival compression it uses Microsoft XPRESS.
For columnstore compression the engine uses the VertiPaq/xVelocity algorithm.
http://rusanu.com/2012/05/29/inside-the-sql-server-2012-columnstore-indexes/
https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3
Although mainly based on SQL 2012, this is quite an interesting paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.479.8436&rep=rep1&type=pdf
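As a sketch of the two levels in practice (reusing the illustrative fact table and index names from above), you can move an index between them with a rebuild:

-- COLUMNSTORE_ARCHIVE layers XPRESS on top of the normal
-- columnstore compression: smaller on disk, slower to decompress.
ALTER INDEX CCI_FactSales ON dbo.FactSales
    REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- Back to standard columnstore compression for hot data.
ALTER INDEX CCI_FactSales ON dbo.FactSales
    REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE);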
December 26, 2019 at 2:18 pm
Can you please put it in your own words - how do you see it?
_____________
Code for TallyGenerator
December 26, 2019 at 2:54 pm
Columnstore works by tokenizing values and limiting their repetition. A simple example is gender. Say we have 3 values (M, F, Other). If these were stored as single bytes, a columnstore index would only need to store a handful of distinct values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly. So my reads might be
1, 1, 1, 1, 0, 0, 2, 1, 0, 1, 0, 1
instead of a read that might be
{2342, Steve, 1, Jones}, {5323, Andy, 1, Warren}, etc.
Even with an index, I might be reading:
{2342, 1}, {5323, 1}, {5543, 1}, {5532, 1}, {6454, 0}, etc.
This can dramatically increase the performance of reads during queries. This isn't denormalized, but rather a restructuring of data into columns rather than rows. We often query for aggregates on columns, so putting those together just makes the read faster. This is indexing on repeated data in columns, which often isn't a part of any key structure.
Of course, there is the penalty of maintaining this index, but that's on writes/updates, which are often far fewer than reads. Even in OLTP systems, we often read more than write. In BI systems, or in reading back audits, it's a greater discrepancy. This is a trade of computational time during updates for less computational time during reads, using additional storage. We make trades like this all the time in computer science, especially in the BI space where we are trying to pre-compute or pre-organize data to make read queries faster.
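You can actually see that tokenization in the catalog views once a columnstore index exists. A sketch, assuming a hypothetical dbo.People table that already carries a columnstore index:

-- Dictionaries: the distinct "tokens" kept per column.
SELECT d.column_id, d.dictionary_id, d.entry_count, d.on_disk_size
FROM sys.column_store_dictionaries AS d
JOIN sys.partitions AS p ON p.hobt_id = d.hobt_id
WHERE p.object_id = OBJECT_ID('dbo.People');

-- Segments: encoding type and size per roughly-1M-row chunk of each column.
SELECT s.column_id, s.segment_id, s.encoding_type, s.row_count, s.on_disk_size
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p ON p.hobt_id = s.hobt_id
WHERE p.object_id = OBJECT_ID('dbo.People');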
December 26, 2019 at 4:09 pm
Columnstore works by tokenizing values and limiting their repetition. A simple example is gender. Say we have 3 values (M, F, Other). If these were stored as single bytes, a columnstore index would only need to store a handful of distinct values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.
Can you please explain to me how this is different from (and more effective than) a look-up table holding unique values, tokenized with a Primary Key on its Identity column?
_____________
Code for TallyGenerator