April 1, 2019 at 12:46 pm
Hi,
I wanted to get some views on something. We have a data load process that leaves data into a reporting table, the table is wide (about 300 columns) and it comes from an external source and changing it is not an option due to regulatory requirments.
At present its a few thousand rows but its expected to grow up to about 5 million or so over the next few years.
It gets more interesting when its how we plan to use, it, it'll have 2 main usages, the first will be PowerBI importing the data, all of it. The second will be more report specific queries that we probably write to get one or a handful of sets of data.
This got me think about if its a good candidate for a column store index so I whacked one on it and the size of my data decreased from 400MB to 30MB, so definitely space being saved and my queries to select all the data back are using the column store index but the actual execution time remains at 4 seconds.
So my question is, is it worth me doing the CSI? I'm leaning to yes as they save space and are being used in the execution plan for when I want everything but when I say want a specific set of data the IO cost is much less is I just create a clustered index on that column and use that.
Any pointers or words of advice, as I say I'm leaning towards using the CSI as about 80% of our workload will be PowerBI saying give me everything.
What do you think?
Thanks,
Nic
April 2, 2019 at 9:40 am
So, my current thought processes around this are simple (most of my thought processes are simple). Is the primary use of the data aggregations & large sets or is it point lookups & small sets? If the latter, traditional indexing. If the former, clustered columnstore is the way to go. It's a night and day difference. If you also have to support some point lookups, you can always add a nonclustered index to the clustered columnstore.
I don't see whether your used a clustered columnstore or a nonclustered on a traditional table. Based on what you've said about the system and the queries, I'd lean heavily towards the clustered columnstore. The big issues there are ensuring some regular maintenance on the index. As you make changes to the data, it won't be automatically stored in the pivoted, compressed, columnstore. Instead it goes initially to a delta store which is just a b-tree. The best performance is when all the data is in the columnstore and little to no data is in the b-tree, so rebuilding the indexes is a bit more important than it is with traditional stuff. Also, keep the statistics up to date. They still matter.
Other than that, sounds like you're testing appropriately and evaluating for the right measures.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 3, 2019 at 6:01 am
Hi Grant,
Thanks so much for the reply. Its good to know I'm on the right road.
The tricky bit is not knowing how the data will be used in more detail, based on what I know now I'm going to go with the clustered column store and then I'll add non clustered indexes as and when they are needed and we get more details on how the business intend to use the data past what we know now.
Thanks for the reminder on the delta store as well, to be honest I had forgotten all about that, an index rebuild on the affect objects after the load is done should suffice.
Many thanks,
Nic
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply