May 29, 2013 at 1:46 am
Background:
I have a big table, about 180 GB in size with 4 billion rows so far, on SQL Server 2008 R2 64-bit, and it is growing significantly every day.
So far, we have partitioned the source DB table by ColumnA. We have three distinct counts, on ColumnA, ColumnB and ColumnC, so we have three measure groups (as some SQL Server documentation requires), for example MG_A, MG_B and MG_C, one for each distinct count, and we want to partition MG_A, MG_B and MG_C. The final goal is to be able to process each measure group's partitions incrementally.
Goal:
Better performance when building the cube with these three distinct counts on the same source DB table.
Question 1:
For MG_A, ColumnA is fine: the source DB table and MG_A are both partitioned by ColumnA. But what about MG_B and MG_C?
If the source DB table is partitioned by ColumnA and MG_B is partitioned by ColumnB (as some docs recommend), then the two sets of partitions are not aligned one to one (and some docs say alignment is required). Does this have any impact on performance?
Question 2:
Any ideas for improving cube-building performance?
Since all three distinct counts are based on the same source DB table, is there any chance to improve the performance of the cube build? Otherwise it is three times the work, just as if the three distinct counts were on three separate DB tables.
Question 3:
If we partition a measure group NOT by its distinct count column, for example partitioning MG_B by ColumnA instead of ColumnB, is this a correct solution? In other words, is it mandatory to partition MG_B by ColumnB?
Question 4:
If we partition a measure group by its distinct count column, is incremental-only processing possible? For example, can we build just the latest partition? Otherwise we have to rebuild all the measure groups from scratch, which is too slow.
I ask because I saw some docs saying that for the distinct count aggregation function, you have to rebuild all partitions.
I know there are a lot of questions here, but each one depends on another.
All in all, any suggestions about distinct count and incremental cube building?
Thanks.
May 30, 2013 at 2:48 pm
Any way to design the underlying tables so you don't need 3 distinct count measures?
1. MG_B and MG_C can be partitioned on ColumnB and ColumnC even if the table is partitioned on ColumnA. In this case, you will face some performance issues, since the queries for MG_B and MG_C will have to pull data across multiple table partitions. The difference in processing performance between MG_A and MG_B will be close to the performance difference between running a SELECT DISTINCT query on ColumnA and one on ColumnB. SSAS works by first querying the data, then aggregating it. Aggregation time for MG_A and MG_B would be fairly consistent, but the query time will change.
2. Make sure each partition query is optimized so you are not selecting fields that are not used. The MG_A query should select only ColumnA and the dimension keys, the MG_B query only ColumnB and the dimension keys, and so on.
3. If you want to partition a distinct count, it must be on the distinct column. The reason is that the distinct counts will be summed across partitions. If you partition MG_B on ColumnA and a value for ColumnB exists in more than one partition, you will be double counting that value.
Ex: ColumnA: 1, ColumnB: 2
ColumnA: 2, ColumnB: 2
ColumnA: 2, ColumnB: 3
Let's assume you have two partitions based on ColumnA. Partition1 is where ColumnA = 1. Partition2 is where ColumnA = 2.
The distinct count of ColumnB in Partition1 is one. The distinct count of ColumnB in Partition2 is two. In MG_B, when you analyze the data, you will end up with a distinct count of three (one from Partition1 and two from Partition2), even though ColumnB has only two distinct values overall (2 and 3).
4. You wouldn't be able to build just the latest partition. Once you have the measure groups partitioned, you only need to reprocess the partitions that are affected by updates in your data. In my SSAS solutions, I always create a SQL table that maintains all the partition definitions for all measure groups. During the data processing of my DW, I'll flag which partitions are affected and then have a process run through the table to process those specific partitions.
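The double counting described in point 3 can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming (as stated above) that per-partition distinct counts are summed; it does not model SSAS internals, and the function name `summed_distinct_count` is made up for illustration.

```python
from collections import defaultdict

# The three example rows from point 3.
rows = [
    {"ColumnA": 1, "ColumnB": 2},
    {"ColumnA": 2, "ColumnB": 2},
    {"ColumnA": 2, "ColumnB": 3},
]

def summed_distinct_count(rows, partition_key, counted_column):
    """Split rows into partitions by partition_key, take the distinct count
    of counted_column inside each partition, then sum those counts."""
    partitions = defaultdict(set)
    for row in rows:
        partitions[row[partition_key]].add(row[counted_column])
    return sum(len(values) for values in partitions.values())

# Distinct count of ColumnB over the whole table.
true_count = len({row["ColumnB"] for row in rows})

# MG_B partitioned on ColumnA: the value 2 lands in both partitions.
count_by_a = summed_distinct_count(rows, "ColumnA", "ColumnB")

# MG_B partitioned on ColumnB: each value lives in exactly one partition.
count_by_b = summed_distinct_count(rows, "ColumnB", "ColumnB")

print(true_count, count_by_a, count_by_b)
```

Partitioning on ColumnB keeps each ColumnB value inside a single partition, so the summed result matches the true distinct count, while partitioning on ColumnA counts the shared value once per partition.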
Efficiently processing large SSAS solutions is, in my opinion, a bit complicated, and MS has not provided a great way to manage it yet. I've only used MOLAP in my SSAS solutions, and I create custom processing methods to make it efficient.
I start by creating tables to maintain SSAS dimensions and SSAS measure group and partitions. As my DW is being processed, it'll flag which dimensions need to be updated and which partitions need reprocessing. The latest version also creates new records in the partition table when a new partition needs to be created. I have an SSAS process task that runs after the DW and it checks these tables to build out the XMLA to only update the dimensions that have changes. The SSAS process also creates/alters partitions and processes the ones that need reprocessing. All of this is built using SQL and SSIS. If I get enough requests, I'll do a full write up on this.
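As a rough illustration of the idea, here is a minimal Python sketch that turns flagged rows from a partition-tracking table into an XMLA batch. The reply builds this with SQL and SSIS, so Python is a stand-in here, and the row layout, the `needs_processing` flag, the function name `build_process_batch`, and all object IDs are hypothetical. Only the XMLA element names (`Batch`, `Parallel`, `Process`, `Object`, `Type`) follow the standard Analysis Services processing command.

```python
import xml.etree.ElementTree as ET

XMLA_NS = "http://schemas.microsoft.com/analysisservices/2003/engine"

# Hypothetical rows, as they might be read from a partition-tracking table.
partitions = [
    {"measure_group": "MG_A", "partition": "MG_A_2013_05", "needs_processing": True},
    {"measure_group": "MG_A", "partition": "MG_A_2013_04", "needs_processing": False},
    {"measure_group": "MG_B", "partition": "MG_B_range_03", "needs_processing": True},
]

def build_process_batch(rows, database_id, cube_id, process_type="ProcessFull"):
    """Return an XMLA batch that processes only the flagged partitions."""
    # Setting xmlns as a plain attribute keeps the child elements unprefixed.
    batch = ET.Element("Batch", {"xmlns": XMLA_NS})
    parallel = ET.SubElement(batch, "Parallel")
    for row in rows:
        if not row["needs_processing"]:
            continue  # untouched partitions are left as they are
        process = ET.SubElement(parallel, "Process")
        obj = ET.SubElement(process, "Object")
        ET.SubElement(obj, "DatabaseID").text = database_id
        ET.SubElement(obj, "CubeID").text = cube_id
        ET.SubElement(obj, "MeasureGroupID").text = row["measure_group"]
        ET.SubElement(obj, "PartitionID").text = row["partition"]
        ET.SubElement(process, "Type").text = process_type
    return ET.tostring(batch, encoding="unicode")

# "SalesDW" and "Sales" are placeholder database and cube IDs.
xmla = build_process_batch(partitions, "SalesDW", "Sales")
print(xmla)
```

In a real solution the rows would come from the tracking table, the flags would be reset after a successful run, and the generated XMLA would be handed to whatever executes it (for example an SSIS Analysis Services task).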