March 2, 2017 at 11:18 am
Greetings all,
I am trying to devise the best solution for a scenario I have at hand.
I want to take a current set of processes and data, reduce it into a dimensional model for ease of use, refine the overall business processes, and feed the result into some Tableau reports.
The core data set for this is multi-faceted; I will see if I can explain better :). The customer is doing population-based statistical analytics. The data source that contains the records they are analyzing is parsed into population categories; we will just call them pop A, pop B, and pop C for clarity. Each record within the source can be part of any number of populations, and more populations will be defined in the future.
Each population has a very specific set of business rules and parameters, so collapsing the populations into a single shared definition is not possible.
I am faced with two possibilities: do something similar to their current process and create one large fact table that is essentially a union of all the population types, keyed on source event / population type; or create a separate fact table for each population type.
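To make the first option concrete, here is a rough sketch of the shape I am picturing (table and column names here are just placeholders, not our real schema):

-- One combined fact table; PopulationKey points at a small
-- population dimension that carries each population's rules/metadata.
CREATE TABLE dbo.DimPopulation (
    PopulationKey   INT IDENTITY(1,1) PRIMARY KEY,
    PopulationName  VARCHAR(50)  NOT NULL,  -- e.g. 'Pop A', 'Pop B'
    RuleDescription VARCHAR(500) NULL       -- business rules/parameters
);

CREATE TABLE dbo.FactPopulationEvent (
    SourceEventKey BIGINT NOT NULL,         -- key back to the source event
    PopulationKey  INT    NOT NULL
        REFERENCES dbo.DimPopulation (PopulationKey),
    DateKey        INT    NOT NULL,         -- shared date dimension
    MetricValue    DECIMAL(18,4) NULL,      -- whatever measures apply
    CONSTRAINT PK_FactPopulationEvent
        PRIMARY KEY (SourceEventKey, PopulationKey)
);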
The second choice would create upward of 100 fact tables with shared dimensions. They do, however, have combined metrics for general population statistics that span all populations, so the second idea seems less than ideal.
Am I missing a solution here, or is there a better way to approach this?
My concern with one large fact table is that the only way to load it is to either create one large union statement to handle inserts/updates, or create one load per population. I fear this would not be easily maintainable going forward as populations are added.
Any ideas on this are much appreciated! Either a different solution/approach altogether, or a way to handle the iterative approach dynamically without creating a maintenance nightmare!
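To show what I mean by handling it dynamically, something like this metadata-driven loop is the direction I am picturing; the config table and all the names here are made up for illustration:

-- Assumes a config table holding one source query per population,
-- so adding a population means adding a row, not editing the load.
DECLARE @PopulationKey INT, @SourceQuery NVARCHAR(MAX), @Sql NVARCHAR(MAX);

DECLARE pop_cursor CURSOR FAST_FORWARD FOR
    SELECT PopulationKey, SourceQuery
    FROM dbo.PopulationLoadConfig
    WHERE IsActive = 1;

OPEN pop_cursor;
FETCH NEXT FROM pop_cursor INTO @PopulationKey, @SourceQuery;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Each config row supplies the SELECT that produces that
    -- population's rows; the wrapper stamps on the PopulationKey.
    SET @Sql = N'INSERT INTO dbo.FactPopulationEvent
                     (SourceEventKey, PopulationKey, DateKey, MetricValue)
                 SELECT SourceEventKey, '
             + CAST(@PopulationKey AS NVARCHAR(10))
             + N', DateKey, MetricValue FROM (' + @SourceQuery + N') AS src;';
    EXEC sp_executesql @Sql;

    FETCH NEXT FROM pop_cursor INTO @PopulationKey, @SourceQuery;
END

CLOSE pop_cursor;
DEALLOCATE pop_cursor;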
Link to my blog http://notyelf.com/
March 7, 2017 at 6:19 am
Can you give some samples of what the data looks like across a few populations? Also, how many records on average live in each population? And how will the data be read/queried for analysis? For example, will the end user query one population at a time, or query multiple populations where that union is needed?
If they all share the same dimensions and have similar metrics, then it makes sense to keep them together rather than trying to span them across 100 or so fact tables with either separate or shared dimensions. I only say that because of how many facts you would have to manage here. I don't think it's a problem to have, say, two facts, one for conversions and one for non-conversions, that share the same dimensions. But you are talking about 100 tables here.
So, how large are these facts that you want to split them? I have fact tables, for example, that hold hundreds of millions of records across different partitions of my data. Sure, I could split those out into many tables based on that partition, but report queries read across those partitions with no filters on them. The main filter is normally time, which is what the data is clustered/partitioned on the most.
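As a rough illustration of what I mean by clustering/partitioning on time (a simplified, hypothetical monthly setup, not your actual schema):

-- Monthly partition function on an integer date key.
CREATE PARTITION FUNCTION pfMonthly (INT)
    AS RANGE RIGHT FOR VALUES (20170101, 20170201, 20170301);

CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- One fact table spread across the monthly partitions.
CREATE TABLE dbo.FactEvents (
    DateKey     INT    NOT NULL,
    EventKey    BIGINT NOT NULL,
    MetricValue DECIMAL(18,4) NULL
) ON psMonthly (DateKey);

-- Cluster on the partitioning column so time-filtered report
-- queries get partition elimination instead of scanning everything.
CREATE CLUSTERED INDEX CIX_FactEvents_DateKey
    ON dbo.FactEvents (DateKey) ON psMonthly (DateKey);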