July 30, 2018 at 10:05 am
Hey Everyone!
I am new here, and have a question about "best practice" data warehouse design.
I am facing a requirement to keep a table of Sales Invoices and their associated Phase. My Sales fact table has 600,000 rows of unique Invoices, and I have a dimension table for Phases that has 98 total possibilities.
It is possible that a single Invoice could go through every single phase. It is also possible that a single invoice could only hit 30 phases.
I have done some testing and realize that, if each Invoice has an average phase count of 45, my new fact table will contain 27M rows, and will only continue to grow. I am processing an OLAP cube for reporting. The report requirements are such that I will need to end up with a representation of Invoice Number, Phase Name, PhaseDate. Invoices only ever move forward through Phases.
My table structure is:
CREATE TABLE factSales
(
InvoiceId int identity primary key not null
InvoiceNumber int not null
PhaseID int not null
PhaseDate datetime not null
)
CREATE TABLE dimPhase
(
PhaseId int identity primary key not null
PhaseName varchar(255) not null
)
My question is this:
Any thoughts on this are welcome! Thanks!
July 30, 2018 at 10:18 am
A little difficult to know what target we're aiming for here, as you talk about DW design best practice, but then talk about a report for which we have zero detail on what it's going to contain. Data warehouses are often "de-normalized" to avoid adding things that might be able to be categorized as "unnecessary complicating dimensions." There's no value in a 27 million row dimension that doesn't provide some kind of statistically significant value, and adding the phase name to the sales fact table that has the invoice data in it sounds like a reasonable idea as long as there's no longer-term value in having the phase be a dimension unto itself. That's not to say you couldn't do both, but only use the phase dimension when you really need it. As with so many things SQL Server, the answer is almost always, "it depends".
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 30, 2018 at 10:27 am
Sorry, I was going for brevity and probably needed to add more context.
Ultimately, the report will need to display the following:
My main concern is table size on the disk, and how quickly that will grow throughout the year. The fact table will be 27 million rows, the dimension is only 98 rows.
Would it be better to write a view to join the fact and dimension to display the required output? It would help keep data size low, but may cause performance lag when loading the report.
July 30, 2018 at 10:50 am
I'm having trouble with needing a data warehouse to create such a simple report. That kind of report just doesn't need a DW to exist. In fact if you parameterize the query to allow specifying a particular invoice, you've taken a need for the DW entirely out of the picture. If load on the production system for reports is too high, you can use replication to copy the live database to a reporting server. If I've missed your point, please educate me, because I can't find a reason to DW this scenario in the first place, whereas I can find tons of good reasons to trace an invoice through all it's phases.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 30, 2018 at 10:59 am
Thanks so much for your responses, and sorry if I haven't provided enough context.
A data warehouse already exists for reporting throughout the sales department. We've got 8 or 9 fact tables and about 12 dimension tables. This specific routing report is ancillary to the main goals of the organization.
My question really drives at wanting to keep data size to a minimum so as not to make this report a bottleneck, but also keep it responsive and limit processing time to the cube. I worry if I store the fact table for this Invoice Routing with 27 million rows of Invoice Number and PhaseName (int and varchar(255), it'll increase overall query time when the cube is processed.
July 30, 2018 at 12:02 pm
jklenk86 - Monday, July 30, 2018 10:59 AMThanks so much for your responses, and sorry if I haven't provided enough context.A data warehouse already exists for reporting throughout the sales department. We've got 8 or 9 fact tables and about 12 dimension tables. This specific routing report is ancillary to the main goals of the organization.
My question really drives at wanting to keep data size to a minimum so as not to make this report a bottleneck, but also keep it responsive and limit processing time to the cube. I worry if I store the fact table for this Invoice Routing with 27 million rows of Invoice Number and PhaseName (int and varchar(255), it'll increase overall query time when the cube is processed.
Not sure why it's all that necessary to push the name to the cube. The ID value is sufficient, as SSRS can have the cube result as one dataset, and the phase table as a lookup scenario, so I guess I'd need to understand the objective of having phase as a separate dimension. In terms of having 98 rows, it's only of value when it's tied to the Invoice table, so analyzing any of that data requires all the invoice rows anyway, so I still can't see that as it's own dimension - so how many rows are in the invoice table?
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 30, 2018 at 1:58 pm
sgmunson - Monday, July 30, 2018 12:02 PMIn terms of having 98 rows, it's only of value when it's tied to the Invoice table, so analyzing any of that data requires all the invoice rows anyway, so I still can't see that as it's own dimension - so how many rows are in the invoice table?
This is a great point. The current Invoice table has 600,000 rows in it. I arrived at the 27M number by allowing an average of 45 Phases per Invoice.
Let's strip the tables down to just what's needed for now:
Scenario 1: single fact table
Scenario 2: single fact table, dimension table
I guess my question is ultimately: Is Scenario 1 better than Scenario 2? My goals are to keep the footprint light while also keeping performance tuned.
You're right about the end game; I ultimately need to end up with 27 million rows containing Invoice Number and Phase Name. And this table will grow over time. I just don't know whether I should take up more storage space to gain an increase in performance, or keep this lightweight but have possible performance suffer during cube processing.
July 30, 2018 at 3:02 pm
jklenk86 - Monday, July 30, 2018 1:58 PMsgmunson - Monday, July 30, 2018 12:02 PMIn terms of having 98 rows, it's only of value when it's tied to the Invoice table, so analyzing any of that data requires all the invoice rows anyway, so I still can't see that as it's own dimension - so how many rows are in the invoice table?This is a great point. The current Invoice table has 600,000 rows in it. I arrived at the 27M number by allowing an average of 45 Phases per Invoice.
Let's strip the tables down to just what's needed for now:
Scenario 1: single fact table
- Invoice Number and Phase Name
- 27M rows at 4 bytes per InvoiceNumber + 2 for length is 6 bytes per InvoiceNumber. 181 characters max (the longest Phase Name) + 2 for length is a maximum 183 bytes. So total data row would be 189 bytes per row.
- At 27M rows, that's 42 rows per page, which is 642,857 pages, which is an additional 5.2 GB. I'd probably add a covering non-clustered index for queries.
Scenario 2: single fact table, dimension table
- Invoice Number and Phase ID
- Phase ID and Phase Name
- With an InvoiceNumber and PhaseId fact table, that's 12 bytes per row, which is 671 rows per page. 40,238 pages is 329 MB additional space
- The dimension table would be 6 bytes per Phase ID column + 183 bytes for Phase Name. 98 Rows at 189 bytes comes out to an additional 19KB in space.
I guess my question is ultimately: Is Scenario 1 better than Scenario 2? My goals are to keep the footprint light while also keeping performance tuned.You're right about the end game; I ultimately need to end up with 27 million rows containing Invoice Number and Phase Name. And this table will grow over time. I just don't know whether I should take up more storage space to gain an increase in performance, or keep this lightweight but have possible performance suffer during cube processing.
Okay, explain to me how you get to 27 million rows? Something is getting lost in my head on that piece. You currently have only 600,000 rows and the question is how you plan on populating the DW. Does the invoice table get a new row every time the phase changes? Or ???
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 30, 2018 at 3:11 pm
And a couple of other thoughts... I'm doubtful that even if you kept phase as a standalone dimension, I don't see much value in it. About the only way that has value on it's own is if the existing phase data suggests a fairly strong pattern that is considerably stronger than merely being statistically significant. The time frame of a DW is typically to span 5 to 10 years or more, and phase data is too likely to change pattern in that time frame in an entirely unpredictable way. I'm struggling with the idea that phase data is relevant in anything but in aggregate for a measure of what phase something is in after a fixed amount of time from invoice creation. For example, on average, what percentage of invoices are in each stage a given number of months from creation. As that can be reported without using a cube, I'm struggling with why it's needed in the DW, unless the DW is the only reporting source. I'd keep it out of any kind of dimension, because I'm not convinced that there's enough value to spend any resources on it at all. Once you explain how you're getting all those 27M rows in the DW, I might have a different perspective....
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 30, 2018 at 3:40 pm
Good questions. The 27M rows comes from the following:
The way I would populate the table would be to query the current Invoice fact table and join to a preexisting Phase lookup table:SELECT InvoiceNumber
, PhaseName
FROM InvoiceTable i
INNER JOIN PhaseTable p ON p.PhaseID = i.PhaseID
This will give me the option to either populate all 27M rows with PhaseName, or all 27M rows with Phase ID, and bring the Phase lookup into the cube as a dimension.
I'm struggling with the idea that phase data is relevant in anything but in aggregate for a measure of what phase something is in after a fixed amount of time from invoice creation. For example, on average, what percentage of invoices are in each stage a given number of months from creation.
July 31, 2018 at 6:08 am
jklenk86 - Monday, July 30, 2018 3:40 PMGood questions. The 27M rows comes from the following:
- 600,000 unique Invoices
- Average 45 stages per Invoice. Some will have significantly more, some will have significantly less. For example, some Invoices may only hit 12 Phases, some will hit all 98 (not common but it happens.)
This will result in 27M rows of data, with 45 rows per Invoice based on that average. The end result of this report will be to query against this table to get the routing of an Invoice and the dates or other measures involved.The way I would populate the table would be to query the current Invoice fact table and join to a preexisting Phase lookup table:
SELECT InvoiceNumber
, PhaseName
FROM InvoiceTable i
INNER JOIN PhaseTable p ON p.PhaseID = i.PhaseIDThis will give me the option to either populate all 27M rows with PhaseName, or all 27M rows with Phase ID, and bring the Phase lookup into the cube as a dimension.
I'm struggling with the idea that phase data is relevant in anything but in aggregate for a measure of what phase something is in after a fixed amount of time from invoice creation. For example, on average, what percentage of invoices are in each stage a given number of months from creation.
- Report on Invoice Routing to determine how we ended up with the Final Invoice Amount. If any adjustments are made, where were they made and was that correct. Right now there's very little visibility of that.
- Report on time spent between Phases, identification of bottlenecks between departments, etc.
- Report on time current-state and point-in-time Invoice count per phase (aggregate).
Okay, but what I was getting at was how the data is stored in the Invoice table. Unless you are tracking all the changes to that table that occur during the day (and for that I have to assume that the current Invoice table just keeps the current phase), and holding on to them, you've got no way to query a current state Invoice table AND have the data to populate the DW. I have to believe that an invoice could easily pass through quite a number of phases in a given day.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 31, 2018 at 6:54 am
sgmunson - Tuesday, July 31, 2018 6:08 AMOkay, but what I was getting at was how the data is stored in the Invoice table. Unless you are tracking all the changes to that table that occur during the day (and for that I have to assume that the current Invoice table just keeps the current phase), and holding on to them, you've got no way to query a current state Invoice table AND have the data to populate the DW. I have to believe that an invoice could easily pass through quite a number of phases in a given day.
Yes, I see what you're saying now.
So yes, we do track changes, through another set of tables not included in the data warehouse. The current Invoice fact table gets updates twice daily on what phase it stands in. The new report would pull from that change table as well as the Phase lookup table to combine all Invoices, Phases, and Date/Times. The reason we do not store the change table in the data warehouse is due to size.
July 31, 2018 at 7:19 am
A couple of things here:
Not sure why you would be dissuaded from putting the data in a data warehouse. You could create the report from the operational system, but it would have an effect on that system.
The basic process you're describing is an accumulating snapshot report. This would normally have a single row per invoice with the current phase in the record. That row is updated as the row moves through the phases. It's not uncommon that some or all of the phases might have a separate column for the date.
You have a lot of possible phases, so this may not be possible. Given the layout that you proposed, having a fact table at the granularity of invoice/status change may be necessary. As you point out, it will have a lot of rows. You may want to add a metric for time in that status and have some indicator if the status is the current one. That doesn't exclude a fact at the granularity of invoice with only the current phase and columns with the key dates. The date dimensions can be set up as role playing dimensions. This kind of fact table also has to be able to deal with update, unlike the more common transactional ones (like the invoice/status change granularity fact table would be).
July 31, 2018 at 7:31 am
One more thing:
If the report intention is to report on individual invoices, as your example indicates, than that might be better done through an OLTP system as another poster said earlier. Data warehouses are better at aggregating large volumes of information to detect patterns (how long on average is an invoice at a particular stage) and identify outliers (this invoice has been open for a year--which is it and what's wrong?). If you need to identify the history of an individual invoice, that is better suited to OLTP reporting. With apologies to the previous poster who said something similar.
July 31, 2018 at 8:46 am
RonKyle - Tuesday, July 31, 2018 7:31 AMOne more thing:
If the report intention is to report on individual invoices, as your example indicates, than that might be better done through an OLTP system as another poster said earlier. Data warehouses are better at aggregating large volumes of information to detect patterns (how long on average is an invoice at a particular stage) and identify outliers (this invoice has been open for a year--which is it and what's wrong?). If you need to identify the history of an individual invoice, that is better suited to OLTP reporting. With apologies to the previous poster who said something similar.
Yeah, so I think that may have been me actually. We do have an OLTP Cube built that feeds from the data warehouse.
I think ultimately, given everyone's answers and just talking about the different data sizes, I may stick to a fact table / dimension table scenario. Since the OLTP Cube stores data by highest cardinality, if I keep my fact table to Invoice Number and Phase ID, the data warehouse won't grow much. Using a view to populate the Cube with Invoice Number and Phase Name, the Cube storage won't take a significant hit, since we're already storing Invoice Number for other tables.
If any of this sounds incorrect feel free to call it out, but I think this is the best bet. I can add covering indexes to my table to try and keep the view performing well.
Viewing 15 posts - 1 through 15 (of 30 total)
You must be logged in to reply to this topic. Login to reply