December 18, 2003 at 9:30 am
I've got a client who wants me to use the same table to store multiple levels of aggregated data. For example: sales by client by city and sales by client by state in the same table, with an identifier showing whether a row holds state-level or city-level data. I think this is a poor design, but he's convinced that with the proper indexes, selecting pre-aggregated data will be just as fast as computing it on the fly through a view, which is how I want to do it. The table will eventually have many millions of rows, if not billions, and contain about 40 columns of numeric data. I don't have a lot of experience with large data sets, so I'm asking the community for your opinions, based on your experience, on whether it will make a difference and, if so, which approach will be more responsive.
At this point, I'm ready to build a table, populate it with dummy data and test it to see which one works best.
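For anyone picturing the two designs, here is a rough sketch of each (all table and column names below are invented for illustration, not taken from the poster's schema):

-- The client's proposal: one table holding both grains, with a level flag
CREATE TABLE SalesAgg (
    ClientID  int         NOT NULL,
    StateCode char(2)     NOT NULL,
    City      varchar(50) NULL,        -- NULL on state-level rows
    AggLevel  char(1)     NOT NULL,    -- 'C' = city grain, 'S' = state grain
    SalesAmt  money       NOT NULL
    -- ... plus the other ~40 numeric columns
)

-- The alternative: store only the finest grain and derive the state level
CREATE TABLE SalesByCity (
    ClientID  int         NOT NULL,
    StateCode char(2)     NOT NULL,
    City      varchar(50) NOT NULL,
    SalesAmt  money       NOT NULL
)
GO
CREATE VIEW SalesByState
AS
SELECT ClientID, StateCode, SUM(SalesAmt) AS StateSales
FROM SalesByCity
GROUP BY ClientID, StateCode
GO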
December 18, 2003 at 9:42 am
A good option may be partitioned views (horizontal partitioning): create separate tables for the data, then create a VIEW combining the tables. This is especially nice in SQL Server. When you create a table with a row-level CHECK constraint such as [SalesDate] BETWEEN '1/1/00' AND '1/31/00', SQL Server will "drop" that table from the query execution plan when the view is queried for records outside the date range. So creating tables at various date and/or aggregation levels and combining them in a VIEW could be very beneficial for performance.
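A minimal sketch of such a partitioned view (names and date ranges invented here):

-- One table per month, each carrying a CHECK constraint on the date column
CREATE TABLE Sales200001 (
    SalesDate datetime NOT NULL
        CHECK (SalesDate >= '20000101' AND SalesDate < '20000201'),
    SalesAmt  money    NOT NULL
)
CREATE TABLE Sales200002 (
    SalesDate datetime NOT NULL
        CHECK (SalesDate >= '20000201' AND SalesDate < '20000301'),
    SalesAmt  money    NOT NULL
)
GO
CREATE VIEW SalesAll
AS
SELECT SalesDate, SalesAmt FROM Sales200001
UNION ALL
SELECT SalesDate, SalesAmt FROM Sales200002
GO
-- A query like this should touch only Sales200001 in the plan,
-- because the CHECK constraints rule the other table out:
-- SELECT SUM(SalesAmt) FROM SalesAll WHERE SalesDate = '20000115'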
Once you understand the BITs, all the pieces come together
December 18, 2003 at 10:00 am
I'll consider it. I can see it being usable with either design approach.
December 18, 2003 at 11:03 am
If I were you, I'd make a view, even if for right now it's just SELECT * FROM Table. That way you can more easily test various query scenarios just by altering the view, and you can wait for other responses and try them out without having to change the client-side code.
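Something as simple as this (names are placeholders):

-- A pass-through view you can later ALTER without touching client code
CREATE VIEW SalesData
AS
SELECT * FROM SalesTable
GO
-- Later, swap the implementation behind the same name:
-- ALTER VIEW SalesData AS SELECT ... FROM SomeOtherDesign
-- Note: a view's column list is frozen at creation time; if the base
-- table changes, run sp_refreshview to pick up the new columns.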
Once you understand the BITs, all the pieces come together
December 18, 2003 at 12:21 pm
Do you need to maintain the relationships between the levels of aggregation? I mean, will the total sales of city1, city2 ... cityN in state1 add up to what is stored for state1? If so, the integrity constraints you will need to create are, to me, a more important factor than whatever performance you might get with proper indexes. I don't think this is a good design; I would use a view, just as you suggest.
--
Chris Hedgate @ Extralives (http://www.extralives.com/)
Contributor to Best of SQL Server Central 2002 (http://www.sqlservercentral.com/bestof/)
Articles: http://www.sqlservercentral.com/columnists/chedgate/
December 18, 2003 at 12:59 pm
Do you need to maintain the relationships between the levels of aggregation? I mean, will the total sales of city1, city2 ... cityN in state1 add up to what is stored for state1?
Yes.
December 19, 2003 at 2:43 am
Then, as I said, I would never go with the design your client proposed. Tell him that if he wants it, you will have to create extremely difficult integrity constraints, if they are even possible to create at all. They will very likely hurt performance, and since they are so complex they could very well fail because of unforeseen bugs. And failed constraints mean corrupted, inconsistent data, so why use a database in the first place if he is willing to risk that?
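To make the difficulty concrete, here is a rough sketch of what such a check might look like as a trigger, using the hypothetical SalesAgg table sketched earlier in the thread. Note that it has to re-aggregate on every insert, update, and delete, which is exactly the kind of complexity and overhead being warned about (NULL handling, narrowing to just the affected rows via the inserted/deleted tables, etc. are left out):

CREATE TRIGGER trg_CheckStateTotals ON SalesAgg
AFTER INSERT, UPDATE, DELETE
AS
IF EXISTS (
    SELECT *
    FROM SalesAgg s
    WHERE s.AggLevel = 'S'
      AND s.SalesAmt <> (SELECT SUM(c.SalesAmt)
                         FROM SalesAgg c
                         WHERE c.AggLevel = 'C'
                           AND c.ClientID  = s.ClientID
                           AND c.StateCode = s.StateCode)
)
BEGIN
    RAISERROR ('State-level rows no longer match the sum of their city rows.', 16, 1)
    ROLLBACK TRANSACTION
END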
--
Chris Hedgate @ Extralives (http://www.extralives.com/)
Contributor to Best of SQL Server Central 2002 (http://www.sqlservercentral.com/bestof/)
Articles: http://www.sqlservercentral.com/columnists/chedgate/
December 19, 2003 at 6:39 am
Hi, just a quick comment: since it seems the purpose is dealing with historical data, or at least analysis, you may want to approach the task from a data warehouse perspective, with fact and dimension tables. A good book on data warehousing (or the sections on MS SQL Server Analysis Services) may provide useful hints.
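For instance, a bare-bones star-schema sketch along those lines (all names invented for illustration):

CREATE TABLE DimClient (
    ClientKey  int PRIMARY KEY,
    ClientName varchar(100) NOT NULL
)
CREATE TABLE DimGeography (
    GeoKey    int PRIMARY KEY,
    City      varchar(50) NOT NULL,
    StateCode char(2)     NOT NULL
)
CREATE TABLE FactSales (
    ClientKey int      NOT NULL REFERENCES DimClient (ClientKey),
    GeoKey    int      NOT NULL REFERENCES DimGeography (GeoKey),
    SalesDate datetime NOT NULL,
    SalesAmt  money    NOT NULL   -- stored at the finest grain only
)
-- State-level figures then come from a GROUP BY on DimGeography.StateCode,
-- or from aggregations Analysis Services builds over this schema.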
Best regards.
December 19, 2003 at 8:49 am
Storing multiple levels of aggregation is a good idea. Storing them in the same table is a bad idea. Each table should have a consistent "grain". Otherwise, one of the problems you face is over-aggregation when the user (or application) fails to apply the appropriate constraint to the query (see the sketch below). While the single-grain rule will result in more "fact" tables, each table will be smaller and easier to maintain, and BI tools will be able to navigate to the appropriate fact table without the user having to worry about which table to query.
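The over-aggregation trap, shown against the hypothetical mixed-grain SalesAgg table sketched earlier in the thread:

-- Wrong: forgetting the grain filter silently double-counts every sale,
-- because the state rows already contain the city rows' totals.
SELECT ClientID, SUM(SalesAmt) AS TotalSales
FROM SalesAgg
GROUP BY ClientID

-- Correct: restrict the query to a single grain.
SELECT ClientID, SUM(SalesAmt) AS TotalSales
FROM SalesAgg
WHERE AggLevel = 'C'
GROUP BY ClientID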
Kurt G. Wolff
December 19, 2003 at 6:27 pm
After doing some performance testing against a sample database of a million records, I found that indexed views are as fast as pre-aggregating the data. So my proposal to my client is to store the data only at the finest level of granularity and use indexed views for the other levels of granularity. I have but one final question: how do you set up an indexed view that will allow me to get standard deviation data from it? BOL wasn't too helpful. TIA.
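One approach, sketched below with invented names: an indexed view can't contain STDEV directly (only SUM and COUNT_BIG are allowed as aggregates), but it can materialize the pieces STDEV is computed from, COUNT_BIG(*), SUM(x), and SUM(x*x), and the sample standard deviation can then be derived at query time. Assumes a NOT NULL money column; for very large data sets, watch the SUM(SalesAmt * SalesAmt) term for overflow/precision.

CREATE TABLE dbo.SalesByCity (
    ClientID  int         NOT NULL,
    StateCode char(2)     NOT NULL,
    City      varchar(50) NOT NULL,
    SalesAmt  money       NOT NULL
)
GO
CREATE VIEW dbo.SalesStats
WITH SCHEMABINDING
AS
SELECT ClientID,
       StateCode,
       COUNT_BIG(*)             AS N,
       SUM(SalesAmt)            AS SumX,
       SUM(SalesAmt * SalesAmt) AS SumXX
FROM dbo.SalesByCity
GROUP BY ClientID, StateCode
GO
CREATE UNIQUE CLUSTERED INDEX IX_SalesStats
    ON dbo.SalesStats (ClientID, StateCode)
GO
-- Sample standard deviation (matches T-SQL's STDEV formula):
-- sqrt((n * sum(x^2) - sum(x)^2) / (n * (n - 1)))
-- NOEXPAND makes non-Enterprise editions use the view's index.
SELECT ClientID, StateCode,
       SQRT( (CAST(N AS float) * SumXX - CAST(SumX AS float) * SumX)
             / (CAST(N AS float) * (N - 1)) ) AS SalesStDev
FROM dbo.SalesStats WITH (NOEXPAND)
WHERE N > 1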