February 16, 2005 at 2:33 pm
I have a set of 27 reports (queries), and the response time for the reports has slowed considerably as the database has grown. The reporting system's response time is directly related to the speed of the underlying SQL statement. Circumstances being what they are, the reports are derived almost exclusively from a single flat file (not created here!) that is far from relational, has lots of nulls, and, due to its structure, lacks any meaningful unique identifier.
When the database was around 100,000 records, the response time was good without having to do much of anything. When it grew to one million, response times ranged upwards of 5 minutes. I ran the Index Tuning Wizard and got most queries to less than 5 seconds and nothing worse than 20 seconds. When the database grew to 3 million, the response time shot back up, so I ripped out all the Unicode data types. This helped. Now, I have to do something serious because the database will grow to 5 million records (max). The data table is a data warehouse in the sense that the data contained within does not change. Currently, 100,000 records are added each week.
The programming strategies that I am evaluating are:
(1) Convert the existing single table flat file design into a normalized, multi-table, optimized relational design with referential integrity and single field clustered indexes, then transfer the data into the new structure. This would involve writing a new structure, converting existing databases and rewriting the SQL statements to work on the new structure. I might also need to address how new records in the flat file format get appended to the new structure.
(2) Create separate tables that exactly match the report definitions and create stored procedures that create/update the tables from the single flat file on an automated, scheduled basis. I would update the SQL statements in the reports to read from the respective new tables. The advantage to this strategy is that the queries for the new tables are practically written (convert the SELECTs to INSERT INTOs based on the existing SQL statements), and only the reports that are really suffering need to be restructured; a rough sketch of what I mean follows this list. Disk space is not an issue, and the development can be addressed incrementally by attending to the worst performing reports first.
(3) Cubing. The data is the detail records of weekly invoice data. Thus, I estimate that the data could be sliced by Invoice Date (weekly) or monthly. Each weekly invoice contains 50 accounts; that might be another slice. Think of a phone bill that has 100,000 calls per week under 50 different phone numbers and you need to report on the detail of each call, grouped by phone number and invoice date, etc.
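To sketch what I mean by strategy (2), with the table and column names invented purely for illustration, one of the existing report SELECTs would turn into a scheduled refresh along these lines:

-- Nightly refresh of a report-specific summary table (all names hypothetical)
TRUNCATE TABLE dbo.rpt_WeeklyAccountTotals;

INSERT INTO dbo.rpt_WeeklyAccountTotals (AccountNumber, InvoiceDate, CallCount, TotalCharge)
SELECT AccountNumber,
       InvoiceDate,
       COUNT(*),
       SUM(ChargeAmount)
FROM   dbo.InvoiceDetail        -- the big flat table
GROUP  BY AccountNumber, InvoiceDate;

The report's own SELECT would then just read from rpt_WeeklyAccountTotals with its date-range WHERE clause.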
Unfortunately, I have no way of knowing whether any of these methods will create split second response time other than taking the time to do them.
Any recommendations which way to go? Is there another strategy?
February 16, 2005 at 2:41 pm
If you can, then normalize... it's always the best option in the long run. But that would also mean rebuilding everything in the application.
February 16, 2005 at 2:53 pm
>>If you can, then normalize... it's always the best option in the long run.
... for a transactional OLTP system. If the primary requirement is responsive reports, then resolving to full 3rd normal form (or better) may cause reporting queries to contain so many joins that performance gets worse. This is unchanging data that gets bulk-loaded with a batch of new records each period, so designing for the performance of singleton CRUD operations should take a back seat to designing for sucking data out as fast as possible.
I would start by trying to understand the data - what data in this large flat table would be considered a "fact" that changes every load (eg account dollar balances, units in inventory, numerical measures), and what would be considered a fairly static dimensional descriptive attribute (eg supplier name, customer's account number, geographic data of suppliers/customers).
Then read a primer on dimensional modelling (eg: http://www.intelligententerprise.com/030101/602warehouse1_1.jhtml) and try to move the dimensional attributes (the ones your reports hit with WHERE, GROUP BY and ORDER BY) into dimension tables that are an order of magnitude smaller than your table of facts & measures (the one your reports hit with SUM()s, AVG()s and other aggregations).
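As a very rough sketch, with all table and column names invented for illustration, the split might look something like this:

-- Small dimension table: one row per account, holding the descriptive attributes
CREATE TABLE dbo.DimAccount (
    AccountKey    int IDENTITY(1,1) PRIMARY KEY,
    AccountNumber varchar(20) NOT NULL,
    AccountName   varchar(100) NULL
);

-- Large fact table: one row per call/detail line, holding the measures
CREATE TABLE dbo.FactCallDetail (
    AccountKey   int NOT NULL REFERENCES dbo.DimAccount (AccountKey),
    InvoiceDate  smalldatetime NOT NULL,
    CallDuration int NULL,
    ChargeAmount money NULL
);

-- Reports then join the small dimension to the big fact table and aggregate:
SELECT d.AccountNumber, f.InvoiceDate, SUM(f.ChargeAmount) AS TotalCharge
FROM   dbo.FactCallDetail f
JOIN   dbo.DimAccount d ON d.AccountKey = f.AccountKey
WHERE  f.InvoiceDate BETWEEN '20050101' AND '20050131'
GROUP  BY d.AccountNumber, f.InvoiceDate;

The fact table stays narrow (keys and measures only), which keeps the unavoidable big scans as cheap as possible.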
February 17, 2005 at 7:53 am
Would it be possible to separate the data into multiple tables, for example by year/quarter, etc.? That would give you a couple of options.
1) Create a view that "reassembles" the data from the separate tables (a sketch of such a view follows below). That would allow you to hide the underlying structure from the reports.
2) If the reports use stored procedures, you could then modify the stored procedures to call other SPs that look at the different tables.
I've never done it, but you could also spread the data across multiple servers if you like. You would just link the servers.
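Roughly, the view in 1) could look like this (the quarterly table names are just an example):

-- UNION ALL view that hides the quarterly split from the reports
CREATE VIEW dbo.InvoiceDetailAll
AS
SELECT * FROM dbo.InvoiceDetail_2004Q2
UNION ALL
SELECT * FROM dbo.InvoiceDetail_2004Q3
UNION ALL
SELECT * FROM dbo.InvoiceDetail_2004Q4
UNION ALL
SELECT * FROM dbo.InvoiceDetail_2005Q1;

If each underlying table has a CHECK constraint on its date column, SQL Server can treat this as a partitioned view and skip the tables that fall outside a report's date range.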
Curtis Smith
SQL Server DBA
Well it worked in Theory ...
February 18, 2005 at 5:07 pm
First, please confirm:
A previous poster suggested segregating the data by time period. Given that you note the data will reach a maximum of 5,000,000 records, and that you add 100,000 each week, I assume that this will ultimately either restart every (fiscal?) year or end up being a running tally of the last 12 months' worth of data.
Therefore, further segregation by time period may not be practical/possible.
Second:
When are the reports run? Throughout the week, or right after the update hits? Are the reports run multiple times, with different selection criteria?
Assuming that the reports are run multiple times with different selection criteria, pre-built, report-specific tables (your second alternative) could well be an excellent solution.
It might not hurt to revisit the details of the reports as well; make sure that all the data going into the pre-built table is needed (now, or in the foreseeable future), and that there is no desire for additional data.
Delivery mechanism can come into play here, as well; if the reports are pasted into Excel spreadsheets, for example, you might see if setting up a pivot table (possibly with an underlying, offline data cube) would meet your users' needs. This moves the work off the server, so that you don't have any contention issues causing further slowdowns (plus does the sort of data re/preformatting you're talking about).
Finally, speaking of contention issues: Especially if some of the queries are convoluted, you might want to be sure you are employing the WITH (NOLOCK) hint. I have seen occasions where complex SELECT queries have locked data so no one else could run a query (even another SELECT!). Never did figure out the details, but it happened more than once.
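For example (the table name here is just illustrative):

-- Dirty read: skips shared locks, so a long report SELECT neither blocks nor is blocked by other work
SELECT AccountNumber, SUM(ChargeAmount) AS TotalCharge
FROM   dbo.InvoiceDetail WITH (NOLOCK)
GROUP  BY AccountNumber;

Since your data doesn't change once loaded, the usual dirty-read caveat is much less of a concern here.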
Hope this is some help to you.
R David Francis
February 21, 2005 at 7:57 am
Thanks to everyone for the excellent feedback. In response to questions posed by RD Francis.
Segregation by time period: The database is a “rolling” twelve months. As a new week is added, the data that is over a year old is deleted. I thought about splitting the one huge table into multiple databases (like 3-month segments) but am not yet convinced it is the way to go.
Since it is a web-based reporting system, the reports are run by the clients during the common work week (8-5, Mon-Fri). This gives us almost every overnight time period to run SPs. The only selection criterion is a date range.
I am leaning towards developing the normalization structure and testing that for performance. I believe it will create a good foundation and baseline for further enhancements.
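Whichever structure I end up with, since the only filter is a date range, one thing I plan to test early is clustering the detail table on the invoice date (table and column names here are illustrative):

-- Cluster on the date column so a date-range report reads one contiguous range of pages
CREATE CLUSTERED INDEX IX_InvoiceDetail_InvoiceDate
    ON dbo.InvoiceDetail (InvoiceDate);

That should also make the weekly purge of year-old rows cheaper, since the rows to delete all sit at one end of the clustered index.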