High Data Volume

  • Hi All

    I am working on a new application that needs to handle rather high volumes of data. The structure basically holds a list of people and information about them. The business requires this list to grow to 500,000 people and beyond.

    Because the information requirements for each individual change from implementation to implementation, the structure was designed to use metadata to describe the details, with all of the actual data stored in one table. At 56,000 individuals this table held 18,000,000 records, roughly 320 data items per individual.

    At this point the database performance was ... well ... non-existent.

    After adding and tuning several indexes and cleaning up after an overnight mass-population run, the application became responsive, but not to an acceptable level, and only with a single user.

    My question now is: what are the biggest performance gains to be had in this kind of scenario? I know there are many options, that they overlap, that some actually hamper others, and that there is a fine balance to be struck.

    1. Indexing: how much performance does this bring at such volumes, possibly 100,000,000 records in the future?

    2. Segmenting the data into separate tables: does this actually help, given that the data is still in one database?

    3. Is there a performance difference between many simultaneous queries on one huge table and fewer queries on several tables of the same size?

    4. If I segment my data into different tables, will this give adequate performance, or no real change, since the server now needs to jump between tables to return the data?

    5. Hardware: how much hardware do I need to throw at the problem to achieve acceptable performance?

    Phew, what a mouthful. Basically I just don't know where to start, don't want to waste my time on trial and error, and would appreciate some expert advice.

    I've attached a diagram of the tables in question. The yellow tables hold the metadata describing the structure of the information, and the blue tables hold the actual data. The table with 18 million records in it is called "relCandidateFieldValue".
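    Roughly, the shape of the structure described above is something like this (the exact definitions are in the attached diagram; apart from relCandidateFieldValue, these names are just shorthand for illustration):

        -- "Yellow" metadata: one row per data item a customer wants to capture.
        CREATE TABLE dbo.metaField (
            FieldID   int           PRIMARY KEY,
            FieldName nvarchar(100) NOT NULL,
            FieldType varchar(20)   NOT NULL   -- e.g. 'string', 'int', 'date'
        );

        -- "Blue" data: one row per individual per data item, everything stored as text.
        CREATE TABLE dbo.relCandidateFieldValue (
            CandidateID int           NOT NULL,
            FieldID     int           NOT NULL REFERENCES dbo.metaField (FieldID),
            FieldValue  nvarchar(max) NULL,
            PRIMARY KEY (CandidateID, FieldID)
        );

        -- 56,000 individuals * roughly 320 data items each gives the 18,000,000 rows mentioned above.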

    If you have any questions let me know and I will post any details you require.

    Thanks to All.

    Jens

    Basically, what you have is a version of the "one true look-up table". Lots of developers and DBAs have the realization that they can store metadata in one place and all of their actual data in another, and only have to keep data in two tables. It always, every time, becomes a performance and maintenance nightmare. It's one of those "it seemed like a good idea at the time" things.

    The only real solution is to move the data into normalized tables that keep their metadata in the table definition. I hate to say it, but it's true. You will never achieve good performance until you do that.
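    To make "metadata in the table definition" concrete, here's a minimal sketch with assumed table and column names (not taken from your diagram): one typed row per fact, instead of one name/value row per attribute.

        -- One row per person, with the attributes as real, typed columns.
        CREATE TABLE dbo.Candidate (
            CandidateID int IDENTITY(1,1) PRIMARY KEY,
            FirstName   nvarchar(50) NOT NULL,
            LastName    nvarchar(50) NOT NULL,
            DateOfBirth datetime     NULL
        );

        -- One row per education entry per person, instead of a handful of
        -- name/value rows in relCandidateFieldValue.
        CREATE TABLE dbo.CandidateEducation (
            EducationID   int IDENTITY(1,1) PRIMARY KEY,
            CandidateID   int NOT NULL REFERENCES dbo.Candidate (CandidateID),
            Institution   nvarchar(100) NOT NULL,
            Qualification nvarchar(100) NULL,
            YearCompleted smallint      NULL
        );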

    Until that can be done, partitioning the big table might give you a bit of a performance increase. Maybe. It depends on how your data would be segregated and then how it will be used.
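    If you do try that, it would look something like this (table partitioning, Enterprise Edition; the partitioning column and boundary values are assumptions, and the right boundaries depend entirely on how the data is queried):

        -- Partition the 18-million-row table by ranges of CandidateID.
        CREATE PARTITION FUNCTION pfCandidateRange (int)
            AS RANGE RIGHT FOR VALUES (100000, 200000, 300000, 400000);

        CREATE PARTITION SCHEME psCandidateRange
            AS PARTITION pfCandidateRange ALL TO ([PRIMARY]);

        -- Put the clustered index on the partition scheme so all rows for a given
        -- candidate land in one partition. If the table already has a clustered
        -- primary key, that index would have to be rebuilt on the scheme instead.
        CREATE CLUSTERED INDEX IX_relCandidateFieldValue_Candidate
            ON dbo.relCandidateFieldValue (CandidateID, FieldID)
            ON psCandidateRange (CandidateID);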

    SQL Server has no problem at all dealing with hundreds of thousands of records. What you're looking at is a medium-sized database. No big deal if it had been built correctly in the first place.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Actually, after re-reading your post, I realized you currently have 56-thousand records and might expand up to 500-thousand. I thought you already had 500-thousand. That means you have what should be a small database that might grow into a medium database.

    I still recommend normalizing the data structures, and seeing if you can get a restraining order placed on the original developer that restricts him from ever coming within 100 yards of any database ever again.

    - Gus "GSquared"

    GSquared is absolutely right: you need to create a normalised structure out of that mess. It's the only way to go. Preferably get someone who knows what they are doing, too, unlike the original developer. It shouldn't take long to devise a decent structure and populate it with the current data: maybe a few weeks' work, a month at most.

    GSquared is also right about getting a restraining order for the original developer!

  • Thanks Guys.

    I was fearing this ... and yes, the normalised approach of creating a table for every type of record to be stored (Personal Details, Educational History, etc.) will indeed result in fewer records in each of those tables.

    The problem is that the "data structure" should not be fixed; once installed, the owner/operator of the system should be able to change the structure via a simple interface. This is why the metadata/storage-data approach was chosen. All that needs to change is the metadata.

    I guess we can change the design so the system automatically ALTERs / CREATEs tables as the administrator changes the structure, but I'm worried that this will actually cause other, or even more, issues than we have now, especially when it comes to maintaining relationships, data integrity and indexes.
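    Just to show what I mean, a minimal sketch of that idea, generating DDL from the admin's input (the table name, field name and type here are made up for illustration):

        -- The admin adds a field "GraduationYear" of type smallint through the UI;
        -- the application builds and runs the corresponding ALTER TABLE.
        -- (In real code the type would have to be validated against a whitelist.)
        DECLARE @NewFieldName sysname,
                @NewFieldType nvarchar(50),
                @sql          nvarchar(max);

        SET @NewFieldName = N'GraduationYear';
        SET @NewFieldType = N'smallint';

        SET @sql = N'ALTER TABLE dbo.CandidateEducation ADD '
                 + QUOTENAME(@NewFieldName) + N' ' + @NewFieldType + N' NULL;';

        EXEC sys.sp_executesql @sql;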

    The metadata approach also allows the administrator to change a field from a "string" to an "int" without any conversion or processing happening. All that happens is that existing records fail validation the next time they are edited and saved, and any new records require integer input. I know this might leave us with "stupid" data, but the nature and source of the data unfortunately dictate that it will never be perfect and clean.

    Any last few words before I crawl into my cave for a few weeks?

  • I have to say I've never come across a system where people are changing the actual structure of the database itself on a regular basis.

    However, it's not a complete show-stopper. If you can identify which fields are most likely to change, you could write scripts that convert the existing data, or even use SSIS packages: one package dumps the current table to a file (SSIS has a native raw file format that is very fast for I/O), then the ALTER TABLE command runs, and then another package reads the dumped file back in. You would need some scripting that alters the read package so it knows which fields to run conversions on.
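    For the simple "convert in place with a script" case, the shape of it would be something like this (table, column and target type are purely hypothetical):

        -- 1. Find values that would block the type change (ISNUMERIC is only a crude
        --    filter; tighten the check for the real target type).
        SELECT EducationID, YearsExperience
        FROM   dbo.CandidateEducation
        WHERE  YearsExperience IS NOT NULL
          AND  ISNUMERIC(YearsExperience) = 0;

        -- 2. Once those rows are cleaned up (or set to NULL), change the type.
        ALTER TABLE dbo.CandidateEducation
            ALTER COLUMN YearsExperience int NULL;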

    Whatever you do to automate it is going to require some clever coding...could take quite a while. Good luck!

  • One solution would be to split out the common and standard data, and make those into tables, where the end-user does NOT have access to modify the structure, then add one table that stores XML schemas for customized data structures, and another that stores the custom data as XML, along with a schema ID.

    For example, it's pretty much a given that, if this is storing data about people, you can have a table with a name column, another with phone numbers, another with e-mail, and yet another with addresses. All are tied together by a PersonID (or UserID or AccountID, or whatever is appropriate). If 90% of the people using the database will need gender and date of birth, include those. Internationalize those as needed, then make the structures static and write code specifically for CRUDing those things.

    Then, when one of your users needs to add a "Dogs Owned" "table", that gets added to an XML schema with data they want (maybe breed, age, name, mother, father, gender, veterinary history), while another user who has no use for that data, but needs to record student data, can add that in the same manner.

    At least the most common data will perform better, while only the custom data will be a pain. Also, XML can be indexed in SQL 2005, including nodes and values, and can then be queried pretty efficiently.

    Basically, normalize all the usual stuff, then use XML for the custom stuff. It will almost certainly be better than one-true-lookup. It will never perform as well as a database that's actually set up for the individual needs, but it will be better than the current solution.
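    A rough sketch of that layout (all names here are assumptions, and the XML shape is whatever the customer's schema says it is):

        -- Static, normalized core that the end-user cannot restructure.
        CREATE TABLE dbo.Person (
            PersonID    int IDENTITY(1,1) PRIMARY KEY,
            FirstName   nvarchar(50) NOT NULL,
            LastName    nvarchar(50) NOT NULL,
            Gender      char(1)      NULL,
            DateOfBirth datetime     NULL
        );

        -- Customer-defined structures, stored as XML schemas.
        CREATE TABLE dbo.CustomSchema (
            SchemaID   int IDENTITY(1,1) PRIMARY KEY,
            SchemaName nvarchar(100) NOT NULL,
            SchemaXml  xml           NOT NULL
        );

        -- The custom data itself: one XML document per person per schema.
        CREATE TABLE dbo.CustomData (
            PersonID int NOT NULL REFERENCES dbo.Person (PersonID),
            SchemaID int NOT NULL REFERENCES dbo.CustomSchema (SchemaID),
            DataXml  xml NOT NULL,
            PRIMARY KEY (PersonID, SchemaID)
        );

        -- XML indexes (SQL 2005) so queries into DataXml stay reasonably efficient.
        CREATE PRIMARY XML INDEX PXI_CustomData_DataXml
            ON dbo.CustomData (DataXml);
        CREATE XML INDEX SXI_CustomData_DataXml_Path
            ON dbo.CustomData (DataXml)
            USING XML INDEX PXI_CustomData_DataXml FOR PATH;

    Queries against the static columns stay plain relational SQL; only queries into DataXml pay the XML-shredding cost, which the path index helps with.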

    - Gus "GSquared"

  • Thanks for the replies.

    Making some of the data "static" is definitely something to consider; however, I have the feeling it will not remove nearly enough data from the look-up structure.

    Our last test included 360 lookup values per individual (56,000 * 360). Of those 360, we could possibly make 50 static. Any more and customers will complain about the data not being relevant to their purpose.

    Ultimately the problem stems from our customers all having common, yet subtly different, data requirements and processes. We thought of making the solution one-size-fits-all, but the market did not take kindly to that, and as always the big clients crack the whip and we jump.

    Are there any performance expectations I could have of a table of 20,000,000 records, or will this volume always have performance issues, no matter what?

    I might be expecting too much of SQL Server and indexes here, but I was hoping that 20 to 100 million records should not be all that much of an issue. I'm just wondering what kind of hardware I would need to put in place to handle it adequately; clearly there is a limit to what hardware can help with, and at some point more hardware will not give much improvement.

    While XML may be a solution to this particular design issue, I would take a good look at SQL 2008 and its Sparse Columns feature. I really bet that is a winner here; it sounds like exactly what the feature was designed for.
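    Roughly, and with made-up column names, that would look like:

        -- Sparse columns (SQL Server 2008): NULLs in SPARSE columns take no storage,
        -- so a wide table of mostly-empty, customer-specific attributes stays cheap.
        CREATE TABLE dbo.CandidateDetail (
            CandidateID    int PRIMARY KEY,
            DogsOwned      int           SPARSE NULL,
            StudentNumber  nvarchar(20)  SPARSE NULL,
            GraduationYear smallint      SPARSE NULL,
            -- The column set exposes all sparse columns as a single XML fragment,
            -- which helps when the set of columns differs per customer.
            AllDetails     xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
        );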

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

    SQL Server can handle 20 to 100 million rows with no problem, so long as the tables are well designed.

    The problem isn't the number of rows, it's the overhead from processing metadata that way.

    If you really are stuck with that kind of structure, I recommend moving away from any relational database and going over to something like Cache (an object-oriented database). It will work much better for this purpose. It's a bit of a pain to develop in, at least at first, but it's really quite good at handling this kind of data structure. Very fast and very reliable. Take a look at http://www.intersystems.com/cache/index.html for that database.

    Basically, instead of tables it has objects; instead of rows and columns it has properties; and instead of procs and triggers it has methods. It supports polymorphism, so you can add properties and such based on which customer is looking at the object. It will be great for what you're doing.

    - Gus "GSquared"
