Performance: INSERTing records into "large" tables.

  • We are working on a database design for a "core" application. The database will contain history data, and the tables are expected to grow to millions of records each.

    Other than the use of indexes, and maybe data archiving (messy), what design strategies can be used to keep performance reasonable? I am concerned that as the tables grow larger, the time needed to INSERT a record will grow as well.

    Can filegroups (which I have never used) help out? Does the edition of SQL Server (Standard, Enterprise) affect performance? Clearly we will give the box as much RAM as possible, but I think the database design will have the most impact.

    TIA,

    Barkingdog

  • I would love to hear other approaches as well. I can tell you how it was designed where I was, but I don't highly recommend the approach, for the reasons below. There has to be a better way...

    We created new "history" databases every quarter (or year, depending upon volume). The OLTP application would insert into a view that was pointed at the correct database. There would then be different views for each quarterly/yearly database and union views to combine them (a rough sketch of the union view appears at the end of this post).

    Advantages:

    1) There were no performance problems with inserts.

    2) When it was time to delete the history, the database just needed to be dropped and the views changed.

    Disadvantages:

    1) Maintenance NIGHTMARE - creating the new databases, changing the views, the timing of the view changes, etc.

    2) UNION view performance - the union views did not perform well, but since the data was not used frequently, we could get away with some lousy performance.

    I don't know much about partitioning, but what we are doing sort of sounds like it - but hopefully partitioning works better than this design does.
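
    Below is a rough sketch of what such a union view might look like; the database and table names (History2010Q1, History2010Q2, dbo.OrderHistory) are made up purely for illustration. With a CHECK constraint on the date column in each member table, SQL Server can often skip the quarters a query's WHERE clause rules out (partitioned-view elimination).

        -- Hypothetical quarterly history databases; names are illustrative only.
        CREATE VIEW dbo.OrderHistoryAll
        AS
            SELECT OrderID, OrderDate, Amount FROM History2010Q1.dbo.OrderHistory
            UNION ALL
            SELECT OrderID, OrderDate, Amount FROM History2010Q2.dbo.OrderHistory;
        GO
        -- Each member table should carry a CHECK constraint on OrderDate
        -- (e.g. OrderDate >= '20100101' AND OrderDate < '20100401') so the
        -- optimizer can eliminate quarters a query cannot touch.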

    Cindy

  • I find it interesting that you are concerned about the time to INSERT records as opposed to SELECT records.

    As far as design considerations and the effects on performance, I'm not sure you'll get a holistic response as there are so many environmental factors to consider.

    For instance, the slowest part of the operation is disk I/O. As such, if you can "delegate" the work across multiple disks when you insert/select/update data, you will generally see a performance improvement. But even so, this depends on how the multiple disks are configured within your environment.

    External environment aside...

    If you're concerned about INSERT performance only, adding indexes will cause more writes to occur, which has the potential to slow down INSERTs (depending on whether your indexes are in another filegroup on a different physical disk). On the other hand, the lack of proper indexes may result in undesirable response times when you go to SELECT data.

  • mzak (6/8/2010)


    ...

    External environment aside...

    If you're concerned about INSERT performance only, adding indexes will cause more writes to occur, which has the potential to slow down INSERTs (depending on whether your indexes are in another filegroup on a different physical disk). On the other hand, the lack of proper indexes may result in undesirable response times when you go to SELECT data.

    Fortunately, there's a "workaround": you could drop the indexes before the insert and recreate them afterward. But this assumes that the insert happens at a time with (almost) no SELECT statements (e.g. a nightly job).
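
    A minimal sketch of that workaround, assuming a hypothetical history table dbo.History with a nonclustered index IX_History_CustomerID (both names invented for illustration); disabling the nonclustered index before the load and rebuilding it afterwards avoids maintaining it row by row:

        -- Disable the nonclustered index before the nightly load
        -- (do NOT disable the clustered index, or the table becomes inaccessible).
        ALTER INDEX IX_History_CustomerID ON dbo.History DISABLE;

        -- Perform the bulk insert while (almost) nothing is reading the table.
        INSERT INTO dbo.History (CustomerID, EventDate, Payload)
        SELECT CustomerID, EventDate, Payload
        FROM dbo.HistoryStaging;

        -- Rebuilding the index re-enables it.
        ALTER INDEX IX_History_CustomerID ON dbo.History REBUILD;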

    Another "tool" that might help: horizontal table partitioning.

    But, as stated before, there's simply not enough information available right now...



    Lutz
    A pessimist is an optimist with experience.


  • Although not pleasant, an archival strategy should be considered early on, and in many cases it is better for the archive to live in a separate database. I can give a couple of reasons for this:

    1. Maintenance and backup time. By moving the historical data away from the "live" data, you minimize how long maintenance on the "live" data takes, and you also decrease backup and potential restore times in the event of issues. Once the data is archived, you might also perform backups and other maintenance on it less often.

    2. Physical disk separation. You can place the live and archive data on different drives or spindles, and you can also move your archive to slower drives.

    CavyPrincess described what I would call manual partitioning, and it is ... unpleasant to do; the built-in partitioning is much better, but it is still not simple. It is useful in both the live and archive databases. In the live system partitions tend to be smaller, often months; in the archive, years, and for data way out, tens of years.
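
    For reference, here is a bare-bones sketch of the built-in feature (an Enterprise-only feature in SQL Server 2008; the object names pfOrderDate, psOrderDate and dbo.OrderHistory are invented): a partition function splits rows by date, a partition scheme maps the ranges to filegroups, and the table is created on the scheme.

        -- Yearly ranges: with RANGE RIGHT, rows before 2009 land in the first
        -- partition, 2009 rows in the second, and so on.
        CREATE PARTITION FUNCTION pfOrderDate (datetime)
            AS RANGE RIGHT FOR VALUES ('20090101', '20100101', '20110101');

        -- Map every partition to PRIMARY for simplicity; in practice each range
        -- could live on its own filegroup (and therefore its own spindles).
        CREATE PARTITION SCHEME psOrderDate
            AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

        CREATE TABLE dbo.OrderHistory
        (
            OrderID   bigint   NOT NULL,
            OrderDate datetime NOT NULL,
            Amount    money    NOT NULL
        ) ON psOrderDate (OrderDate);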

    Insert performance is affected by a great many things: table structures, indexes, and substantially the disk subsystem, which is VERY often (but not always) the bottleneck. I mention indexes because they tend to decrease write performance and increase read performance. If your primary concern is write performance, you might want to look VERY seriously at your table normalization and try to keep the tables narrow in terms of columns and field sizes.

    You indicated a willingness to throw a lot of memory at the server; that is good. I would also recommend 64-bit SQL Server 2008, and I would avoid 32-bit at pretty much all cost. Filegroups CAN help separate out different sections of the database, but depending on your storage system this might be a lot of work for almost nothing. In many larger organizations the servers only have a boot drive and everything else is on a SAN; in many of those cases filegroups on different drives aren't going to get you a lot.
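
    If separate physical drives really are available, the mechanics are straightforward; a rough sketch (the database name, filegroup name and file path are placeholders) adds a filegroup and a data file on another drive, after which tables or indexes can be created ON that filegroup:

        ALTER DATABASE CoreDB ADD FILEGROUP HistoryFG;

        ALTER DATABASE CoreDB ADD FILE
        (
            NAME = CoreDB_History1,
            FILENAME = 'E:\SQLData\CoreDB_History1.ndf',
            SIZE = 10GB,
            FILEGROWTH = 1GB
        ) TO FILEGROUP HistoryFG;

        -- e.g. CREATE TABLE dbo.OrderHistory (...) ON HistoryFG;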

    You asked about performance in different editions; in general the answer is that there is no difference, with caveats: there are features and limitations imposed on most editions in the form of how many processors and how much memory can be used, see: http://www.microsoft.com/sqlserver/2008/en/us/editions-compare.aspx. One of the features, lock pages in memory, used to be an Enterprise-ONLY feature, but I believe with a recent service pack for 2008 it is available in Standard; this keeps SQL Server's memory pages from being pushed into the page file under memory pressure.

    To be honest, it would be easy to teach a whole class on database design; in my view it is quintessentially important to be able to view the big picture and see how everything interacts and how things will affect read and write performance.

    I will also say, you will probably do it wrong a few times; it hurts, but it happens. Application of classroom teachings in the real world isn't usually one-for-one.

    CEWII

  • Barkingdog,

    We have an application with many clients that have a few tables measured in the hundreds of millions of rows. These are proper meaty applications that do things spread over a number of tables, not just updating a row or two millions of times as in the TP benchmarks.

    You can read lots about problems with clustered index fragmentation caused by INSERTs, and in my experience it just needs to be managed. Pick a clustered index that doesn't fragment much (a small, ever-increasing key). If you can't, rebuild periodically - offline, or online with Enterprise edition.

    Ever-increasing key: the stuff that used to be true about insert contention on the last page of the table just isn't true any more. It's actually quicker with an ever-increasing key. I know it defies logic, but try writing a test harness and prove it, like I did when I read a post of Gail's some months ago quoting Kimberly Tripp on the subject. I didn't believe it would work, but it was 100% correct.
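
    A small sketch of the "small, ever-increasing key" idea (the table and constraint names are purely illustrative): an IDENTITY column as the clustered primary key means every INSERT lands at the logical end of the index, so the clustered index stays compact and fragments very little; when a rebuild is needed it can run online on Enterprise edition.

        CREATE TABLE dbo.AuditHistory
        (
            AuditHistoryID bigint IDENTITY(1,1) NOT NULL
                CONSTRAINT PK_AuditHistory PRIMARY KEY CLUSTERED,
            EventDate      datetime     NOT NULL,
            Payload        varchar(200) NOT NULL
        );

        -- Periodic maintenance; ONLINE = ON requires Enterprise edition.
        ALTER INDEX PK_AuditHistory ON dbo.AuditHistory REBUILD WITH (ONLINE = ON);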

    If your SELECTs always end up using an index seek, you are also fine.

    HOWEVER, what I find is that you end up with things that do index scans, and this is what you need to worry about, because it happens! A badly behaved query ends up dragging a big table into the data cache, and loads of great plans from totally unrelated queries get binned from the procedure cache.

    So you could:

    Partition horizontally: e.g. all 2009 orders in one table and 2010 in another. Microsoft have decided to call this sharding in Azure. It's messy to manage.

    Archive stuff (to another table / database): also a bit messy, but it can be managed as a separate process and therefore not affect the main application. This is a good option if this is a new app.
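
    A rough sketch of that kind of archive step, assuming hypothetical dbo.Orders and dbo.OrdersArchive tables with matching columns and no triggers or foreign keys on the archive table: the DELETE ... OUTPUT pattern moves rows out of the live table in a single statement, and the job can run off-hours as its own process. In practice this would usually be batched (DELETE TOP (N) in a loop) so the transaction log and locking stay manageable.

        -- Move orders older than 2009 into the archive table in one statement.
        DELETE FROM dbo.Orders
        OUTPUT deleted.OrderID, deleted.OrderDate, deleted.Amount
            INTO dbo.OrdersArchive (OrderID, OrderDate, Amount)
        WHERE OrderDate < '20090101';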

    Finally I agree with other posters on two points:

    1. Fast disks are key. But when a SAN is involved, it becomes much more problematic to understand the physical I/O that will be achieved.

    2. Therefore, the MORE the BETTER with cache memory, provided you never forget that this mitigates an I/O problem rather than solving it.

    Hope this is useful

    Tim


  • Thanks to everyone for some really fascinating answers.

    Barkingdog
