Splitting Data Files

  • I am migrating my production box to a new machine that has three RAID 5 arrays. My current production environment is primarily a data warehouse with large batch processes that leave the machine I/O bound. The workload is roughly 80% reads to 20% writes. In many cases we cannot tune the T-SQL much further than we already have.

    My question is: what are the I/O implications of splitting the PRIMARY filegroup's client data files and placing them on separate disk arrays? Would I get roughly double the I/O throughput? What if I split them into three separate physical files? Would I then get roughly triple the throughput? If so, are there any downsides? I know I still have my tempdb and log files to worry about, but please share what you know about file splitting, or point me to some online resources. (A sketch of the kind of split I mean is at the bottom of this post.)

    Thanks
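
    For concreteness, the kind of split I am asking about would look something like this; the database name, logical/physical file names, and the F: drive letter are just placeholders for one of the separate arrays:

        -- Add a second data file to the PRIMARY filegroup on a different array so that
        -- SQL Server spreads new allocations across both files (proportional fill).
        ALTER DATABASE WarehouseDB
        ADD FILE
        (
            NAME       = WarehouseDB_Data2,
            FILENAME   = 'F:\SQLData\WarehouseDB_Data2.ndf',
            SIZE       = 50GB,
            FILEGROWTH = 5GB
        )
        TO FILEGROUP [PRIMARY];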

  • Generally, splitting your data across as many read and write heads as possible will help performance.

    You have to examine your hardware first though.

    Remember that RAID 5 performs an XOR operation on every write (to generate the parity). Make sure your RAID controller has an XOR processor so this is not being done on your server CPU (few controllers lack one these days).

    Also, when using multiple arrays, the RAID controller can become a bottleneck, so you want to make sure it can handle requests as fast as, or faster than, all of the read and write heads on the drives can.

    To really maximize performance, you want to keep the maximum number of read and write heads busy at the same time, so make sure you split your database appropriately. If you put groups of tables that are all accessed at the same time on one array, you will see less benefit. Also, tempdb is used by lots of processes, so putting it on its own array (a solid-state array, if you can afford it) can be a big boost. Don't forget you can put tables and indexes in different filegroups, so if some indexes are used by one process and others by another, you can give each process its own read heads (see the sketch at the end of this post).

    Take the time to speak with your hardware vendor and tell them what you are doing. They should be able to help. Think it through and do some trial and error testing.
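
    For example, something along these lines gives a set of nonclustered indexes their own read heads; the filegroup name, drive letter, and table/index names here are only illustrative:

        -- Create a filegroup on its own array and build a nonclustered index there.
        ALTER DATABASE WarehouseDB ADD FILEGROUP [INDEXES];

        ALTER DATABASE WarehouseDB
        ADD FILE
        (
            NAME       = WarehouseDB_Idx1,
            FILENAME   = 'G:\SQLIndexes\WarehouseDB_Idx1.ndf',
            SIZE       = 20GB,
            FILEGROWTH = 2GB
        )
        TO FILEGROUP [INDEXES];

        -- The table (clustered index) stays in PRIMARY; this index lives on the other array.
        CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
        ON dbo.FactSales (CustomerKey)
        ON [INDEXES];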

  • Thank you for your feedback. Separating the clustered indexes from the nonclustered indexes and placing them in separate filegroups, which would then sit on separate disk arrays, was my thought before I talked to my folks here.

    What they had done before I got here was split the data into multiple files within a single filegroup, relying on proportional fill. I have never used this strategy before, but their impression is that each object's data is split between the files. BOL says that

    "SQL Server uses a proportional fill strategy across all the files within each filegroup"

    Since there is only one filegroup (PRIMARY), does that mean an index or clustered index would be written half to one file and half to the other? (The query at the bottom of this post is how I was planning to watch how the files fill up.)

    Thanks
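
    One thing I figured I could do is at least watch the proportional fill in action by comparing the space used in each file of the filegroup; this is just a catalog-view query, so the column names are standard but the output obviously depends on your own files:

        -- How full is each file in each filegroup? With proportional fill,
        -- files in the same filegroup should fill at roughly the same rate.
        SELECT  fg.name                                  AS filegroup_name,
                df.name                                  AS logical_file_name,
                df.physical_name,
                df.size / 128                            AS size_mb,
                FILEPROPERTY(df.name, 'SpaceUsed') / 128 AS space_used_mb
        FROM    sys.database_files AS df
        JOIN    sys.filegroups     AS fg
                ON df.data_space_id = fg.data_space_id
        ORDER BY fg.name, df.name;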

  • The downside to this strategy is that index fragmentation takes its toll very quickly and can force queries to do a lot of switching from one disk to another. It can work, though. I have never been a fan of it because you lose control over what is on which disk array.

    Take the time to do some testing with your database and a reasonable workload.

  • Tobi,

    I think the splitting is what happens, and you end up with roughly half and half over time. Most people who want to split things use separate filegroups, say one for data (clustered indexes) and one for nonclustered indexes.

    Just curious, if you are going to the trouble of splitting, why not use separate filegroups?

  • Using separate filegroups is the only way I would do it if I were the DBA and had authority over the system. The situation I face is that this is the architecture this shop has used for a couple of years now, and I didn't really come on as a "DBA" but rather as a data analyst, so the only thing I can do is recommend the best strategy.

    I haven't run into this file-splitting strategy at any other shop. In the past my recommendation has always been to split files only by filegroup. It seems strange to think about having half of an index on one array and half on the other: on one hand it seems like that would introduce twice the fragmentation, but on the other hand it seems like the index might be read twice as fast. I just don't have any documentation that covers this architecture beyond saying it can be done.

    I guess, absent documentation, the best thing would be to test. I suspect they won't give me the time to test, though, which is why I was hoping for references to whitepapers or something similar explaining why you should use filegroups to control where objects are stored.

  • 1) It has been my experience that queries are almost NEVER as optimized as my clients think they are.

    2) You can pick up tremendous throughput improvements with appropriate data placement.

    3) There is WAY, WAY more to performance tuning than just making a few files on different RAID sets!!!! Tell the company to get an expert to mentor them in redoing the warehouse storage, indexing, and queries. There are numerous experts to be found on this forum, or you can find a local SQL consultancy. The ROI on a good tuning person is astounding.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thank you for your reply, but it really didn't have anything to do with my question. I really just want information related to my question about splitting files within the same filegroup. I know it's not a performance panacea; it's just one small factor I am trying to research.

    Thanks

  • I thought I did address your question directly: "you can see tremendous throughput improvement...". And the question cannot be answered as explicitly as you ask. I could take one system and double the files across 2 drive systems and see NO throughput improvement (or even SLOWER performance). I could take another and spread it across 4 drive systems and see a 6-fold throughput improvement.

    You made a key point yourself, which I also touched on: files are one small piece of the performance pie. The problem is that if you aren't an EXPERIENCED and highly trained DBA, you won't even know many of the things to evaluate and adjust, and you will likely not be optimal about addressing the issues you do know about. That isn't a personal attack and shouldn't be taken as such. I can't fly a helicopter (despite the fact that I can fly an airplane) because I am not trained to do so. If I need a helicopter, I hire a pro to make the best use of it. 🙂

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • That is a very valid point.

    Can you split data files over disk arrays and see performance increases? - Yes

    Can you see performance decreases? - Yes

    Having someone with experience or training for this task will make it much easier.

    Barring that, you will have to do lots of reading and testing.

  • My initial recommendation to this client was to use two filegroups, each with one physical file: make the secondary filegroup the default and send clustered indexes to the primary and nonclustered indexes to the secondary (sketched at the bottom of this post). Beyond that I made several other recommendations based on performance counters I have gathered, testing I have done, and the physical architecture of the old system versus the new system. For instance, I am suggesting they use one RAID 10 array for the log files, because those will be 90%+ writes and should not have to pay RAID 5's XOR parity-write penalty.

    However, I was getting pushback from the project manager because he doesn't want to do a bunch of work dealing with where all the objects are stored and thinks the current file layout is good enough. Obviously testing will reveal the answers I seek, but I was really looking for some heuristics.

    For instance, is it the extents that get split between the drives, the pages, or the data rows? I would think fragmentation would be more of an issue for a clustered index that is split between files. What about thread spawning: will more threads generally get spawned where more files are involved, and will that generally be a good thing, or will it create contention because one spindle has to seek for two threads, one reading the clustered index and the other reading one of the nonclustered indexes? I was just looking for general information like this, because my client probably won't spend the money to do the testing necessary to figure out what the optimal strategy really is.

    Anyway, I get the feeling that there hasn't been much testing of this sort done and documented out there so I will see if the client will let me do my own.

    Thanks,
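
    For the record, the two-filegroup recommendation looks roughly like this; the database name, filegroup name, file path, and table/index names are placeholders:

        -- Secondary filegroup on its own array, made the default so new objects
        -- (including nonclustered indexes) land there unless placed explicitly.
        ALTER DATABASE WarehouseDB ADD FILEGROUP [SECONDARY];

        ALTER DATABASE WarehouseDB
        ADD FILE
        (
            NAME       = WarehouseDB_Sec1,
            FILENAME   = 'F:\SQLData\WarehouseDB_Sec1.ndf',
            SIZE       = 50GB,
            FILEGROWTH = 5GB
        )
        TO FILEGROUP [SECONDARY];

        ALTER DATABASE WarehouseDB MODIFY FILEGROUP [SECONDARY] DEFAULT;

        -- Clustered index (the table data) pinned to PRIMARY;
        -- the nonclustered index picks up the default filegroup (SECONDARY).
        CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (SalesDateKey) ON [PRIMARY];
        CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey ON dbo.FactSales (CustomerKey);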

  • You're all getting confused and muddled over files and filegroups. To improve read/write performance you can split your database filegroup into multiple files (not filegroups), on the basis of one file for each processor core. These files can exist on the same array and should improve performance by utilising a read/write thread per processor (there can only be one thread per file). A quick way to compare your file count with your core count is sketched at the end of this post.

    You can further split into filegroups, each containing multiple files, but the filegroups should exist on separate dedicated spindles. A common problem is defining multiple LUNs on the same physical spindles in a SAN/NAS, which totally counteracts any gains.

    So: multiple files per filegroup.

    You can also gain by just putting your three RAID arrays into one. Since you lose one disk per RAID 5 array, combining them into a single array would actually gain you the added performance of two extra spindles, and your files would be spread across roughly three times as many spindles, which I suspect would give much better performance (for example, 18 spindles split into three 6-disk RAID 5 arrays give 3 x 5 = 15 data spindles, against 17 in one big array).

    Files (not filegroups) do balance automatically; you could of course split your files across multiple arrays too!

    Given that I hate RAID 5, if I had 18 spindles I'd create one big array rather than three smaller ones; I figure you'd get much better performance.
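
    As a rough sanity check on the one-file-per-core idea, you can compare the data file count with the scheduler count; this assumes SQL Server 2005 or later for the DMVs, and is run in the database you care about:

        -- Compare visible online schedulers (one per CPU core SQL Server uses)
        -- with the number of data files in the current database.
        SELECT
            (SELECT COUNT(*) FROM sys.dm_os_schedulers
             WHERE status = 'VISIBLE ONLINE')      AS cpu_schedulers,
            (SELECT COUNT(*) FROM sys.database_files
             WHERE type_desc = 'ROWS')             AS data_files;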

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Actually, if I remember correctly, the wider your RAID 5, the bigger the write performance hit you take on random writes, since you now have to do more reads to come up with the parity. Reads, on the other hand, tend to improve in the same scenario.

    So, depending on your activity, wider might not be better in RAID 5. If you do a lot of reading of existing data, then wider might work out better; otherwise I'd leave it alone. (Actually, no: I'd rip it out and replace it with RAID 10, but that's just me :) )
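
    For rough numbers on a single small random write to an N-disk RAID 5 (assuming the controller picks the cheaper of the two standard write methods): a read-modify-write costs 4 I/Os (read old data, read old parity, write new data, write new parity) regardless of width, while a reconstruct write has to read the other N - 2 data blocks in the stripe before writing the new data and parity, so it does get more expensive as the array gets wider. A RAID 10 mirror pair just does 2 writes.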

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Yeah, RAID 10 all the way. RAID 5 doesn't work that way as far as I know, as its overhead is on the parity disk, not the array. The more disks you add to RAID 5, the greater the chance of the array failing; it's a proportional increase, whereas RAID 10 is constant regardless of the number of spindles. Perhaps that's what you're thinking of?

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

    Nope. I'm referring to the fact that you've stored an XOR on the parity, so you have to pull all of the pieces (even if you updated only one) to evaluate the XOR... more pieces, more reads.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
