Memory lacking causing disk issues?

  • Shifty,

    Do get the wait stat information, it's important, but this also concerned me:

    Average disk sec/Trans

    P drive (RAID10 - 14 drives) 0.014

    G drive (RAID10 - 14 drives) 0.014

    K drive (where DB is, RAID10 - 14 drives) 0.041

    F drive (where tempdb, tlog are, RAID10 - 4 drives) 0.007

    C drive (OS, pagefile, RAID1) 0.005

    This should be around 10ms average, maxing out at 50ms or so. You're at 41ms average. That's drawn from some TechNet articles on Avg. Disk sec/Read and sec/Write. The question is why is that LUN so high?

    Have you already checked for fragmentation in this system? Has someone manually gone and looked at the hardware and made sure the configurations are intelligent for the hardware itself? Sure, if it's hitting the database a lot it might have a lot of reads, but it shouldn't have a longer time per read than the others unless something's up. From Perfmon itself:

    Avg. Disk sec/Transfer is the time, in seconds, of the average disk transfer.

    Can you add in the read/write counters for K:\ drive and see which, if either, is taking the longer piece? My guess is writes but I'm very curious.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Hi, shifty1981,

    Did you check Free List Stalls/sec? Free Pages? Memory Grants Pending?

    Do you have a lot of read-ahead reads?
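
    If you're not sure where to find these: they're all exposed through sys.dm_os_performance_counters, so a minimal sketch like this can check them from T-SQL without setting up Perfmon:

    -- Minimal sketch: memory-related counters straight from the DMV.
    SELECT [object_name], counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name IN ('Free list stalls/sec', 'Free pages', 'Memory Grants Pending');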

    You show %Disk Time at 7358%. If K is where your mdf files are (only the mdf, no other unrelated IO), you definitely have a lot of IO. If your setup is correct, you are short of memory.

    Can you confirm each of these drive volumes is its own RAID set, and not one big RAID array carved into different LUNs? (I would not think so.)

    Also run Diagnostics on drives to make sure disks are not deteriorating.

    Windows 2008 x64 and SQL Server 2008 will likely support as much memory as your hardware (motherboard) can hold. I would max it out. If you went through Christian Bolton's "walk down memory lane", you will know everything goes through RAM first before it goes to IO, reads or writes.

    Average Disk Queue Length

    K drive (where DB is, RAID10 - 14 drives) 71.22

    % DiskTime

    K drive (where DB is, RAID10 - 14 drives) 7358.7

  • Jason,

    %Disk Time and queue lengths are no longer valid measures if you're working against SAN systems. If you have independent RAIDs, that's different. They're really meant for single-spindle checks. With write caching, RAID controller interfaces, and the like, they're at best 'breathing' monitors. If you're suddenly breathing faster than normal, you might check into something. But each baseline is different, and no specific number is indicative of anything once you start looking at the SAN.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Dangit, what's with me and double-posting today? Sorry, ignore.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

    I think the problem could be index fragmentation and missing indexes, and maybe disk fragmentation too.

    Disk metrics for the K drive look pretty bad. Full Scans/sec is also not good.

    Page file usage is greater than 70% which could be a concern.

    Are you using Lock Pages in Memory ?

    Are you regularly defragmenting the indexes using reorganizes/rebuilds and updating the statistics?

    Latch Waits/sec also seems to be high, which is an indication of problems with the disk IO system.

    Try to find any missing indexes and add the high-impact ones; also check for any unused indexes and try to remove them after checking with the concerned parties.
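
    A minimal sketch of one way to surface the optimizer's missing-index suggestions from the DMVs (treat these strictly as suggestions to review, not to create blindly):

    -- Minimal sketch: missing-index suggestions, roughly ordered by estimated impact.
    SELECT d.statement AS table_name,
           d.equality_columns, d.inequality_columns, d.included_columns,
           s.user_seeks, s.avg_user_impact
    FROM sys.dm_db_missing_index_details AS d
    JOIN sys.dm_db_missing_index_groups AS g
      ON g.index_handle = d.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS s
      ON s.group_handle = g.index_group_handle
    ORDER BY s.avg_user_impact * s.user_seeks DESC;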

    BCHR should be > 90% and PLE should be at least 300 seconds for a normally working system.
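
    A minimal sketch for checking both from T-SQL (note that BCHR is a ratio counter, so it has to be divided by its base counter to mean anything):

    -- Buffer cache hit ratio: ratio counter, divide by its base.
    SELECT CAST(a.cntr_value AS float) * 100 / NULLIF(b.cntr_value, 0) AS bchr_pct
    FROM sys.dm_os_performance_counters AS a
    JOIN sys.dm_os_performance_counters AS b
      ON b.[object_name] = a.[object_name]
     AND b.counter_name = 'Buffer cache hit ratio base'
    WHERE a.counter_name = 'Buffer cache hit ratio';

    -- Page life expectancy: a plain value, in seconds.
    SELECT cntr_value AS ple_seconds
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page life expectancy'
      AND [object_name] LIKE '%Buffer Manager%';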

    Checkpoints/sec and Lazy Writes/sec also look fine on your system.

    So I hesitate to conclude that your system is experiencing memory pressure.

    Some of the counters you are collecting, like Full Scans/sec and Latch Waits/sec, are relative, so try to collect them along with their related counters for effective troubleshooting.

    Thank You,

    Best Regards,

    SQLBuddy

  • Here are the results:

    The K drive is on a SAN. Yeah, I'm not too excited about those stats. It's weird, though: we have an individual SAN just like that for each of our servers like this one (yes, they each have their own SAN, don't ask 🙂 ) and the others are getting fine stats. I'm not convinced the K drive is what's causing such bad paging results, though, because other systems have fine K drive stats and paging that's just as terrible.

  • Craig Farrell (1/7/2011)


    Shifty,

    Do get the wait stat information, it's important, but this also concerned me:

    Average disk sec/Trans

    P drive (RAID10 - 14 drives) 0.014

    G drive (RAID10 - 14 drives) 0.014

    K drive (where DB is, RAID10 - 14 drives) 0.041

    F drive (where tempdb, tlog are, RAID10 - 4 drives) 0.007

    C drive (OS, pagefile, RAID1) 0.005

    This should be around 10ms average, maxing out at 50ms or so. You're at 41ms average. That's drawn from some TechNet articles on Avg. Disk sec/Read and sec/Write. The question is why is that LUN so high?

    Have you already checked for fragmentation in this system? Has someone manually gone and looked at the hardware and made sure the configurations are intelligent for the hardware itself? Sure, if it's hitting the database a lot it might have a lot of reads, but it shouldn't have a longer time per read than the others unless something's up. From Perfmon itself:

    Avg. Disk sec/Transfer is the time, in seconds, of the average disk transfer.

    Can you add in the read/write counters for K:\ drive and see which, if either, is taking the longer piece? My guess is writes but I'm very curious.

    The stats for last night's peak usage time (7pm to 5am) were .008 Avg. Disk sec/Write, .050 Avg. Disk sec/Read, and .041 Avg. Disk sec/Transfer. It did seem surprising to me, but then I looked and that's true of most systems (higher read stats than writes).

    As for fragmentation, what do you mean? And what do you mean about intelligent configurations? Can you be more specific?

  • jswong05 (1/7/2011)


    Hi, shifty1981,

    Did you check Free List Stalls/sec? Free Pages? Memory Grants Pending?

    Do you have a lot of read-ahead reads?

    I don't know what these are. Are they in Perfmon? Is there a good article pointing to what values we should strive for, and what it means when they are well above or below those?

    You show %Disk Time at 7358%. If K is where your mdf files are (only the mdf, no other unrelated IO), you definitely have a lot of IO. If your setup is correct, you are short of memory.

    The K drive is where one filegroup lives, containing an ndf file for one of our most active tables/indexes; there's only one file in it. I'm guessing %Disk Time is high, though I'm a bit rough on what that means; however, since it's a SAN, I remember reading it's not odd for it to be above 100%. There are 14 drives, I think, in a RAID10 set. Lastly, can you comment on what specifically led you to say we're short on memory? Thanks.

    Can you confirm each of these drive volumes is its own RAID set, and not one big RAID array carved into different LUNs? (I would not think so.)

    Confirmed. C and E are 2 LUNs on one RAID1 set, but I pulled the physical stats for that set, not the logical ones.

    Also run Diagnostics on drives to make sure disks are not deteriorating.

    Windows 2008 x64 and SQL Server 2008 will likely support as much memory as your hardware (motherboard) can hold. I would max it out. If you went through Christian Bolton's "walk down memory lane", you will know everything goes through RAM first before it goes to IO, reads or writes.

    Never heard of this before. A book or an online article? And what tool do you suggest for disk diagnostics? I hope not chkdsk; that can hose a SAN.

  • Craig Farrell (1/7/2011)


    Jason,

    %Disk Time and queue lengths are no longer valid measures if you're working against SAN systems. If you have independent RAIDs, that's different. They're really meant for single-spindle checks. With write caching, RAID controller interfaces, and the like, they're at best 'breathing' monitors. If you're suddenly breathing faster than normal, you might check into something. But each baseline is different, and no specific number is indicative of anything once you start looking at the SAN.

    Thanks for the comment. I would appreciate us discussing this, however, because this server is rare in that the DB is on a SAN. Most of our environments have the DB on an internal disk RAID10 set. Actually, on this server, part of the DB is on that internal set (F drive). They moved part of the high-IO table/index to this SAN volume because they needed more space and figured the SAN would be faster. I wasn't here for that decision, and the CEO is ticked that we still didn't meet our marks.

    That being said, is %Disk Time relevant for the internal disk sets? And for the queue lengths, isn't it valid as long as we do the math (2 per disk is the goal)?

    Lastly, if SAN disk cache is the concern with these stats, what about the cache on our RAID controller card in the server? Those have either 256, 512, or 1024 MB. And while this may not be the best setup, I think the previous admin set all volumes to write-back mode, since the cards have batteries.

  • sqlbuddy123 (1/7/2011)


    I think the problem could be index fragmentation and missing indexes, and maybe disk fragmentation too.

    Disk metrics for the K drive look pretty bad. Full Scans/sec is also not good.

    Page file usage is greater than 70% which could be a concern.

    Are you using Lock Pages in Memory ?

    Are you regularly defragmenting the indexes using reorganizes/rebuilds and updating the statistics?

    Latch Waits/sec also seems to be high, which is an indication of problems with the disk IO system.

    Try to find any missing indexes and add the high-impact ones; also check for any unused indexes and try to remove them after checking with the concerned parties.

    BCHR should be > 90% and PLE should be at least 300 seconds for a normally working system.

    Checkpoints/sec and Lazy Writes/sec also look fine on your system.

    So I hesitate to conclude that your system is experiencing memory pressure.

    Some of the counters you are collecting, like Full Scans/sec and Latch Waits/sec, are relative, so try to collect them along with their related counters for effective troubleshooting.

    Thank You,

    Best Regards,

    SQLBuddy

    I don't know the answers to most of your questions. I'll have to ask our lead developer and resident DBA (though not officially). Part of the issue is the size: our DB is between 500GB and 2TB on our servers, and the indexes can be upwards of 75% of that total size (well, at least one is; the others are in the hundreds of gigs). I'll ask about the rebuilding of indexes.

    Can you recommend a good article that lists goal stats that I can refer to when you say "this is high, that is good"? And what other counters would you suggest I watch?

    Thanks so much!

  • shifty1981 (1/7/2011)


    Thanks for the comment. I would appreciate us discussing this, however, because this server is rare in that the DB is on a SAN. Most of our environments have the DB on an internal disk RAID10 set. Actually, on this server, part of the DB is on that internal set (F drive). They moved part of the high-IO table/index to this SAN volume because they needed more space and figured the SAN would be faster. I wasn't here for that decision, and the CEO is ticked that we still didn't meet our marks.

    That being said, is %Disk Time relevant for the internal disk sets? And for the queue lengths, isn't it valid as long as we do the math (2 per disk is the goal)?

    Lastly, if SAN disk cache is the concern with these stats, what about the cache on our RAID controller card in the server? Those have either 256, 512, or 1024 MB. And while this may not be the best setup, I think the previous admin set all volumes to write-back mode, since the cards have batteries.

    I'll get to the above in a second.

    62% of your system waits are Disk I/O (PageIOLatch). That's a drive problem.
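
    If you want to re-derive that breakdown yourself, here's a minimal sketch (the idle waits filtered out below are illustrative, not an exhaustive list):

    -- Minimal sketch: top waits as a percentage of total wait time since restart.
    SELECT TOP (10)
           wait_type,
           wait_time_ms,
           100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_total
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'BROKER_TASK_STOP',
                            'SQLTRACE_BUFFER_FLUSH', 'WAITFOR')
    ORDER BY wait_time_ms DESC;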

    Your comment above about fragmentation concerns me. You've been brought in as the expert and you're not familiar with logical vs. physical fragmentation. It's time to do some research.

    Read the following links:

    http://technet.microsoft.com/en-us/library/cc966523.aspx - Explanation of Fragmentation, technet whitepaper.

    http://www.sql-server-performance.com/articles/per/detect_fragmentation_sql2000_sql2005_p1.aspx An article on methods for detecting fragmentation. That article doesn't mention that you only care about an index once it reaches ~1000 pages or more.
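
    A minimal sketch of that check for the current database, skipping the small indexes mentioned above:

    -- Minimal sketch: logical fragmentation per index, ignoring indexes under ~1000 pages.
    SELECT OBJECT_NAME(ips.[object_id]) AS table_name,
           i.name AS index_name,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.[object_id] = ips.[object_id]
     AND i.index_id = ips.index_id
    WHERE ips.page_count >= 1000
    ORDER BY ips.avg_fragmentation_in_percent DESC;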

    Two articles on physical fragmentation:

    http://www.mssqltips.com/tip.asp?tip=1481

    http://stackoverflow.com/questions/719849/is-physical-file-fragmentation-of-the-mdf-file-an-issue A short and sweet version of the above.

    Now, regarding the above comments on disk usage and queues...

    When they split the database, did they create different internal logical filegroups, or did they simply add a file to the PRIMARY filegroup? The reason I ask is that if you didn't use a separate logical filegroup, you have literally no idea where your data is stored; it could be in either chunk of the .ndf files.
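
    A minimal sketch to check that (run it in the database in question; log files won't appear because they don't belong to a filegroup):

    -- Minimal sketch: which files belong to which filegroup.
    SELECT fg.name AS filegroup_name,
           df.name AS logical_file_name,
           df.physical_name,
           df.size * 8 / 1024 AS size_mb   -- size is stored in 8 KB pages
    FROM sys.filegroups AS fg
    JOIN sys.database_files AS df
      ON df.data_space_id = fg.data_space_id
    ORDER BY fg.name;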

    Now, those stats are useful for dedicated spindles, such as an internal or dedicated disk array. % of disk time is taken across all spindles, so divide the percentage by the number of spindles in the array for a solid idea of your disk access time. Realize that this can still be inaccurate if your OS is using the same spindles, but different logical LUN, on the spindle set if you vertically stripe it to logical partitions. The same happens with queues... but this needs elaboration.

    The Disk Queue was mostly important when there was no such thing as a read/write cache at the hardware level for RAID controllers. What this indicated was the number of threads in the OS waiting for the spindles to give it control to do something with the hard drives. This was incredibly important with the log file, and obviously impacted any heavy write systems.

    However... and this is the standard answer these days for SQL Server... does it matter to you? It depends. It depends on your hardware. It depends on the software controlling the hardware. It depends on how the physical spindles are striped logically. I can't answer this question directly.

    Write-back mode is probably your best bet, though I've personally been bitten by that format during a series of catastrophic failures. It's rare, though, and we finally tracked down my year of hell to a short in the wiring, so your mileage may vary. In general it's supposed to be very stable.

    Everything you've posted and discussed points to a drive access issue. 0.050 read time is incredibly high, to the point that I'd be harping at the network admins to go physically check the hardware and make sure I didn't have a dead spindle in a RAID 5 array. Talk to your network admins and make sure you're not sharing physical spindle space from your LUNs with other hyperactive data-access software. You mention the SAN is dedicated. If it is, something is horribly, horribly wrong here... besides the fact that your network gurus hate you. 😉

    I would consider the following troubleshooting steps:

    1) Confirm there are no serious hardware issues.

    2) Confirm that you have removed or reduced physical and logical fragmentation to a reasonable level, in that order.

    3) Confirm your statistics are all up to date.

    4) Review all commonly called SQL statements by running a server-side trace for 24 hours, and then taking a count of the most commonly called statements (TextData column and some manual interpretation; see the sketch after this list).

    4a) Optimize the top 10 calls (roughly; approach this with an open mind, it might be four, might be forty) to reduce scans on indexes or tables.

    5) Take Old Yeller out back and do what's necessary. 😛 If you still can't deal with these serious latency issues after this point, it will need a redesign. I would call in one of the gurus and have them go over your system with a fine-toothed comb.
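
    For step 4, once the trace file exists, here's a minimal sketch of the counting step (the file path is hypothetical; point it at your own .trc output):

    -- Minimal sketch: bucket the most commonly called statements from a trace file.
    SELECT TOP (10)
           CAST(TextData AS nvarchar(4000)) AS statement_start,  -- first 4000 chars is enough to bucket
           COUNT(*) AS times_called
    FROM sys.fn_trace_gettable(N'C:\traces\mytrace.trc', DEFAULT)
    WHERE TextData IS NOT NULL
    GROUP BY CAST(TextData AS nvarchar(4000))
    ORDER BY COUNT(*) DESC;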


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Some good advice from Craig here. If you need a consult, even remotely, I have a few people I would recommend. They can likely give you some suggestions for a few hundred $$$.

    Craig gave a good explanation. Please check the following articles too:

    http://www.mssqltips.com/tip.asp?tip=1708

    Scripts to do index maintenance:

    http://ola.hallengren.com/

    http://sqlfool.com/2009/03/automated-index-defrag-script/

    Thank You,

    Best Regards,

    SQLBuddy

    Hi everyone. Sorry for the delayed responses; I was gone for a while and then had to catch up at work. There's much to give an update on, so I'll do my best and see where we can go from here. I'll add one caveat: I was brought on as a jack-of-all-trades "expert", in large part because of my mentality toward solving issues and taking the initiative to learn and find out where I fall short. I'm certainly not a DBA or performance expert. Most of my administration has been on the application side, and quite frankly no environment I've ever worked with had noticeable performance issues that required such deep digging. Thank you for helping us through this. If necessary, we may look to someone for help on a consulting basis. There are specific things that we know need to be implemented, but we don't know how, and the environment isn't stable enough for me to go to training yet. OK, enough said. Let's share some info:

    At the request of the CEO (it's an SMB with an involved CEO), we took 32 GB of memory from a decommissioned server that was exactly the same as this one and added it, allocating all of it to SQL. So now 64 GB is allocated out of 96. Additionally, the DBA determined we had about a billion rows in our main table that we no longer needed (expired data), so we removed them and now have only about 1.4 billion. I re-ran that query and the results are below:

    I have been told that rebuilding our indexes typically takes about 2 hours and that it puts a pretty hefty load on the server. However, I'm hoping to find a query that I could run daily (like midday, our lowest-usage time) to determine the index fragmentation and then somehow decide if we should rebuild. We're using SQL Server 2008 R2 Enterprise x64 edition, so my understanding is that this can be done while the database is online, right? The other admin said he'd been trying to reindex every month, but there's no certainty he actually remembered to do this. I'm hoping there might be an event log or something in SQL I can query and then track. We have tools that can do this easily if given a query (either SQL or an event ID to look for).

    I'm not sure how to determine if we are using Lock Pages in Memory. I think we are, because on more than one question the lead developer has said "I made sure this new XYZ query/tool did not use locks", which leads me to believe sometimes they do. Is there a simple way to verify this?

    I'm not sure what sqlbuddy means when he says "find missing indexes". The lead developer seems to think we have already created the indexes that are important for our database, as these get 99% of the activity. One index is clustered for a certain order on disk, and there are 4 other non-clustered indexes as well. On a server where we were able to separate the non-clustered indexes from the clustered one and also from the rest of the DB, the stats on the "rest of the DB" are excellent as far as the disk is concerned, which further supports the idea that we're not hitting whatever other tables are left very hard.

    We have another server that's new but very heavy on activity and database size. It's a Dell R510, and we separated it as follows. I'm only sharing this so I can show you the stats from that query for comparison. Everything is the same except the number of drives and thus how the DB is separated out. It is using internal disks for the database:

    C - RAID1 - OS+pagefile

    I - RAID10(4drives) - 4 non-clustered indexes

    L - RAID1 - tempdb and transaction logs (remember we're in simple mode). We used to have each of these on its own RAID1 set, but the disk stats were excellent and the sizes were small, so we decided to let them share disks. This is a case of the uniqueness of our environment: our whole database is typically brand new every 10 days, as data coming in expires and is deleted.

    V - RAID10 (4drives) - clustered index of our main table

    F - RAID1 - rest of the DB

    Anyway, here are the stats:

    On the original server in question there are 2 filegroups: the primary and a new one. The new one contains the main table, the clustered index, and the non-clustered indexes; they are in one file. The primary filegroup is on a different RAID set and is the main MDF file. On the new server, the NC indexes are in a single filegroup on the RAID10 set, the clustered index is in its own RAID10 set, and the rest is in the main MDF file. Our next goal is to separate the 4 NC indexes into individual filegroups (files) so we can start monitoring and see if a specific one of them is the culprit for the heavy load. On this new server, that NC index RAID set gets hammered way more than any other RAID set we have.

    All %Disk Time stats are against physical disks; the OS is not on the same disks/RAID set as the main database. I need to read more about %Disk Time because I'm a bit confused about its meaning and what target to shoot for. Again, we're not talking about a SAN yet; the poor stats are on the internal disks (which are using a physical RAID controller from Dell with either 256, 512, or 1024 MB of cache).

    As for write-back mode, they are confident in it for two reasons. 1 - The caches are battery backed, and they do a full backup every day, so their SLA is OK with an occasional crash; they run tools to check for bad data, and if it exists our clients just resend it. 2 - We are looking into getting non-volatile cache in our RAID controllers. It's relatively cheap compared to the cost of the server as a whole, so we figured why not?

    As it turns out, in this small company I am the network admin too. Thus I know for a fact that none of our servers are running RAID5 sets and no servers have a failed drive. There is no other application on these servers except ours. There is AV they put on, so one of my goals is to double-check its scan settings, filter out the DB volumes, etc. This may be a bad sign, but since the people here are used to Avg. Disk sec/Transfer in the hundreds of milliseconds, they are pretty excited to see a server between 20 and 50 ms. I keep saying "no, even that is bad", but they feel it's probably the best they can do without buying a SAN or DAS for each database, which kills our ROI by a bunch.

    While I do think rebuilding indexes to address fragmentation would be helpful, I am a bit leery of disk fragmentation. It would likely take forever to run with SQL offline. Additionally, as mentioned, the data coming in only lasts about 10 to 14 days, so even if we did defrag, it would need to be done over and over, and possibly by the time it's done there isn't much lifespan left for the data. Just a thought. Maybe I'm wrong?

    Running the server-side trace for 24 hours would be fantastic; no one here knows how. In fact, our current thought is this: 1. We're looking at the hardware performance. 2. No one knows how SQL is performing (the queries). 3. Once those are fixed/determined, we can go to the developers and say "fix these bugs/poorly written queries". That middle part, number 2, is the key part. Much to learn here.

    Thanks everyone. Believe it or not, though I feel like I'm about to be crushed under the knowledge dump I've been getting since joining this company, I absolutely love learning all this stuff!

  • shifty1981 (1/19/2011)


    ...

    I have been told that rebuilding our indexes typically takes about 2 hours and that it puts a pretty hefty load on the server. However, I'm hoping to find a query that I could run daily (like midday, our lowest-usage time) to determine the index fragmentation and then somehow decide if we should rebuild.

    Need index maintenance? Have a look at IndexOptimize at http://ola.hallengren.com/Documentation.html

    Of course, you'll also find that kind of script here at SSC.

    Attached you can find the proc I'm using to only rebuild indexes that have a certain amount of fragmentation.
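
    In case the attachment doesn't come through, here's a minimal sketch of the same idea (the 10%/30% thresholds are the commonly cited rule of thumb, not gospel; tune them for your system):

    -- Minimal sketch: reorganize lightly fragmented indexes, rebuild heavily fragmented ones.
    DECLARE @cmd nvarchar(max) = N'';

    SELECT @cmd = @cmd
         + N'ALTER INDEX ' + QUOTENAME(i.name)
         + N' ON ' + QUOTENAME(OBJECT_SCHEMA_NAME(ips.[object_id]))
         + N'.' + QUOTENAME(OBJECT_NAME(ips.[object_id]))
         + CASE WHEN ips.avg_fragmentation_in_percent >= 30
                THEN N' REBUILD WITH (ONLINE = ON);'  -- Enterprise only; fails on LOB columns (see below)
                ELSE N' REORGANIZE;'
           END + NCHAR(10)
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.[object_id] = ips.[object_id]
     AND i.index_id = ips.index_id
    WHERE ips.avg_fragmentation_in_percent >= 10  -- skip barely fragmented indexes
      AND ips.page_count >= 1000                  -- skip small indexes
      AND i.name IS NOT NULL;                     -- skip heaps

    PRINT @cmd;     -- review the generated commands first
    -- EXEC (@cmd); -- then run them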

    We're using SQL Server 2008 R2 Enterprise x64 edition, so my understanding is that this can be done while the database is online, right?

    There are some limitations. ref: http://technet.microsoft.com/en-us/library/ms190981.aspx

    Online rebuilds cannot be done on indexes that contain image, ntext, text, varchar(max), nvarchar(max), varbinary(max), or xml columns.

    If rebuilding a clustered index, keep in mind that it holds the actual data pages at the leaf level! So the limitations also apply to that one.

    The other admin said he'd been trying to reindex every month, but there's no certainty he actually remembered to do this. I'm hoping there might be an event log or something in SQL I can query and then track. We have tools that can do this easily if given a query (either SQL or an event ID to look for).

    If you schedule this maintenance script using a SQL Agent job, you'll have your logging in the job history.
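
    For example, a minimal sketch against the msdb history tables (the job name here is hypothetical):

    -- Minimal sketch: recent outcomes for a maintenance job.
    SELECT j.name,
           h.run_date, h.run_time, h.run_duration,
           CASE h.run_status WHEN 1 THEN 'Succeeded'
                             WHEN 0 THEN 'Failed'
                             ELSE 'Other' END AS outcome
    FROM msdb.dbo.sysjobs AS j
    JOIN msdb.dbo.sysjobhistory AS h
      ON h.job_id = j.job_id
    WHERE j.name = N'Index maintenance'  -- hypothetical job name
      AND h.step_id = 0                  -- step 0 = overall job outcome
    ORDER BY h.run_date DESC, h.run_time DESC;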

    I'm not sure how to determine if we are using Lock Pages in Memory.

    Lock Pages in Memory applies to your SQL Server instance so that it will not page out memory when other stuff on your server needs RAM. It is granted at the Windows level to your SQL Server service account.

    SQL Server locking is internal to SQL Server only and is a whole different chapter from Lock Pages in Memory.

    There is a DMV exposing this in column locked_page_allocations_kb:

    SELECT * -- yes, * to show all available columns; pick the ones you are interested in later
    FROM sys.dm_os_process_memory;

    Also, at startup time, your SQL Server instance records a row in the SQL Server errorlog file if the service account has been granted "Lock pages in memory" at the Windows level.

    The recorded message is : Using locked pages for buffer pool

    (double check your errorlog file hasn't been rolled over since startup or you'll not find this message)
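
    If you'd rather not scroll through the log by hand, a minimal sketch using the (undocumented but widely used) search parameters of xp_readerrorlog:

    -- Minimal sketch: search the current errorlog (0) of the SQL Server log type (1).
    EXEC sys.xp_readerrorlog 0, 1, N'locked pages';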

    refs:

    - How to: Enable the Lock Pages in Memory Option (Windows)

    http://msdn.microsoft.com/en-us/library/ms190730.aspx

    - Lock Pages in Memory ... do you really need it? http://blogs.technet.com/askperf/archive/2008/03/25/lock-pages-in-memory-do-you-really-need-it.aspx

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me
