Andrew Kelley, SQL Server MVP, is a good speaker, so I'd recommend his sessions if you want good technical information.
IO Stats are a snapshot in time since the instance was restarted, so you must have at least two reads to compare the data. Stats are a snapshot. They are not reset until the instance is restarted. This means that you must be storing this data between checks to have any meanungful data. A single read from the stats DMVs doesn't give you a reference as to what the server is doing.
NOTE: Auto close does reset the stats (not sample_ms, but the counters). Be aware of this, especially with smaller applications/databases.
Solid State DIsks - Not many people using because of costs, but more and more they are in places, tempdb, pagefile, etc.
sys.dm_io_virtual_file_stats - Stores data from all databases, holds dbid and fileid to differentiate. Some changes in SQL 2008 that help.Some countere are separated out instead of being totals (wait stats, I think).
FileStats - Reported numbers are physical I/O, not logical. Logical can be many times larger than physical.
Don't forget tempdb. Lots of tempdb usage and it's easy to forget about it and concentrate on your user database.Often tempdb is used more than extensively than people realize and have a huge performance impact on the server.
Use FileStats with WriteLog waits to see how efficient your log writes are. Log writes are sequential writes, so these can slow down your activity. T1 must complete writing before T2, before T3. If T1 is delayed, all other transactions are delayed. Data writes can be deferred until there is more time and the changes held in memory.
Don't use multiple log files. No performance advantage for these. There is an exception in extremely large databases, but not for performance.
Backups can skew the results. Account for them in watching stats. These are complete physical IO. Transaction log and database operations are both included here. Backups will have multiple reads. Each read can be up to xx (512kb??) bytes. Accounting for this can be tricky. Not sure about verifying backups. Should account for reads. The account for this, you need a snapshot before and after the backups to get an idea of the reads used by the backups.
Filestats are easy to capture, so there's no excuse to examine these on a regular basis. Frequnecy might vary. Daily, weekly, what you need. Be sure you review the reports. Capturing the data doesn't help if you don't examine it. This data is important for a baseline.
Code:
DBCC DROPCLEANBUFFERS -- cleans out databases. Useful for testing.
Gather stats, build table with same structure of table, then load with snapshot from the DMV. Include the datestamp for the data.
Join with system DMVs for friendly names of files and databases.
-----
All these counters point to hardware issues. Not that hardware is bad, but it's a delay with OS, drivers, controllers, disk, cables, etc. Configuration can be an issue here.
Log writes should be very short, so stalls should be < 2ms. Data writes can be longer, but they can impact things as well if they grow too much above 10ms.
Don't assume a SAN is fast. Make sure that you are checking and verifying there are not delays on those IO paths.
The counters include aggregates from all threads. So 5s of wait or stall doesn't mean that it was 5 physical stats. If you had 5 threads, then you could have 1 physical sec of wait across each thread. Be aware of parallelism here.
Some common problems:
High stalls on writes
- - add more write-back cache on the controller. This speeds up writes from the instance.
- You can change the ratio to be 100% writeback instead of read. SQL Server doesn't necessarily benefit from read caches on disk.
- Also be sure that you are R1 or R10 or R01, not R5.
- If possible add spindles to the array.
- Maintain physical separation of data and logs on disks.
- And don't put non-SQL Server traffic on the array.
High Stalls on reads
- optimize queries., index, tune, reduce scans or translate into seeks.
- Same, make the controllers 100% writeback.
- Use same disk strategies from writes.
- Don't forget tempdb
Wait Stats
Anytime SQL Server has to wait for something, it records an event (WaitType). Time is in ms.
In 2005 there were 194 types, and more in 2008. In 2005 there wre some that were hidden, but more are exposed in 2008.
These numbers are cumulative from the last restart or the last clear. Users can reset these counters here, so be aware of this.
DMV is sys.dm_os_wait_Stats. Contains the type, the count, wait time in ms, max wait time, and signal wait time. Be aware that the max wait time is from the last reset, not between samples. Signal wait time is time from when resource is available until it is used. Wait time is from time resource needed until IS IT USED, not untiil it is available.
Wait stats can narrow down where bottleneck is. If there is a lot of signal time, you might have CPU pressure. This replaces dbcc sqlperf(waitstats) from SQL 2000.
dbcc sqlperf('sys.dm_os_wait_stats', clear)
Be careful of OLEDB waits.
If you report on these stats you'll see the top types of waits, which can help you diagnose where the slowdowns are in your system.
Common Waits
CXPacket - From inefficient parallel processing, where one or more threads are waiting on things to finish. Hyperthreading can add to the problem here. Consider adjusting MAXDOP here to reduce this (at server or query level).
Locks (LCK_xxx) - Long running transactions or blocking. Tune, index, reduce contention.
ASync_NETWORKIO - client isn't getting data as quickly as SQL sends. May be network issues, but likely the client is to blame.
IO (PAGEIOLATCH, IO_COMPLETION, WRITELOG) - Physical IO waits. Waiting on disk into or from memory. Storage subsystems issues here. Writelog should be as log as possible.
Latches (pagelatch_xx) - Not IO related, this is contention with internal resources. Heaps and LOBs can cause latching waits. Lots of inserts into one page can cause issues here along with page splits.