February 6, 2010 at 3:58 pm
I run a high-availability online game with SQL Server 2005. I'd put my level of SQL knowledge at around high intermediate, meaning that I know enough to identify a lot of common problems, figure out what indexing is helpful, hand-optimize procedures that need it, and catch things like parameter sniffing problems. In other words, enough to get myself into some serious trouble when something more complicated comes along. 🙂 But also enough to be able to follow directions if you can give me some good advice.
I'm seeing a large number of PAGEIOLATCH_SH waits, resulting in everything from the site being slow to background processes being delayed. If I increase the degree of parallelism, these PAGEIOLATCH_SH waits are mixed with CXPACKET. Our database consists of about a 400 GB archive of read-only data and a 150 GB operational database running on a 4-processor machine with 32 GB of RAM running Windows Server, SQL Server 2005, and nothing else.
The disk is a SAN, RAID-configured, with 1 TB built from 10k RPM, 146 GB disks. Both databases run off the same RAID array.
Some data from the Performance Monitor, at about 75% of peak load:
% Processor Time: 15.478
Avg. Disk Queue Length: 20.838
Avg. Disk sec/Read: 0.042
Avg. Disk sec/Write: 0.023
Buffer cache hit ratio: 99.498
Checkpoint pages/sec: 987.523
Lazy writes/sec: 0.680
Latch waits/sec: 198.497
Average Latch Wait Time (ms): 26.045
Lock waits/sec: 5.354
Average Lock Wait Time (ms): 131.016
Full Scans/sec: 591.853
I believe most of the full scans are from a pair of commonly-accessed tables that have either 1 or 2 total rows and contain configuration data.
I know very little about how to properly assemble disks, but I'm happy to learn. Is this a disk problem or a database structure problem? What can I do to improve things?
Thanks for any help!
February 13, 2010 at 7:57 am
You mentioned changing the level of parallelism. Are your queries being spread out across multiple CPUs? If this is an OLTP database (which it sounds like it is), you need to get rid of that. In my experience, and I have seen this a lot, this is a query tuning problem. You mentioned that you are able to find appropriate indexes and such; I think you may want to look there, or at least make sure that good indexes are being used properly. I really mean that: parallelism is very bad in an OLTP system, because it means the query plan is so complex that the server would rather incur the overhead of splitting the workload across CPUs and combining the results than run the whole thing on a single CPU, which really suggests a suboptimal query plan.
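For reference, the instance-wide setting is changed through sp_configure; here is a minimal sketch (MAXDOP 1 is just an example value, not a blanket recommendation, and any change should be tested under load):

-- Cap parallelism at the instance level so OLTP queries stay on one scheduler.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;  -- 1 = no parallel plans
RECONFIGURE;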
My first step in resolving this is to look at the actual query plan being generated by your query and start looking for really thick arrows between the joined tables (or at least the thickest arrow). If you know how many records are in your tables, you should be able to figure out where too much data is being read by a part of your plan and isolate your tuning to those areas. Most of the time, there is something that looks like a Cartesian product happening in there: mousing over the arrow says that 30,000,000 rows are being read, or some other huge number, when you know it should only be looking at 100. This can happen even when there are good indexes but they are not being used correctly, such as a multi-column index where someone has forgotten to use the first column of the index in the join or in the WHERE clause.
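As a hypothetical illustration (the table and index names below are made up), this is the kind of composite-index miss I mean:

-- Composite index whose leading column is PlayerId.
CREATE TABLE dbo.PlayerEvent (
    PlayerId  int          NOT NULL,
    EventDate datetime     NOT NULL,
    EventType int          NOT NULL,
    Payload   varchar(200) NULL
);
CREATE INDEX IX_PlayerEvent_PlayerId_EventDate
    ON dbo.PlayerEvent (PlayerId, EventDate);

-- Can seek: the leading column (PlayerId) appears in the predicate.
SELECT EventType FROM dbo.PlayerEvent
WHERE PlayerId = 42 AND EventDate >= '20100101';

-- Cannot seek: PlayerId is missing, so filtering on EventDate alone
-- typically forces a scan of the index or the table.
SELECT EventType FROM dbo.PlayerEvent
WHERE EventDate >= '20100101';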
I hope this helps. If this could be the issue and you have difficulty finding it, post your query plan to this thread and I'll take a look. Maybe two heads really are better than one.
Tim Januario
February 14, 2010 at 5:42 am
The following quote is taken from the Microsoft Whitepaper http://msdn.microsoft.com/en-us/library/cc966413.aspx
"Occurs when a task is waiting for a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits of this kind indicate a problem with the disk subsystem."
"Latches are short term synchronization objects. used to synchronize access to buffer pages. PageIOLatch is used for disk to memory transfers.If this is significant in percentage, it typically suggests disk IO subsystem issues. Check disk counters."
So I'm guessing that if the wait PAGEIOLATCH_SH is significant then it could possibly indicate an IO issue.
I did come across a great script that queries the sys.dm_os_wait_stats DMV and sums the largest waits so you get a good idea of where the hold-up is in your system... I can't find it now though... Sorry!!!
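I can't vouch for the exact script, but a minimal sketch of that kind of wait-stats summary would look something like this:

-- Top waits by total wait time, excluding a few common idle/background waits.
SELECT TOP 10
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
                        'WAITFOR', 'BROKER_TASK_STOP', 'BROKER_RECEIVE_WAITFOR')
ORDER BY wait_time_ms DESC;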
At the SAN level, is the LUN shared with anything else?
How is your tempdb configured? (A quick check is sketched after these questions.)
Do you get any I/O-related errors in the log? Timeouts and the like?
Has the system been performing OK and then fallen off a cliff?
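For the tempdb question, a quick sketch of how to see the current file layout (equally sized data files help spread allocation contention):

-- tempdb data and log files, with sizes converted from 8 KB pages to MB.
SELECT name, physical_name, type_desc, size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files;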
Gethyn Ellis
www.gethynellis.com
February 14, 2010 at 7:32 am
Those disk stats don't look too bad. sec/read is a little high, sec/write slightly high.
You say that you think the problems come from 2 frequently accessed tables... If a table is frequently accessed, it will likely be in memory, and hence there won't be any disk activity resulting from accessing it.
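If you want to verify that, here's a rough sketch (approximate sizes, and it can be slow on a large buffer pool) that shows how much of each table in the current database is cached:

-- How much of each object is currently in the buffer pool (in-row and row-overflow data).
SELECT o.name AS object_name,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors bd
JOIN sys.allocation_units au ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions p ON p.hobt_id = au.container_id
JOIN sys.objects o ON o.object_id = p.object_id
WHERE bd.database_id = DB_ID()
  AND au.type IN (1, 3)
GROUP BY o.name
ORDER BY cached_mb DESC;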
Have you looked at tuning the queries that are often getting latch/cxpacket waits? If necessary, post the queries here with exec plans and we can take a look.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 15, 2010 at 10:50 am
Without a lot more information, my guess is that you don't have enough spindles. 42 ms/IO isn't good for an OLTP system and is marginal for OLAP.
You might be able to tune poorly performing queries to help out here.
There could be other problems causing I/O stalls as well.
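To see where those stalls land, one option (a sketch, not a full diagnostic) is to check per-file latency from the virtual file stats DMV:

-- Average read/write latency per database file since the instance started.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name AS file_name,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0)   AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf
    ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_read_ms DESC;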
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
February 15, 2010 at 12:30 pm
Thanks to everybody for the suggestions.
As an online game, we're probably closer to OLAP than OLTP. I can't post a query plan of one particularly bad query because the problem is that any of the hundreds of different stored procedures called either by ASP.NET or by back-end modules fail with latching errors.
We do get timeouts when the wait time becomes long enough. It seems like a generic problem rather than one caused by a particular query, although there are a few tables commonly used by a large number of SPs. I guess what I really need help with is knowing what to look at in order to figure out what might be going wrong.
If the disk I/O problem comes from joining tables, is there a good way to improve locality of reference? i.e., to take pairs of tables that are commonly joined and force them to have good locality (does a view do this in a physical sense)?
Finally, I've read someplace that 64GB is the maximum database size recommended by Microsoft. If this is correct, what should I be doing if I have a larger database?
February 15, 2010 at 2:03 pm
What database is experiencing the page latch waits? If it is in tempDB, then this could be a sign of tempDB contention, and I would not be surprised if your tempDB is not set up properly (most aren't).
Check the output of this query the next time you're experiencing the page latch waits:
-- Lists tasks currently waiting on page latches and flags whether the page
-- is one of the allocation pages (PFS/GAM/SGAM) that signal tempDB contention.
Select wt.session_id,
    wt.wait_type,
    wt.wait_duration_ms,
    wt.blocking_session_id,
    wt.resource_description,
    ResourceType = Case
        -- PFS pages: page 1 of each file, then every 8088 pages
        When pg.page_id = 1 Or pg.page_id % 8088 = 0 Then 'Is PFS Page'
        -- GAM pages: page 2 of each file, then every 511232 pages
        When pg.page_id = 2 Or pg.page_id % 511232 = 0 Then 'Is GAM Page'
        -- SGAM pages: page 3 of each file, then every 511232 pages
        When pg.page_id = 3 Or (pg.page_id - 1) % 511232 = 0 Then 'Is SGAM Page'
        Else 'Is Not PFS, GAM, or SGAM page'
    End
From sys.dm_os_waiting_tasks wt
-- resource_description is "dbid:fileid:pageid"; take everything after the last colon
Cross Apply (Select page_id = Cast(Right(wt.resource_description,
    Charindex(':', Reverse(wt.resource_description)) - 1) As Int)) pg
Where wt.wait_type Like 'PAGE%LATCH_%'
February 15, 2010 at 3:14 pm
I've read someplace that 64GB is the maximum database size recommended by Microsoft
You might have that confused with how much RAM Windows Server 2003 Datacenter Edition supports (64 GB). This link might help you:
http://msdn.microsoft.com/en-us/library/ms190731.aspx
In SQL Server 2008 the max DB size is 524,272 terabytes.
In SQL Server 2005 the max DB size is 524,258 terabytes.
These values are taken from
http://technet.microsoft.com/en-us/library/ms143432(SQL.90).aspx
I have supported many DBs in excess of 64 GB, all on Standard Edition or above, but I'm not sure about other editions further down the chain.
Gethyn Ellis
www.gethynellis.com
February 15, 2010 at 3:47 pm
Also, what game is it? If it's one that some of us play, someone may be willing to donate their time so that their game will run faster. 😀
February 15, 2010 at 3:48 pm
The output appears to consistently be "Is Not PFS, GAM, or SGAM page".
February 15, 2010 at 4:00 pm
Did you notice which database it is in? Using the above query, the database ID is the first number in the resource_description column. You can get the name of the database by using that number as input for the db_name() function.
For example, if the resource_description is 2:8:3249839, then you could get the database name using: Select db_name(2).
In the case of the example above, database ID = 2 is always tempDB.
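If it helps, here is a variation of the earlier query (a sketch, assuming the usual dbid:fileid:pageid format of resource_description) that pulls the database name out directly:

-- Waiting page-latch tasks with the owning database name resolved.
SELECT wt.session_id,
       wt.wait_type,
       wt.resource_description,
       DB_NAME(CAST(LEFT(wt.resource_description,
                         CHARINDEX(':', wt.resource_description) - 1) AS int)) AS database_name
FROM sys.dm_os_waiting_tasks wt
WHERE wt.wait_type LIKE 'PAGE%LATCH_%';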
The results of the query indicate that it is not latch contention on the tempDB allocation pages. This means that the contention is on the data/index pages themselves. This could indicate design issues, poorly written queries, or a very high number of transactions causing lock escalation (meaning row locks or page locks are being escalated to table locks).
February 15, 2010 at 4:15 pm
These are all in our main database.
So this is not really a disk issue, but rather likely to be an issue of table design and blocking?
This does occur in a large number of queries, which would seem to imply that it's not an issue of one particularly poorly-designed table but rather a more general problem. But then the next step is to pick a few of these specific queries and look at the execution plan?
February 15, 2010 at 4:20 pm
Whether it is a table problem or a query problem, the next step is the same. Like you said, identify the queries having problems and figure out why.
February 15, 2010 at 4:26 pm
I'm currently monitoring this and I'll post when I have more data (probably when we have a little bit more load, although I'll try to add some load and see if I can reproduce the problems).
We generally use a NOLOCK query hint for almost every read within the site and a ROWLOCK query hint for almost every update and insert. So I'd be surprised if lock contention were the issue, because we have large tables in which very few rows are updated/inserted at once (but many different rows are read).
February 15, 2010 at 4:35 pm
Using NOLOCK is helpful, but using ROWLOCK won't.
SQL tries to use row locks for inserts, deletes, and updates anyway, and if SQL thinks it needs to escalate the lock, it will do so. It will not honor your hint if it wants to escalate the lock. Essentially, the ROWLOCK hint has no effect and is being ignored.
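If you want to confirm whether escalation is actually happening, a quick sketch against sys.dm_db_index_operational_stats shows the escalation counters per index in the current database:

-- Lock escalation attempts and successful escalations per index
-- (counters reset when the database goes offline or the instance restarts).
SELECT OBJECT_NAME(ios.object_id) AS object_name,
       ios.index_id,
       ios.index_lock_promotion_attempt_count,
       ios.index_lock_promotion_count
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) ios
ORDER BY ios.index_lock_promotion_count DESC;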