Managing an Extremely Busy OLTP Server

Question

Managing an Extremely Busy OLTP Server

chandan_jha18

SSChampion

Points: 11330
More actions
February 12, 2014 at 5:45 am

#282275

Hi Folks,
I am the DBA for a server which is used for online gaming. The backend is running on SQL Server 2008 R2 standard edition.
the busiest database. After analysing the server statistics for over a couple of months, here are the prominent wait-types observed in the order of their total wait times:
WRITELOG
PAGEIOLATCH_SH
BACKUPIO
PAGEIOLATCH_EX
ASYNC_IO_COMPLETION
The database is a pure OLTP with no bulk processes running and the workload concentrates mostly on inserts by mutiple application threads. After some research and analysis, I found that the heaviest 4-5 tables which store the data and grows very rapidly,
they all have a primary key on an identity column incrementing by 1 and thus the default clustered index on it.
While this may be a good practice for most of the databases(I have witnessed a lot of similar ones), having an ever increasing key as clustered indexes might be causing multiple threads to add rows to the same page at the leaf level thus leading to 'last page inserts' contention as witnessed by the latch statistics as shown above.
I could find some solutions proposed to this problem:
1) Having the table partitioned- Rules out because of edition limitations
2) Having a clustered index on some other column which is not evenly increasing, something like GUID or session ID so that the contention is not always on the last page if the leaf. This might require addition of some new column and a schema change is something the management might not find a convincing way.
Now, at this point, I need a way to approach this problem. I had read that having multiple files in the filegroup can help scale up the writes because then the table will get split in different drives(SAN LUN in my case). although not sure if this will indeed help.
Looking for some advice on tackling this tricky problem. Please note that this is too intensive system and even a 3 second lock can lead to escalations at times until I cover it up cleverly.:-P
thanks
Chandan

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

MyDoggieJessie SSC-Forever Points: 44486 More actions · Answer 1

There could be a lot going on...If WRITELOG is your top metric here, I'd have to ask if all your log files from these busy DBs are all writing to the same LUN? Since you're dealing with a SAN I'd ask if how the disks in the pool were configured/optimized: were the disks all properly aligned? what RAID type were they configured as? How did your SAN admin carve out the LUNS from the pools? It could be that you've got either too many log files being written to the same LUN and that LUN/pool was configured for IOPS based upon random read/writes instead of sequential writes, etc.

Very hard to tell really without knowing how the SAN was configured.

Also, since you see a lot of PAGEIOLATCH's this could indicate SQL attempting to load a lot of data pages in memory, so I'd ask about your buffer pool (high or low?) and how much RAM does this server have available for SQL Server?

Where are the backups writing? Locally or to a network location?

To start with I would focus on determining how many log files are writing to "where", how much RAM is available, and how the disk subsystem was configured...before going to far.

Just my .02 though 🙂

______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

Grant Fritchey SSC Guru Points: 398882 More actions · Answer 2

Personally I wouldn't sweat the hot spot aspects of identity columns. That used to be a pretty major problem back in the day (say, SQL Server 7 and earlier), but they've rearchitected the storage mechanisms so that isn't the pain point it was previously. Instead, I'd focus on traditional query tuning issues. You look to have lots of waits on reads. Are the queries using the indexes you have in place optimially? Do you need some addition or different indexes? What does the query code look like, etc.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 3

chandan_jha18 (2/12/2014)
While this may be a good practice for most of the databases(I have witnessed a lot of similar ones), having an ever increasing key as clustered indexes might be causing multiple threads to add rows to the same page at the leaf level thus leading to 'last page inserts' contention as witnessed by the latch statistics as shown above.

Last page insert hotspot would cause page latch waits. Specifically PAGELATCH_EX. That's not a wait in the list you provided. PageIOLatch waits are IO-related waits.

In fact, all 5 of the waits you list are IO-related, personally that's where I'd start looking if this were my system.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

chandan_jha18 SSChampion Points: 11330 More actions · Answer 4

I sincerely apologize to have trimmed a few counters by mistake from the list I provided. Following are top 15 results ordered by their counts:

WRITELOG

LOGMGR_QUEUE

PAGELATCH_EX

CXPACKET

SLEEP_TASK

BACKUPIO

OLEDB

SLEEP_BPOOL_FLUSH

PAGEIOLATCH_SH

LATCH_EX

BACKUPBUFFER

PAGEIOLATCH_EX

SOS_SCHEDULER_YIELD

PREEMPTIVE_OS_AUTHENTICATIONOPS

PAGELATCH_SH

The above are the top 15 waits and I apologize for my little ignorance of giving not perfect data earlier.

Please advice on the same.

chandan_jha18 SSChampion Points: 11330 More actions · Answer 5

MyDoggieJessie (2/12/2014)
There could be a lot going on...If WRITELOG is your top metric here, I'd have to ask if all your log files from these busy DBs are all writing to the same LUN? Since you're dealing with a SAN I'd ask if how the disks in the pool were configured/optimized: were the disks all properly aligned? what RAID type were they configured as? How did your SAN admin carve out the LUNS from the pools? It could be that you've got either too many log files being written to the same LUN and that LUN/pool was configured for IOPS based upon random read/writes instead of sequential writes, etc.
Very hard to tell really without knowing how the SAN was configured.
Also, since you see a lot of PAGEIOLATCH's this could indicate SQL attempting to load a lot of data pages in memory, so I'd ask about your buffer pool (high or low?) and how much RAM does this server have available for SQL Server?
Where are the backups writing? Locally or to a network location?
To start with I would focus on determining how many log files are writing to "where", how much RAM is available, and how the disk subsystem was configured...before going to far.
Just my .02 though 🙂

I do not have much idea on SAN configuration here and the team here responsible for system administration does not have much details too. I apologize for being in this situation and not able to reveal much information here.

Yes there is a problem with memory and it is sitting at just 20 GB whereas the busiest database size reaches 400 GB with 2 months of data plus there are backups running everyday so I know that memory is a factor which is limiting the performance but at my front, I can only provide the data to the management and it goes in endless loops and chains before deciding to spend few dollars on hardware. However, the office parties with expensive food have their own budgets but for anything related to the server has to go through files of different officers and usually happens once a year or so. 😛

Backups are writing locally as well as over the network, so I would like to cut simultaneous operations and instead focus on copying backup files to network later.

Thanks

Chandan

chandan_jha18 SSChampion Points: 11330 More actions · Answer 6

GilaMonster (2/12/2014)
chandan_jha18 (2/12/2014)
While this may be a good practice for most of the databases(I have witnessed a lot of similar ones), having an ever increasing key as clustered indexes might be causing multiple threads to add rows to the same page at the leaf level thus leading to 'last page inserts' contention as witnessed by the latch statistics as shown above.
Last page insert hotspot would cause page latch waits. Specifically PAGELATCH_EX. That's not a wait in the list you provided. PageIOLatch waits are IO-related waits.
In fact, all 5 of the waits you list are IO-related, personally that's where I'd start looking if this were my system.

I changed the list. Please forgive my ignorance of using a filter in collecting my wait stats.

Thanks

Chandan

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 7

chandan_jha18 (2/13/2014)
I sincerely apologize to have trimmed a few counters by mistake from the list I provided. Following are top 15 results ordered by their counts:

Over what time period?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

chandan_jha18 SSChampion Points: 11330 More actions · Answer 8

GilaMonster (2/13/2014)
chandan_jha18 (2/13/2014)
I sincerely apologize to have trimmed a few counters by mistake from the list I provided. Following are top 15 results ordered by their counts:
Over what time period?

The server has not been rebooted for the last 60 days. so this data does reflect a good picture IMO.

Regards

Chandan

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 9

chandan_jha18 (2/13/2014)
GilaMonster (2/13/2014)
chandan_jha18 (2/13/2014)
I sincerely apologize to have trimmed a few counters by mistake from the list I provided. Following are top 15 results ordered by their counts:
Over what time period?
The server has not been rebooted for the last 60 days. so this data does reflect a good picture IMO.

It's a 60 day total of everything that happened (online operations, maintenance, backups, etc).

Start by recording, every hour, every half hour or some small interval, the top 10 useful waits (so no sleep tasks, no expected waits) incurred during that interval, wait time and incidents of that wait. With that you can start getting a detailed view of what waits are prevelent at what times, which correlate with peak activity, which correlate with slow periods, which are just a couple of processes waiting longish times and which are lots of processes waiting a little. That's a lot more useful IMHO than a single wait figure aggregated over 2 months of all sorts of activity.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass