October 21, 2010 at 10:33 am
I am currently troubleshooting an issue with a .NET web application which accesses SQL 2008: under load the CPU will suddenly hit 100% from a steady 20-30% and then does not recover. I've read the advice on how to post to this forum and am hopefully providing most of what you ask for.
The SQL 2008 setup is an n+1 cluster with 5 active nodes and 1 passive; each server runs Win 2003 and has 16 cores at 2.4 GHz (4 quad-core processors) with 32 GB RAM.
It is accessed from a web farm of 16 Win 2003 web servers running .NET 3.5. The web servers are grouped so that 4 servers will utilise one of the active nodes. The 5th active node is used for other purposes related to this same application.
The database is accessed from the .NET web site using a combination of ADO executing stored procedures and returning DataSets, LINQ to SQL executing stored procedures, and LINQ to SQL selecting directly from a table.
There are 4 searchable tables in the database which hold c. 125,000 records (definitions attached). There is a further static table that holds postcode-related information, including a geospatial value (geography datatype), and is used as a lookup; this contains over 1.7 million rows (all UK postcodes).
There are 3 possible ways in which the application retrieves data from the database:
Method 1) Against Table1 and/or Table2 to identify a set of matching ids, using ADO to execute stored procedures.
Method 2) Against Table3 by making an individual LINQ stored procedure call for each id returned by searching against Table1 or Table2 (there is currently a maximum of 300 ids returned per search). (Note: a single connection is opened in the application for the duration of this loop, used by each individual request, and explicitly closed afterwards.)
Method 3) Selecting a detail record from Table4 using LINQ to SQL direct on to the table
There are over 20 different stored procedures that can be used for Method 1; I have attached a sample of these. Additionally, a single user search will often result in 3 of these procs being executed to get local, regional and national results.
I have included the stored procedure used in Method 2, and the LINQ used for Method 3 as well as a proposed SP replacement for it.
The application appears to function perfectly well under reasonable load, with fast response times. Under higher load, however, a bottleneck/tipping point is reached and the CPU on a cluster node will suddenly jump from a steady 20-30% to 100% and seemingly lock up the server, because more and more requests continue to be sent through to the db while the web servers handle the load without issue. Not all servers fail at the same time. As soon as the load is directed away from the databases, the servers recover. I have confirmed that the cause is not a build-up over time: if the load is high it can happen within minutes, yet with lower load the application has run for hours without issue.
In terms of the load level: the level of requests coming through the web application is in the region of 7 RPS per web server, so a combined level of only 30 RPS, which seems extremely low (the application which sends requests to this one handles more like 90 RPS without breaking a sweat on an identical farm and cluster, though admittedly it has a very different profile of database usage). However, each of these requests will result in between 1 and 309 searches of the database: 3-9 searches of Method 1 above, plus a further 1-300 of Method 2 assuming matches are found, OR a single Method 3 search.
I have been doing as much reading around performance tuning as possible and have already identified additional indexes, including covering indexes. My current plan is to remove all LINQ access to the databases mentioned above and replace it with ADO (reasonably straightforward); however, I would welcome any further advice. One particular area I am interested in is the MAXDOP setting. Currently this is set to 16 on the db servers. I have been advised that Microsoft best practice is for this to be 8 for a 16-core server; however, due to the unknown effect changing this setting might have on other applications, I think I would have to opt for using the MAXDOP hint on a procedure-by-procedure basis at this point (with a view to potentially then rolling it out as the server default). Is it likely that MAXDOP could be an issue in this situation?
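For illustration, a minimal sketch of the per-procedure hint approach being considered (the procedure name, parameters and body below are placeholders, not one of the actual search procs; only the OPTION clause is the point):

-- Hypothetical proc showing where a statement-level MAXDOP hint would sit.
CREATE PROCEDURE dbo.uspSearchByCode_Sketch
    @Code     varchar(10),
    @MaxCount int
AS
BEGIN
    SET NOCOUNT ON;
    SELECT TOP (@MaxCount) ID
    FROM dbo.Table1
    WHERE Code = @Code
    OPTION (MAXDOP 8);  -- caps this statement at 8 schedulers instead of the server default of 16
END;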
Also, the server and SQL performance counters that we have been gathering as a matter of course don't seem to be offering me any clues as to what the issue is, so a recommendation on which counters to look at would be helpful.
I am unable to run a trace on the server at the moment because of this issue in production; what's more, when we have tried to run MS diagnostic tools, these seem to cause massive CPU usage themselves and so have been clouding the picture.
October 25, 2010 at 10:21 am
An update to the above. We are running SQL 2008 SP1 with no cumulative update patches (i.e. the build is 10.0.2531.0).
Any possible advice anyone could give on any of the above would be very helpful as we are performing further testing this week, using test scripts only though, not true live user data. If there is anyone who thinks they might be able to help but needs more info, please let me know what you need.
Also, I discovered this issue today http://support.microsoft.com/kb/970823/ and am now wondering whether this could be a potential cause of the problems (possibly clutching at straws) - As far as I know SQL 2008 should have been installed directly on to the machines but our servers are built and hosted by a third party and so I don't know exactly how they were built. I do know they were built in a rush.
I've also discovered that due to the fact we don't have the cumulative updates installed I won't be able to change MAXDOP other than to set it to 1 (http://support.microsoft.com/kb/970399/).
October 25, 2010 at 1:20 pm
October 26, 2010 at 1:21 pm
Thanks pavan. It's good to know this has worked for someone. It is just one of a number of changes that will be tested independently.
October 26, 2010 at 6:23 pm
Verify tempdb is on 4 - 8 equally sized files.
Experiment with MAXDOP, initially at 4. Parallelism is typically your friend, though not always.
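A minimal sketch of adjusting the instance-wide setting (subject to the KB970399 caveat mentioned earlier that this build may only honour a value of 1):

-- Instance-wide MAXDOP change; takes effect after RECONFIGURE, no restart needed.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;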
The goal should be designing queries with brief and concise predicates; the function is a primary culprit.
The CPU jump comes from full table scans causing resource contention.
You may also have one or more runaway web servers essentially performing a denial-of-service attack on the backend. Write a query grouping connections by hostname to verify even loading.
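A hedged sketch of such a query, using sys.dm_exec_sessions (the column choices are illustrative):

-- Group current user sessions by originating web server to check for
-- one host dominating the connections or CPU.
SELECT host_name,
       COUNT(*)            AS session_count,
       SUM(cpu_time)       AS total_cpu_ms,
       SUM(reads + writes) AS total_io
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
GROUP BY host_name
ORDER BY session_count DESC;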
I suspect Code has too low a cardinality for a clustered index key. You may want to consider using the ID column instead.
Tune the stored procs on a test system until they start performing index seeks.
Reorganize the indexes and update stats with a full scan when possible.
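For example, per table (a sketch; Table1 stands in for each of the searchable tables):

-- Reorganize every index on the table, then refresh its statistics with a
-- full scan rather than the default sample.
ALTER INDEX ALL ON dbo.Table1 REORGANIZE;
UPDATE STATISTICS dbo.Table1 WITH FULLSCAN;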
Dan
October 27, 2010 at 8:19 am
EchoKilo: forums are a place for SIMPLE - QUICK problem resolution because they are 'staffed' by people volunteering their time. What you have presented could take days or more to completely resolve. Likely time to engage a GOOD performance tuning consultant who can help mitigate and mentor you on how to do same.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
October 27, 2010 at 8:32 am
You need to move to a 64-bit OS and a 64-bit version of SQL Server. You are running the boxes out of resources under heavy load, and 32-bit versions are limited in what memory they can use in AWE mode.
You did not mention what the SQL Server memory utilization is.
The probability of survival is inversely proportional to the angle of arrival.
November 1, 2010 at 11:56 am
@drawlings - thanks for the advice. tempdb is only 1 file, but there does not seem to be any contention there even under load; I will consider splitting out the files, but would they need to be split across more than 1 disk? Due to the way our environment is configured using SAN and mount points this wouldn't be possible. I have tried removing the function from the predicates and this had no positive effect under load. The majority of indexes appear to be performing index seeks, as we'd already looked at some index tuning, though there is potentially more we need to do in this area, especially in terms of looking at the best candidate for the clustered index. The database is recreated regularly (2 versions run in parallel, one in use, the other dormant); as part of the data import via SSIS the indexes are dropped and recreated and statistics updated.
@TheSQLGuru - I appreciate your comment regarding SIMPLE queries being posted in forums and take your point on board. I was simply hoping that it was possible that there might have been something really obvious I was missing, or someone having experienced something similar who might have been able to offer some helpful advice or pointers. If the answer to my question was available in a simple Books Online or forum search I wouldn't have had to post at all. Apologies if I have caused offence.
@sturner - I omitted the key fact that we are using 64-bit SQL Server 2008 on a 64-bit OS. There is 32 GB of memory available; I have been told that SQL has never used more than 17 GB of this.
November 2, 2010 at 3:43 am
You're getting various inappropriate and/or plain inaccurate suggestions thrown in here.
"yes, true set the maxdop to 1 will decrease the utilization of cpu usage."
No. There can be circumstances where you may choose to prevent parallel plans by enforcing MAXDOP, but as @drawlings mentioned, parallelism is usually your friend.
"You need to move to a 64-bit OS and a 64-bit version of SQL Server. You are running the boxes out of resources under heavy load, and 32-bit versions are limited in what memory they can use in AWE mode."
There is nothing in the description from the OP that would lead to this conclusion. The load is not heavy, the data volumes are not particularly high and there is 32gb in play.
"Verify tempdb is on 4 - 8 equally sized files."
There is little point in doing this unless there is evidence of tempdb contention. Wait resource entries for 2:1:1 (PFS page) or 2:1:3 (SGAM page) are the usual indicators of tempdb allocation problems.
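A quick, hedged way to look for those waits while the system is under load:

-- Sessions currently latch-waiting on the tempdb PFS or SGAM pages.
SELECT session_id, wait_type, wait_resource, wait_time
FROM sys.dm_exec_requests
WHERE wait_type LIKE 'PAGELATCH%'
  AND (wait_resource LIKE '2:1:1%' OR wait_resource LIKE '2:1:3%');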
@EchoKilo your best bet to get a more detailed response would be to establish which queries are causing issues (via Profiler trace or failing that, manual execution) and then attach the execution plans to your post. One potential horror that stands out from a quick glance at the scripts, as @drawlings mentioned, is the use of functions in predicates e.g.
AND dbo.fncWeighting(@BaseQuality, Flag, @FlagMatch, @FlagWeight,
                     Hours, @HoursLowMatch, @HoursHighMatch, @HoursWeight) > @QualityThreshold
One possible scenario would be that most of the time (during your 20-30% usage) one or more queries have reasonably efficient plans. Recompilation is then occurring with a different parameter set, resulting in a different plan which performs badly.
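One hedged way to look for that in the plan cache (the TOP value is arbitrary; nothing here is tied to a specific proc):

-- Cached statements ranked by average CPU; plan_generation_num climbing
-- alongside a jump in avg_cpu_us can indicate a recompile that produced a
-- worse plan for a different parameter set.
SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time / qs.execution_count AS avg_cpu_us,
       qs.plan_generation_num,
       qs.last_execution_time,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1, 200) AS statement_start
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_cpu_us DESC;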
November 2, 2010 at 9:11 am
1) Forums, 'staffed' by volunteers, are for short, relatively straight-forward help and advice scenarios. What you have posted could take hours or even days to really get things tweaked out optimally. It would probably take me at least 10 minutes just to really understand all the precursor information you provided! I highly recommend you shift this over to a qualified consultant who can take the time to give you proper guidance and mentoring.
2) Given my very quick review I can say without a doubt that the single most important thing you MUST do is eliminate the scalar UDF!!!! Refactor that code to be inlined into each object that calls it, and I can just about guarantee you significant improvements in performance.
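To make the inlining idea concrete, a hedged sketch of the pattern (the arithmetic below is a guessed placeholder, since the body of dbo.fncWeighting hasn't been posted; the real logic would be substituted):

-- Scalar UDF call replaced by an inline expression so the optimiser can
-- cost it and avoid the per-row function invocation overhead.
SELECT TOP (@MaxCount)
       ID,
       (@BaseQuality
        + CASE WHEN Flag = @FlagMatch THEN @FlagWeight ELSE 0 END
        + CASE WHEN Hours BETWEEN @HoursLowMatch AND @HoursHighMatch
               THEN @HoursWeight ELSE 0 END) AS Quality  -- placeholder weighting logic
FROM dbo.Table1
WHERE Code = @Code;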
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 4, 2010 at 11:50 am
@mark-3 - Thanks for the pointers. See below; I have potentially narrowed the problem down a little.
@TheSQLGuru - I would love to bring in a qualified consultant, and in an ideal world that would be the obvious move, but this is not an ideal world. I am therefore trying to do all I can, including looking into the possibility of bringing in a consultant. Anything else I can do in the meantime is invaluable, so I thank you and anyone else who has taken the time to look at this, to whatever level of detail.
----------
As mentioned above, I have fortunately been able to perform some further load testing, and whilst this hasn't solved the problem, I have ruled some things out and can therefore hopefully now post a much smaller and less complicated query.
Removal of the function predicate was tested, but this unfortunately had absolutely no impact on the situation. I am not ruling this out as part of the long-term solution, as I take the advice that functions in predicates are not ideal regardless; I just need a little client buy-in first.
I have not, however, rewritten the select to eradicate the UDF; I'm not sure where I'd start, other than actually splitting it into distinct selects based upon the parameter values, and I didn't have time to investigate this.
November 4, 2010 at 12:16 pm
I now think that the problem lies with the geospatial STDistance searches on the site. (I realise execution plans would be helpful, but I need to get them and anonymise them first, which means getting access to the production environment again. Early next week, hopefully.)
The SPs work correctly and efficiently when run standalone, BUT when placed under load they seem to cause the database server CPU to rise from the 20-30% level to 100%. This occurs rapidly, taking just a few seconds (after a period of the site having run happily under the same load - this period has varied from minutes to nearly an hour). Once the CPU has maxed out, the servers do not recover until the load is removed.
This is obviously less than ideal, as the queries then time out. Oddly, though, those with fewer possible matches (<200, say) still return, and quickly - a couple of seconds at most - whereas those with greater numbers of possible matching records (>5000) time out (hence my thinking it may be the geospatial distance calculations or similar).
The query is pretty basic as outlined below:
SELECT TOP (@MaxCount)
       ID,
       ISNULL(Geocode.STDistance(@Location), 0) AS Distance,
       dbo.fncWeighting(@BaseQuality, Flag, @FlagMatch, @FlagWeight,
                        Hours, @HoursLowMatch, @HoursHighMatch, @HoursWeight) AS Quality,
       Date
FROM Table1 WITH (NOLOCK)
WHERE Code = @Code
  AND Geocode.STDistance(@Location) < @Radius
  AND dbo.fncWeighting(@BaseQuality, Flag, @FlagMatch, @FlagWeight,
                       Hours, @HoursLowMatch, @HoursHighMatch, @HoursWeight) > @QualityThreshold
ORDER BY Distance ASC, Quality DESC, ReceivedOn DESC
(The function included performs straightforward numeric processing to return a figure for ordering.)
The Spatial index on the table concerned is set up as follows:
USING GEOGRAPHY_GRID
WITH
(
GRIDS =(LEVEL_1 = MEDIUM,
LEVEL_2 = MEDIUM,
LEVEL_3 = MEDIUM,
LEVEL_4 = MEDIUM),
CELLS_PER_OBJECT = 16,
PAD_INDEX = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON,
FILLFACTOR = 90
) ON [PRIMARY]
The table has 100,000 or so records
I'm not too sure whether the query should be broken down, or should use a temporary table/table variable to get a smaller dataset to work with, or similar?
I'm also not sure whether the index is suitably well tuned, particularly regarding the grids and cells per object; any hints would be gratefully appreciated.
Any ideas?
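For reference, a minimal sketch of what recreating the index with denser upper grid levels might look like (the index name is a placeholder and the HIGH settings are a guess to experiment with, not a recommendation):

-- Rebuild the spatial index in place with higher grid density at the top
-- levels; for point data each row occupies a single cell per level.
CREATE SPATIAL INDEX SIX_Table1_Geocode    -- placeholder name
ON dbo.Table1 (Geocode)
USING GEOGRAPHY_GRID
WITH (
    GRIDS = (LEVEL_1 = HIGH, LEVEL_2 = HIGH, LEVEL_3 = MEDIUM, LEVEL_4 = MEDIUM),
    CELLS_PER_OBJECT = 16,
    DROP_EXISTING = ON
);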
November 4, 2010 at 1:11 pm
The benefits of configuring tempdb to use multiple files are both I/O throughput and parallelism. Cores can "attach" to their own file, enhancing parallel execution. Adding disks always helps - that's if you mean spindles and/or hypers. The basic goal is more spindles, fewer LUNs, and striping on the frame, not on the host.
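A minimal sketch of adding the files (names, paths and sizes are placeholders; the files should all be sized the same as the existing one):

-- Add further equally sized tempdb data files alongside the existing one.
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdev3.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev4, FILENAME = 'T:\TempDB\tempdev4.ndf', SIZE = 4GB, FILEGROWTH = 512MB);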
In tuning the questionable proc it may be beneficial to split the query into smaller sets using a temp table or table variable. Keep in mind that table variables are slower than temp tables when working with large sets. The rule of thumb I developed after testing (years ago) is that above 5 thousand records you shouldn't use table variables. Itzik confirmed these findings in an article on sqlmag.
Lunch time.
Dan
November 4, 2010 at 5:24 pm
Table variables are not inherently slower than temp tables. The issue lies in the optimiser evaluating them as always having a small number of rows, IIRC < 10. That aside, I don't see how dividing the query into smaller sets is going to help this particular issue.
Looking at the revised query, it appears @Code is the only SARG'able predicate. So on the one hand, the issues may be related to widely varying density for that column. Alternatively (or probably in combination), the fncWeighting function (AS Quality) is used as a sort argument. This necessitates every row that qualifies the @Code predicate be passed to the function. If the density of @Code is such that the optimiser elects to table scan, you could have 100k calls to your function.
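A quick, hedged way to see how that density actually varies (i.e. how many rows a given @Code would push through the function):

-- Row count per Code value; a large count for a popular code means that many
-- potential function calls per search if the plan scans.
SELECT Code, COUNT(*) AS rows_per_code
FROM dbo.Table1 WITH (NOLOCK)
GROUP BY Code
ORDER BY rows_per_code DESC;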
You mentioned a significant difference between 200 vs 5000 results. Was this tested by varying @Code or by changing the TOP(@MaxCount)?
November 5, 2010 at 3:54 am
The difference was by varying @Code when under load.