sql server 2008 queries with intermittent poor performance

  • Hi there,

    On one of our customers, we've been having some performance issues on our application. It's a .NET 3.5 web app that is consuming and updating data on a SQL Server database. Currently our production environment consists of a Windows 2008 R2 machine as the front end, and a SQL Server 2008 R2 cluster on the back end. Our app uses COM+ and MSDTC to connect to the database.

    Here's what's happening: our end users sometimes complain of slowness in the application. Some pages take some more time to load than would be expected. While trying to figure out what's happening, I managed to find out some strange behaviour on the database side that may be the cause for the performance degradation. I noticed that sometimes there are some SQL statements that take a lot more time to run that what would be expected. I managed to identify some of these statements (mainly it's invocations of some of our application's stored procedures) using a profiler trace (with TSQL_Duration template) to identify the long running queries.

    The problem is that when I run these sp's directly on the database on SQL Management Studio sometimes they do take long (about 7 / 8 seconds), other times they are fast (under 1 sec.). I don't know why this happens and it's driving me nuts, because the SQL machine (4 core, 32 GB) is not being used by any other applications, and these queries should not take this long to run.

    Not being a DBA or a SQL Server guru, I've been trying to look at some stuff that may help me understand the problem. Here's the steps I've taken to try and sort out the problem and what I found out so far:

    - All of the TSQL code called by the application is written in stored procedures.

    - I identified some of the long running queries on SQL Server profiler, however when I run these on Management Studio they either take long to run (from 4 to 10 secs.), or run quickly (under 1 sec.). I am running the exact same queries with the same data passed in the parameters. These queries are mainly sp's with select statements in them.

    - I tried looking at the waits and queues statistics to try and figure out if there are processes waiting on some resources. I ran the following query:

    WITH Waits AS

    (SELECT

    wait_type,

    wait_time_ms / 1000.0 AS WaitS,

    (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,

    signal_wait_time_ms / 1000.0 AS SignalS,

    waiting_tasks_count AS WaitCount,

    100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,

    ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum

    FROM sys.dm_os_wait_stats

    WHERE wait_type NOT IN (

    'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',

    'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',

    'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',

    'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',

    'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',

    'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',

    'BROKER_RECEIVE_WAITFOR', 'ONDEMAND_TASK_QUEUE', 'DBMIRROR_EVENTS_QUEUE',

    'DBMIRRORING_CMD', 'BROKER_TRANSMITTER', 'SQLTRACE_WAIT_ENTRIES',

    'SLEEP_BPOOL_FLUSH', 'SQLTRACE_LOCK')

    )

    SELECT

    W1.wait_type AS WaitType,

    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,

    CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,

    CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,

    W1.WaitCount AS WaitCount,

    CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage,

    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S,

    CAST ((W1.ResourceS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgRes_S,

    CAST ((W1.SignalS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgSig_S

    FROM Waits AS W1

    INNER JOIN Waits AS W2 ON W2.RowNum <= W1.RowNum

    GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage

    HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold

    GO

    Here's what I found out:

    - After I reset the stats using DBCC SQLPERF (about 1 or 2 hours after), the wait types I have most is SOS_SCHEDULER_YIELD and WRITELOG

    - Over time (after about 1 day execution), the wait types that happen the most on the database are CXPACKET (67%) and OLEDB (17%), even though the average wait time for each is not long. I also noticed that the longer running statements identified on SQL Profiler are calls to sp's that return more than one resultset (often 3). Can there be a paralellism problem here ? Is there any way I can try to identify if this is the cause of the problem ?

    - I've read somewhere that OLEDB waits can be caused by calls to OLEDB resources like linked servers. We do have a linked server to connect with a Indexing Services machine (MSIDXS), however none of the statements identified as long running make use of that linked server.

    - The higher average wait time I have is for LCK_M_X type waits (about 1.5 secs average), but these wait types don't happen very often compared to other types (for example, 64 LCK_M_X waits vs 10,823 CXPACKET waits on the same period of time).

    - One thing I noticed is that the MSDTC service is not clustered. SQL Server service is clustered but not MSDTC. Can there be a performance hit because of this ? We are using MSDTC because our app uses Enterprise Services (DCOM) to access the database, but the servers were not installed and configured by us, but by our client.

    Can anyone help me make some more sense of this data ? Can anyone give me a hand on understanding what may be happening ? Is there something I can do on the server to try and figure things out ? Should I talk to the application development team ?

    Any help will be appreciated, as this is driving me nuts... 🙁

    Regards,

    Joao Maia

  • Hi,

    That sounds quite familiar to me...

    Is the database being updated or is it read only ??

    Have you minimised locking in the procedures ??

    The high cxpacket value does suggest some parallelism problem. Can you change this and test this offline ??

    Are there many large queries being run against the server ??

    How much memory have you allocated to the sql server ??...although I don't think this is the issue.

    How many users are accessing the site per hour..or per day ??

    Is the sql server doing anything esle other than running SQL Server ???

    Graeme

  • http://www.simple-talk.com/sql/performance/finding-the-causes-of-poor-performance-in-sql-server,-part-1/

    http://www.simple-talk.com/sql/performance/finding-the-causes-of-poor-performance-in-sql-server,-part-2/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi,

    Thanks for your quick reply. I will try to answer your questions as best as I can:

    Is the database being updated or is it read only ??

    [jmaia] It is being updated as well. Most of the identified long running queries are read operations, though.

    Have you minimised locking in the procedures ??

    [jmaia] Yes. At least all the selects in the sp's use NOLOCK.

    The high cxpacket value does suggest some parallelism problem. Can you change this and test this offline ??

    [jmaia] It's highly unlikely I can test this offline because it's a production environment, so stopping it and testing offline is not really easy, unless of course the test is something that takes very little time. And I don't get this behavior on the testing and quality environments.

    Are there many large queries being run against the server ??

    [jmaia] If by large queries you mean queries that return lots of results, then the answer is not really. Most queries do not return large resultsets (900 / 1000 rows at most). If you mean complex queries, yes there are some complex queries with JOINS / UNIONS / DISTINCT statements.

    How much memory have you allocated to the sql server ??...although I don't think this is the issue.

    [jmaia] As I said the server was configured by our client's infrastructure team. From what I see they allocated all of the memory of the server to SQL Server, so I'm guessing it uses as much as it can.

    How many users are accessing the site per hour..or per day ??

    [jmaia] Not many. We're using the same connection string for all users, and it's a SQL Server authentication so we have the same sql server account for all connections. If Enterprise Services is working as it should, then we have connection pooling (we haven't disabled it explicitly).

    Is the sql server doing anything esle other than running SQL Server ???

    [jmaia] I don't think so. At least I don't see any other significant process in Task Manager, and the processor is idle most of the time. There seems to be some disk activity though, most of it coming from SQL Server. And this is the only database running on the server.

    One other thing I forgot to mention is that the application is using COM+ (Enterprise Services) and not setting explicitly the Transaction Isolation Level (which by default is Serializable). This can account for the LCK_M_X wait types that I get (even though they are a small number compared to other wait types).

    Regards,

    Joao Maia

  • Thanks GilaMonster, I'll look into those links... 🙂

    Regards,

    Joao Maia

  • Hmm..ok.

    And your test environments are configured the same, albeit with less resources.

    Have you load tested the application at all ??

    This probably need a structured approach..and the articles posted by Gail would certainly be the best place to start.

    I will give it some further thought though..

    Regards

    Graeme

  • Please post the actual execution plan. Table DDL, including keys and indexes.

  • jmaia (7/11/2011)


    Have you minimised locking in the procedures ??

    [jmaia] Yes. At least all the selects in the sp's use NOLOCK.

    Is that a good idea?

    http://sqlblog.com/blogs/andrew_kelly/archive/2009/04/10/how-dirty-are-your-reads.aspx

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Graeme,

    We did some load tests on the application in a lab environment with the same architecture as the production environment, but with less resources and could not find any bottleneck during the tests. In fact the application worked ok in that lab environment (less resources) for a lot more users than the number that is actually connected on the production environment (more resources), which is one of the reasons I am puzzled...

    Load Test: Less Resources, More Users - ok

    Production: More Resources, Less Users - performance problems (intermittent)

    I really don't get it... 🙁

    Still I will look at the links...

    Joao Maia

  • What is the frequency of index defrag and stats updates?

    What are the column's datatype in the wehere clause?

    Are you waiting on something to send us the actual execution plan?

  • Very impressive articles authored by you as usual. 😎

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Ninja's_RGR'us (7/11/2011)


    What is the frequency of index defrag and stats updates?

    What are the column's datatype in the wehere clause?

    Are you waiting on something to send us the actual execution plan?

    Hi,

    I'm not on the client's building anymore and I can't access the database from where I am now, but I'll try and follow your articles for posting data for performance issues and do it tomorrow.

    It might take a while as I will have to dig into the sp's code, though... 😐

    Thanks for your help so far.

    jm

  • jmaia (7/11/2011)


    Ninja's_RGR'us (7/11/2011)


    I'm not on the client's building anymore and I can't access the database from where I am now, but I'll try and follow your articles for posting data for performance issues and do it tomorrow.

    It might take a while as I will have to dig into the sp's code, though... 😐

    Hello again,

    I finally managed to gather some more data about the statements that I mentioned on the preceding posts. These are some old sp's (about 5 or 6 years old) that have been devloped by someone who's not in the company anymore so it took me some time to try and figure out what the problematic statements are, but I believe I managed to find out.

    I'm attaching a zip file that has the following information:

    - PagingDynamicQuerySample.sql - contains the culprit statements that are taking intermittently long

    - sp_help_* - contains the table definitions for the tables used on the query in the above statements

    - fn* - contains function definitions for used functions int the query

    - CursorQueryPlan - Actual Execution Plan for the query in the Cursor

    - OverallDynamicQueryPlan - Actual Execution Plan for the statements in PagingDynamicQuerySample.sql

    Here are some data regarding the actual production table rowcounts as well:

    - Table: relRegistoPasta - about 810,000 rows

    - Table: tblRegisto - about 660,000 rows

    - Table: tblPerfil - about 30 rows

    - Table: tblAlerta - about 80 rows

    - Table: tblUtilizadorDetalhe - about 520 rows

    - Table: tblEntidades - about 3,000 rows

    - Table: @tblPK - this table is constructed on the statements but it should have 30 rows

    Regarding execution times, what happens is that sometimes it executes in about 1 sec., other times it takes about 7 secs, and sometimes even more (10, 20 secs). I have noticed this by running it on SQL Management Studio.

    Now before you start hitting me on the head because of the SQL code that I sent, let me tell you that this code is rather old (about 5 or 6 years), and has been written by someone who does not work in our company anymore and hasn't really been maintained over the years. From talking to people on our product development team what I learned about it is the following:

    - This code is run when a user is navigating on our application. The user navigates an explorer-like tree with folders and the code I sent is used to retrieve the contents of each folder (which then get displayed in a paged grid on the application).

    - Paging is done in the SQL query, this code itself retrieves paged results. It is generated by a paging sp that receives a string with columns to retrieve, a string with the tables, a string with the conditions, a string with the column to sort by, the page to retrieve, and the pagesize and then generates the code in PagingDynamicQuerySample.sql dynamically and does an exec of that code to retrieve only the required page from the resultset. This sp is used throughout the application whenever a result is needed for a paged grid.

    - There is some weird stuff on the conditions part of the query (like 0 = 0 and the like), but I believe it is because it is generated dynamically.

    - This code is used quite a lot because users are constantly navigating on the application. We have about 50 / 60 concurrent users on this client.

    Anyway, this is all I could gather for now. If there are any more questions I'll pick the brains of the development team and try to figure out what I can.

    I'm losing sleep over this stuff... 🙁

    Thanks for your help.

    jm

  • Cursor query :

    Good Enough Plan Found (not awefull but better if you don't have this warning)

    estimate 4 vs 354 found (can be a real big problem if you have more rows or even just that little amount)

    CONVERT_IMPLICIT(tinyint,[MAI_Smartdocs_MAI].[dbo].[tblUtilizadorDetalhe].[VeRegistosTrabalho],0) not sure if this is an issue but it could be if it's in a join or filter.

    Last query

    Good Enough Plan Found

    30 rows found vs 1 estimate (imagine if the query returns 1000, that could get hellishly slow)

    I'd consider changing the table variable to temp table with PK (so the optimzer gets better stats to work with).

    Instead of the cursor you could do INSERT INTO #tmp SELECT TOP (What you want here) Columns FROM (cursor query).

    CONVERT_IMPLICIT(tinyint,[MAI_Smartdocs_MAI].[dbo].[tblUtilizadorDetalhe].[VeRegistosTrabalho],0) Again might or might not be an issue

    I'd change the code to INNER JOINS instead of all the wheres. You might also have an accidental cross join in there that's hidden by the distinct.

    ((0 = 1 and tblRegisto.flagPasta <> 'R') or (0 = 0)) and ...

    That kind of filtering only gets optimized correctly if the query plan is rebuilt at every run. That could very well be your source of the problem (along with everything else). Either use dynamic sql (sp_executesql) or use with recompile for the whole procedure. This will incure a compile hit but I don't think you have a choice here.

    I don't have time to a complete rework this with you. It would probably take a couple days over this forum. If you don't get it solved over 1-2 days with us, you might consider calling a tuning expert in to come in your shop.

    Happy hunting!

  • Ninja's_RGR'us (7/12/2011)


    Good Enough Plan Found (not awefull but better if you don't have this warning)

    That's not a warning. It's saying that the optimiser stopped because it had found a good enough plan. That's the one you want to see (the other option is Timeout)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply