Strange Job Slowdown 30 secs to 30 minutes

  • Greg Edwards-268690 (8/9/2013)


    Stefan Krzywicki (8/9/2013)


    Greg Edwards-268690 (8/9/2013)


    I think part of it being suspect of paging lends itself to being run after that largest job of the day.

    At least that is what I get out of the description.

    Kind of sounds like it is waiting for something.

    By chance, is there some other machine involved?

    Yeah, I'm thinking the timing after the largest job has something to do with it.

    It could also be a memory allocation problem. There was no free RAM on the machine, it was all taken by the OS and SQL Server. We were given 38Gig RAM, but it turns out it is running Win 2008 and that can only use 32. We'd given SQL Server 28 of the 38 thinking there was 10 left. We've reduced that to 22 to see what happens and now there's some free RAM.

    Standard Edition of Win 2008 R2, same as I was running.

    Be sure to double check if SSAS settings do not also contribute.

    From my experience, SSAS seems more of a reference.

    Also take into account VM Host overhead.

    I think you're headed the right way.

    Less may mean it clears out old cache before tipping, if that's what it ends up being.

    I'm hoping, or that there's more RAM for SSIS.

    There's no SSAS running at night. Same for SSRS.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • If the services are on, they are running and available.

    We would have people think the time to run a big query was the middle of the night.

    So do not assume.

    Going on a little vacation, don't know how much I'll be checking things.

    So if I disappear, I must be enjoying the north shore.

    I don't know how much you've done with traces, but they can be scheduled.

    And between perfmon and profiler, you have a wide array of things that can be checked.

    Much better than if someone is using SCOM and doing 15 minute intervals.

  • Greg Edwards-268690 (8/9/2013)


    If the services are on, they are running and available.

    We would have people think the time to run a big query was the middle of the night.

    So do not assume.

    Going on a little vacation, don't know how much I'll be checking things.

    So if I disappear, I must be enjoying the north shore.

    I don't know how much you've done with traces, but they can be scheduled.

    And between perfmon and profiler, you have a wide array of things that can be checked.

    Much better than if someone is using SCOM and doing 15 minute intervals.

    You're right about not assuming, I asked everyone with rights to the server to check. No-one is running anything then.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Stefan Krzywicki (8/10/2013)


    Greg Edwards-268690 (8/9/2013)


    If the services are on, they are running and available.

    We would have people think the time to run a big query was the middle of the night.

    So do not assume.

    Going on a little vacation, don't know how much I'll be checking things.

    So if I disappear, I must be enjoying the north shore.

    I don't know how much you've done with traces, but they can be scheduled.

    And between perfmon and profiler, you have a wide array of things that can be checked.

    Much better than if someone is using SCOM and doing 15 minute intervals.

    You're right about not assuming, I asked everyone with rights to the server to check. No-one is running anything then.

    I had quite a few users.

    Catching the activity either in a trace, or from windows logs as Kerberos traversed boxes, then asking directly what they were running at XXX time, seemed to be the only way I could get a response.

    Sounds like you are talking something other than users of the data.

    First stop - Castle Danger to pick up a growler.:-D

  • Greg Edwards-268690 (8/10/2013)


    Stefan Krzywicki (8/10/2013)


    Greg Edwards-268690 (8/9/2013)


    If the services are on, they are running and available.

    We would have people think the time to run a big query was the middle of the night.

    So do not assume.

    Going on a little vacation, don't know how much I'll be checking things.

    So if I disappear, I must be enjoying the north shore.

    I don't know how much you've done with traces, but they can be scheduled.

    And between perfmon and profiler, you have a wide array of things that can be checked.

    Much better than if someone is using SCOM and doing 15 minute intervals.

    You're right about not assuming, I asked everyone with rights to the server to check. No-one is running anything then.

    I had quite a few users.

    Catching the activity either in a trace, or from windows logs as Kerberos traversed boxes, then asking directly what they were running at XXX time, seemed to be the only way I could get a response.

    Sounds like you are talking something other than users of the data.

    First stop - Castle Danger to pick up a growler.:-D

    We're a very small shop. : -)

    We're running a trace too though.

    Have fun at the shore.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • OK, I have some more data.

    We stopped all backups of all kinds across the entire system. That's not it.

    We're using Windows Server 2008 OS Standard. This limits us to 32Gig RAM.

    I've been told that Windows Server 2008 manages page files automatically. We're not hitting 50% ram usage and only hit 60% CPU. Both of the maximums are in the job immediately preceding.

    At one point, the job ran for 5 hours, hung on the first INSERT statement. I stopped the job & ran each section of the query. They ran in under a second. I ran the full query, it ran in under a second. I ran the job, it took under 5 minutes, which is appropriate given the accumulated amount of data.

    We're running a trace and perfmon and see no conflicts, other jobs running or anomalous system usage.

    Are there specific things we should have turned on to track in either the Perfmon or the trace? I'm working with our DBA, he's the one running these.

    All of this only started when we went to VM, so I'm thinking it has to be related to that somehow. There were no code changes.

    Thanks for your help and any other ideas.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • SQL Trace and Perfmon on the guest are only showing you part of the picture.

    Even if you've gone for a 1 to 1 mapping of host/guest, these are the things I'd be looking at:

    Perfmon/VMWare resource graphs for the host

    SAN activity graphs

    Any scheduled activity on the VMWare host (you say that backups are disabled, but does that include VMWare snapshots?)

    Any scheduled activity on the SAN.

    Also check:

    Event viewer on guest and host

    SQL Server error log on guest - performance problems can be logged here, particularly long I/Os etc.

    Does the SQL trace show which particular statement is taking the time?

  • HowardW (8/13/2013)


    SQL Trace and Perfmon on the guest are only showing you part of the picture.

    Even if you've gone for a 1 to 1 mapping of host/guest, these are the things I'd be looking at:

    Perfmon/VMWare resource graphs for the host

    SAN activity graphs

    Any scheduled activity on the VMWare host (you say that backups are disabled, but does that include VMWare snapshots?)

    Any scheduled activity on the SAN.

    Also check:

    Event viewer on guest and host

    SQL Server error log on guest - performance problems can be logged here, particularly long I/Os etc.

    Does the SQL trace show which particular statement is taking the time?

    The resource usage I mentioned above was for the guest, but we also looked at resource graphs for the host and they parallel the ones for the guest.

    I don't believe we're running any VMware snapshots, but I'll check.

    Edit: Confirmed, we are not running any snapshots.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Stefan Krzywicki (8/13/2013)


    OK, I have some more data.

    We stopped all backups of all kinds across the entire system. That's not it.

    The things you're not doing definitely include system state backup and windows 2008 server backup on both host level and guest level?

    Plus I'll definitely second HowardW on event viewer at both levels, and on SQL error log.

    Tom

  • L' Eomot Inversé (8/13/2013)


    Stefan Krzywicki (8/13/2013)


    OK, I have some more data.

    We stopped all backups of all kinds across the entire system. That's not it.

    The things you're not doing definitely include system state backup and windows 2008 server backup on both host level and guest level?

    Plus I'll definitely second HowardW on event viewer at both levels, and on SQL error log.

    We normally use EMC backups (which I don't like, but that's neither here nor there) and stopped both the SQL Server backups & file system backups on the host and guest levels.. There's no snapshot being used.

    I've gone through everything in the event viewer and don't see anything likely. Nothing at the time of the slowdown. It still feels like resource locking or cpu power saving/sleeping to me. I'm told there's no power save being used on the cpus though.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Stefan Krzywicki (8/13/2013)


    L' Eomot Inversé (8/13/2013)


    Stefan Krzywicki (8/13/2013)


    OK, I have some more data.

    We stopped all backups of all kinds across the entire system. That's not it.

    The things you're not doing definitely include system state backup and windows 2008 server backup on both host level and guest level?

    Plus I'll definitely second HowardW on event viewer at both levels, and on SQL error log.

    We normally use EMC backups (which I don't like, but that's neither here nor there) and stopped both the SQL Server backups & file system backups on the host and guest levels.. There's no snapshot being used.

    I've gone through everything in the event viewer and don't see anything likely. Nothing at the time of the slowdown. It still feels like resource locking or cpu power saving/sleeping to me. I'm told there's no power save being used on the cpus though.

    Obvious question, but has anyone actually got up at this time and cast their eyes on the problem as it's happening?

    There are some fundamental pieces of information that you don't seem to have been able to find yet, like the specific statement that's not returning and what the wait events are, whether there is any blocking etc. or whether this problem is not in the SQL Engine at all and it's all hung in SSIS.

    You can obviously build up monitoring to record this stuff historically (or get something 3rd party like SQL Monitor/Diagnostic Manager), but eyes on the problem as it's happening is always a good way to troubleshoot!

  • HowardW (8/13/2013)


    Stefan Krzywicki (8/13/2013)


    L' Eomot Inversé (8/13/2013)


    Stefan Krzywicki (8/13/2013)


    OK, I have some more data.

    We stopped all backups of all kinds across the entire system. That's not it.

    The things you're not doing definitely include system state backup and windows 2008 server backup on both host level and guest level?

    Plus I'll definitely second HowardW on event viewer at both levels, and on SQL error log.

    We normally use EMC backups (which I don't like, but that's neither here nor there) and stopped both the SQL Server backups & file system backups on the host and guest levels.. There's no snapshot being used.

    I've gone through everything in the event viewer and don't see anything likely. Nothing at the time of the slowdown. It still feels like resource locking or cpu power saving/sleeping to me. I'm told there's no power save being used on the cpus though.

    Obvious question, but has anyone actually got up at this time and cast their eyes on the problem as it's happening?

    There are some fundamental pieces of information that you don't seem to have been able to find yet, like the specific statement that's not returning and what the wait events are, whether there is any blocking etc. or whether this problem is not in the SQL Engine at all and it's all hung in SSIS.

    You can obviously build up monitoring to record this stuff historically (or get something 3rd party like SQL Monitor/Diagnostic Manager), but eyes on the problem as it's happening is always a good way to troubleshoot!

    I've checked while it is running. There's nothing blocking. It is the initial INSERT query in the SSIS package that is hanging.

    I plan on turning logging on in the SSIS package, just have to figure out what to track. I don't see any wait events. How would one look at these?

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • I think you should invest whether the query plan at the problematic point in time is a fresh one or a new one. You can use this query to search the cach:

    SELECT est.text, a.attrlist, qs.*

    FROM sys.dm_exec_query_stats qs

    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) est

    CROSS APPLY (SELECT epa.attribute + '=' + convert(nvarchar(127), epa.value) + ' '

    FROM sys.dm_exec_plan_attributes(qs.plan_handle) epa

    WHERE epa.is_cache_key = 1

    ORDER BY epa.attribute

    FOR XML PATH('')) AS a(attrlist)

    WHERE est.text LIKE '%<some significant query text here>%'

    AND est.text NOT LIKE '%sys.dm_exec_plan_attributes%'

    Beware that the same query can have multiple cache entries, because of different plan attributes. Typically, when you run queries from SSMS you get a different plan than in an application, because of different settings for ARITHABORT.

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Erland Sommarskog (8/13/2013)


    I think you should invest whether the query plan at the problematic point in time is a fresh one or a new one. You can use this query to search the cach:

    SELECT est.text, a.attrlist, qs.*

    FROM sys.dm_exec_query_stats qs

    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) est

    CROSS APPLY (SELECT epa.attribute + '=' + convert(nvarchar(127), epa.value) + ' '

    FROM sys.dm_exec_plan_attributes(qs.plan_handle) epa

    WHERE epa.is_cache_key = 1

    ORDER BY epa.attribute

    FOR XML PATH('')) AS a(attrlist)

    WHERE est.text LIKE '%<some significant query text here>%'

    AND est.text NOT LIKE '%sys.dm_exec_plan_attributes%'

    Beware that the same query can have multiple cache entries, because of different plan attributes. Typically, when you run queries from SSMS you get a different plan than in an application, because of different settings for ARITHABORT.

    This one has 5. I'm not sure how to read this. Do I need to run it after each time the sp in question runs? Is there a viewer for the results?

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Ah, so you have a stored procedure. In that case, there is a somewhat simpler query:

    with cte as (

    select substring(est.text, (qs.statement_start_offset + 2)/2,

    CASE qs.statement_end_offset

    WHEN -1 THEN datalength(est.text)

    ELSE (qs.statement_end_offset - qs.statement_start_offset + 2) / 2

    END) as stmt,

    qs.* from sys.dm_exec_query_stats qs

    cross apply sys.dm_exec_sql_text(qs.sql_handle) est

    where objectid = object_id('your_sp')

    )

    select * from cte

    ORDER by statement_start_offset

    The idea is the same, just a little more exact. This gives you the plan information for all statements in the procedure. Since it's hanging on the first INSERt statement, that is the one of interest.

    You don't need to run the query right after the fact. The interesting columns are execution_count and creation_date. Is creation_date in conjunction to the problems? Before? After?

    As for why there are five, this due to that you have different SET options and a few more things. From SSIS you probably run with SET ARITHABORT OFF, while this setting is OFF in SSMS.

    Does the procedure take parameters? Does the parameter values change between calls?

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

Viewing 15 posts - 31 through 45 (of 50 total)

You must be logged in to reply to this topic. Login to reply