May 25, 2022 at 7:55 am
Hi,
Whenever I get a high CPU alert, be it during non-working hours or at midnight, I want to run sp_whoisactive and log the output to a table, or run a server-side trace or an XE session for 10 minutes and then stop it. How do I do that? Inside a job, what logic do I need to stop the trace or stop the job?
All I am trying to do is collect the SQL statements that were running at the time of the issue.
Best Regards,
Bob
May 25, 2022 at 10:58 am
You don't do it when you have high CPU - you have it running all the time, at a 10-, 20-, 30-, or 60-second interval or so, depending on how busy your system is and on what you consider an excessive time for a SQL statement to run.
see https://www.brentozar.com/archive/2016/07/logging-activity-using-sp_whoisactive-take-2/
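For reference, a minimal sketch of the logging call that article describes, assuming sp_WhoIsActive is installed; dbo.WhoIsActiveLog is an illustrative table name, and you would generate its real definition first with sp_WhoIsActive's @return_schema option:

-- Hedged sketch: log current activity to a table with sp_WhoIsActive.
-- dbo.WhoIsActiveLog is a made-up name; create it first using the
-- definition sp_WhoIsActive emits with @return_schema = 1.
exec dbo.sp_WhoIsActive
    @get_plans = 1,                              -- include query plans
    @find_block_leaders = 1,                     -- flag heads of blocking chains
    @destination_table = 'dbo.WhoIsActiveLog';   -- insert rows instead of returning a result set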
I also use the following - again in a SQL Server Agent job running every 10 seconds.
create table dbo.DBA_QueriesSnapshot
(
Snapshot_Time datetime not null
, database_id smallint not null
, Database_Name nvarchar(128) null
, session_id smallint not null
, login_name nvarchar(128) not null
, host_name nvarchar(128) null
, program_name nvarchar(128) null
, client_interface_name nvarchar(32) null
, start_time datetime not null
, status nvarchar(30) not null
, command nvarchar(32) null
, wait_type nvarchar(60) null
, open_transaction_count int not null
, memory_usage int not null
, deadlock_priority int not null
, row_count bigint not null
, reads bigint not null
, writes bigint not null
, WaitTimemilliseconds int not null
, resource_description nvarchar(3077) null
, blocking_session_id smallint null
, ElapsedTimemilliseconds int not null
, statement_text nvarchar(max) null
, FullText nvarchar(max) null
, query_plan xml null
)
with
(
data_compression = page
)
create clustered index DBA_QueriesSnapshot_ix1 on dbo.DBA_QueriesSnapshot
( Snapshot_Time ASC
)
with ( data_compression = page)
And then this is the query to add to the SQL Agent job step:
insert into DBA_QueriesSnapshot
select getutcdate() as Snapshot_Time
, qs.database_id
, db_name(qs.database_id) as DatabaseName
, qs.session_id
, es.login_name
, es.host_name
, es.program_name
, es.client_interface_name
, qs.start_time
, qs.status
, qs.command
, qs.wait_type
, qs.open_transaction_count
, es.memory_usage
, es.deadlock_priority
, es.row_count
, es.reads
, es.writes
, qs.wait_time as WaitTimemilliseconds
, waits.resource_description
, waits.blocking_session_id
, qs.total_elapsed_time as ElapsedTimemilliseconds
, substring(st.text, (qs.statement_start_offset / 2) + 1,
((case qs.statement_end_offset
when -1
then datalength(st.text)
else qs.statement_end_offset
end - qs.statement_start_offset) / 2) + 1) as statement_text
, st.text as FullText
, qp.query_plan
from sys.dm_exec_requests as qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) as st
cross apply sys.dm_exec_query_plan(qs.plan_handle) as qp
inner join sys.dm_exec_sessions es
on es.session_id = qs.session_id
left outer merge join sys.dm_os_waiting_tasks waits
on waits.session_id = es.session_id
and waits.waiting_task_address = qs.task_address
where qs.session_id <> @@spid
and isnull(qs.wait_type, '') <> 'TRACEWRITE' -- wait_type is null when a request isn't waiting, so don't let this filter drop those rows
-- change as required - this is 5 seconds (total_elapsed_time is in milliseconds); many times 10 seconds is enough
-- only report queries that have been running for more than 5 seconds
and qs.total_elapsed_time > 5000
and substring(st.text, (qs.statement_start_offset / 2) + 1,
((case qs.statement_end_offset
when -1
then datalength(st.text)
else qs.statement_end_offset
end - qs.statement_start_offset) / 2) + 1)
not like 'WAITFOR DELAY%'
and isnull(es.program_name, '') not like 'DatabaseMail%' -- program_name can be null, so keep those rows
and qs.command not like 'backup%'
--and db_name(qs.database_id) not in ('DBATools') -- exclude databases if/as required
--and qs.command not like 'DBCC%' -- exclude particular commands if/as required
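If it helps, here is a rough sketch of wiring that step into an Agent job that fires every 10 seconds; the job and schedule names are illustrative, and you would paste the insert above into @command:

-- Illustrative only: job/schedule/database names are placeholders.
use msdb;
exec dbo.sp_add_job @job_name = N'DBA - Queries Snapshot';
exec dbo.sp_add_jobstep
    @job_name = N'DBA - Queries Snapshot',
    @step_name = N'Capture snapshot',
    @subsystem = N'TSQL',
    @database_name = N'DBATools',            -- wherever dbo.DBA_QueriesSnapshot lives
    @command = N'insert into DBA_QueriesSnapshot ... -- the query above';
exec dbo.sp_add_jobschedule
    @job_name = N'DBA - Queries Snapshot',
    @name = N'Every 10 seconds',
    @freq_type = 4,                          -- daily
    @freq_interval = 1,
    @freq_subday_type = 2,                   -- sub-day units = seconds
    @freq_subday_interval = 10;              -- every 10 seconds
exec dbo.sp_add_jobserver @job_name = N'DBA - Queries Snapshot';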
May 26, 2022 at 3:34 pm
So, on a prod server, when the DB team receives a high CPU alert or incident for a sustained period, what would be the response?
Do you check the currently running SQL statements driving the high CPU, collect those statements, inform the app team that these are consuming high CPU so they can try to optimize them, and kill those high-CPU queries?
May 26, 2022 at 4:34 pm
Having one of the above running, you always have the statements on the database for investigation.
Even if they aren't consuming high CPU, having the history (mainly from the second query I gave) enables you to look at which queries are taking more than x seconds to execute, what their normal waits are (in most, but not all, cases you can exclude the parallelism waits), as well as which ones are blocking others.
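As a hedged example of that kind of digging against the table above (the one-day window and 30-second threshold are arbitrary):

-- Which statements repeatedly exceed 30 seconds, and on what do they wait?
select statement_text
     , wait_type
     , count(*)                     as samples_seen
     , max(ElapsedTimemilliseconds) as max_elapsed_ms
     , max(blocking_session_id)     as sample_blocker
from dbo.DBA_QueriesSnapshot
where Snapshot_Time > dateadd(day, -1, getutcdate())
  and ElapsedTimemilliseconds > 30000
group by statement_text, wait_type
order by max_elapsed_ms desc;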
This history can then be used to identify badly performing queries and go back to the owners of those queries to advise on better ways of doing them (and no, you should not just tell them they are wrong and that THEY have to fix them - you, as a DBA, should look at the queries, identify ways of improving them, and suggest possible solutions to the owners).
Regarding killing high-CPU queries - that should ALWAYS be a last resort, done only after agreeing with the business, and only if the query is preventing other processes from executing.
High CPU is not always bad - it just depends on what else needs to be using the server at that time.
I have servers that, at times during batch processing, run at a constant 95% CPU - this is desirable, and the processes were set up to multithread in a way that does not make the server work at 100%.
May 27, 2022 at 1:23 pm
Alternatively, rather than building that yourself, get a SQL monitoring and reporting tool. The good ones will fire off an alert when a condition is met AND provide you with historical information about what was going on at that time.
Nice thing about buying it - support! You don't have to develop the scripts and figure out how you are going to capture the data, and it will alert you when problems other than just high CPU occur (such as instance offline, instance unreachable, SQL Agent not running, low disk space, deadlocks, excessive blocking, etc.). Now you CAN build things for this yourself, but why reinvent the wheel? And if you do build it yourself and a new version of SQL Server has a better method to capture the data (XE, for example), do you REALLY want to go in and change all of your code to use the new and better capture method, OR would you rather just have an update to the tool fix the problem?
PLUS the reporting tools offer a nice UI for you to view and manage all of the alerts! One (usually web) interface to review all alerts, and sometimes it even offers recommendations on what to do with an alert. And if it doesn't capture everything you need, you can add custom metrics for your specific systems. For example, if you store serial numbers in an INT column, you MAY want an alert when you are about to run out of INTs and need to change the column to BIGINT. It is easy to add that to a monitoring tool and set up an alert for when the value exceeds a specific number.
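A hedged sketch of the kind of query you could register as that custom metric (dbo.SerialNumbers is a made-up table name; most tools just want a query returning one number):

-- How much of the INT identity range (max 2,147,483,647) has been used?
-- Alert when this exceeds, say, 80.
select cast(ident_current('dbo.SerialNumbers') as bigint) * 100
       / 2147483647 as pct_of_int_range_used;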
I also agree with frederico_fonseca - high CPU isn't always a bad thing. In a lot of cases with SQL Server, it is preferred. Unused CPU is wasted CPU. Now, you likely don't want it running at 100% constantly, but having it in the 80% range frequently means your server is being utilized.
My opinion - it is never a good practice to "look for issues". Find the problem you are trying to solve and solve that problem. If your "problem" is high CPU BUT it is not impacting anyone or anything, then it is likely not a problem. IF end users are reporting that their queries are sometimes fast and sometimes slow, investigate what is happening during the slow times AND the fast times. It COULD be something like parameter sniffing, poor indexes, or outdated statistics, all of which can be easy to fix. Not always easy to fix, but sometimes it can be an easy fix.
If nobody is noticing any issues and you are just seeing high CPU at times, I personally wouldn't worry about it UNLESS you are seeing problems that end users are not actually reporting, like slow-running queries, lots of long-running transactions, or lots of blocking or deadlocks. AND the other thing with high CPU is that it MAY not even be SQL Server causing it. It COULD be something like the antivirus. Also, when I say "lots", that is VERY subjective and system specific. On my systems, 2 blocked queries may be "lots", but on yours the threshold for "lots" may be 200.
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.