DBAs are in charge of performance on production SQL servers and, sure, they hate to hear complains from end-users that the system is slow. But the truth is that often the complains are not groundless. As long as for developers performance is a last thing to care about after delivering a required functionality, providing wonderful interface, debugging and loud celebrating a new release shipped to production, then we have what we have. As a result, managers use to ask “What’s going on on the server?” and DBAs really need to have a clear and accurate answer.
Getting active processes
For myself I created a stored procedure sp_ActiveProcesses (Listing 1) showing what's running on the SQL Server. The sp has one parameter - time interval - to retrieve snapshot of running processes. Default is 5 sec. It can be decreased to make the snapshot more instantaneous or increased to catch consumers for a longer term. A process to hit into the snapshot must to be running at the beginning and at the end of the period. It's why there is no sense to make the interval too long (though maximum allowed is 59 sec).
I run the sp so often that I had to create a shortcut in my Query Analyzer (Tools/Customize/Custom ...). Install the sp, set the shortcut assigning the sp, say, to Ctrl-4, press these two buttons and you've got the picture (the list of processes wrapped to fit on the page):
ProcessId | TotalCPU | CPU_ConsumedInTheTimeFragment | TotalPhysical_IO | IO_InTheTimeFragment | ... |
---|---|---|---|---|---|
---------- | --------- | ------------------------------ | ----------------- | --------------------- | ... |
55 | 239 | 109 | 21 | 10 | ... |
85 | 31328 | 31 | 7521 | 536 | ... |
88 | 5678 | 1001 | 795 | 164 | ... |
... | Hostname | ApplicationName | NT_LoginName | DatabaseName | SPIDBuffer |
---|---|---|---|---|---|
--------- | ---------------- | -------------- | ------------ | ----------------------- | |
BillS | MSP | Company\Bill | MSP | GetContacts | |
KirkA | MSP | Company\Kirk | MSP | ReassignTimeApprover | |
KimN | MSP | Company\Kim | MSP | InsertExpense |
TheFragmentDuration | NumberOfCPUs | SUM_CPU_Consumed | SUM_Physical_IO_Committed |
---|---|---|---|
------------------------ | --------------- | ---------------- | --------------------------------- |
5123 | 2 | 1141 | 710 |
Maximum SUM_CPU_Consumed can be TheFragmentDuration times NumberOfCPUs. In the example above it is 10246 milliseconds.
Just one note about accuracy of the numbers being showed by the sp. Microsoft Knowledge Base Article - 309377 says: In Microsoft SQL Server 2000 (all editions) CPU time for a particular server process ID (SPID) may not accumulate correctly.
It is explained as: SQL Server maintains a pool of worker threads that execute the tasks given to it. SQL Server may assign a different thread from this pool to a given SPID to execute a new query batch. When SQL Server assigns a different thread to a SPID, SQL Server does not properly calculate the CPU time to indicate the accumulated CPU time up to that point for the SPID.
Microsoft has confirmed this to be a problem in SQL Server 2000. This problem was first corrected in Microsoft SQL Server 2000 Service Pack 2.
The wrong calculation almost never happens to user processes, which are most important for us, even there is no SP2 for SQL Server 2000 installed. In contrary, I saw lots of such cases with system processes, for instance, replication agents.
Getting blocked processes
There are two kinds of SQL Server processes in the scope of DBA's concern:
1. Consumers
2. Long-runners
Consumers eat up server resources. By identifying and optimizing the consuming codes you can free some room for other processes. Long-runners are codes having outstanding duration. Exactly they make end-users uncomfortable. However for developers to provide optimization it is more important "what consumes more" than "what works longer". Because sometimes innocent processes take minutes being blocked by other processes (or maybe not blocked, but just waiting for resources - CPU, I/O to be released). And developers can do nothing about the cases. Not these codes but rather those blocking/consuming codes must be optimized. In other words: “consumption” indicates how healthy is the code, “duration” indicates how healthy is the system (hardware, surrounding processes etc).
To identify processes blocked by others I created sp_BlockedProcesses (Listing 2). Assign it to the shortcut, say, Ctrl-5, press the buttons and here we go:
BlockedSPID | BlockedBuffer | BlockingSPID | BlockingBuffer | waitresource | BlockedHostname | ... |
---|---|---|---|---|---|---|
----------- | ------------- | ------------ | -------------- | ------------ | --------------- | ... |
21 | GetLateTasks | 65 | GetImage | 21 | KimN | ... |
5 | SetStatus | 65 | GetImage | 21 | JasonC | ... |
I bet you can recall cases when some simple code was started and a quick reply was expected, but it seemed was hanging. It is quite probable that by pressing Ctrl-5 you'd see what is the matter.
Sending emails with top consumers
Both sp_ActiveProcesses and sp_BlockedProcesses are instantaneous. Of course a DBA needs overall reports showing top consumers and long-runners. I can share how I organized it in the company I work for.
1. A job was scheduled to run every morning to start a trace on production servers. One of the trace parameters specifies time to stop the trace. The job runs at 8:30 AM and the trace stops itself at 5:30 PM, when the peak of user activity on production servers is over. The name of the executed stored procedure: spTraceBuild (Listing 3).
spBuildTrace is based on a very useful stored procedure build_trace from the Microsoft Knowledge Base Article Q283790. I did the following minor modification:
a. Added server name and current date to the trace file name
b. Added error handling. If a mistake was done (for example, an option value is incorrect or a date to stop the trace has been already expired) and the trace wasn't created, it's nice to get a message.
c. Changed the code to expect only time to stop the trace (instead of date/time) - 5:30 PM in my case. I.e. you don't need ever to modify ActivityTrace.ini. The trace will be always stopped at the specified time on the same day it was started.
spTraceBuild gets configuration data from a text file named ActivityTrace.ini. It contents could be like:
@tracefile | = | \\Process02\D$\Program Files\Microsoft SQL Server\MSSQL\LOG\Trace |
@maxfilesize | = | 15000 |
@stoptime | = | 5:30 |
@options | = | 2 |
@events | = | 10,12 |
@columns | = | 1,3,12,13,14,16,17,18 |
@filter1 | = | 10, 0, 7, N'SQL Profiler' |
Apparently you need to modify at least the first parameter, @tracefile, to make it appropriate.
Two types of events give us consumption numbers:
10 - RPC Completed
12 - Batch Completed
2. Another job was scheduled to run every night to absorb the trace file and process it, i.e. to insert the trace data into a SQL Server table and aggregate the information. Why to collect the data into a file to bulkcopy them afterward into a table instead of collecting them directly into the table? Firstly, collecting trace data into a file works faster, secondly, you cannot run a trace programmatically into a table as you do when starting a trace from Profiler. I created the following aggregations:
- top CPU consumers
- top long-runners
3. The processing job sends an e-mail containing the reports to managers of development. Every morning development managers can find the "Top consumers" report in their Inbox. That is important as long as performance is a serious issue in my company. You can schedule the trace and processing/reporting jobs to run once a week, for example, on Tuesday, if user activity and workload do not differ from day to day.
The name of the processing/reporting stored procedure is spProcessTrace (Listing 4). An important part of the code is a UDF fnExtractSPNameFromTextData (Listing 5). Definitely, you can aggregate data from a trace only if codes running against the server are named codes as stored procedures are. Ad hoc queries will be out of the scope. However, I do not think any interactive, frequently executed codes can be running as ad hoc queries, which would need compilation on the fly. Therefore, all real top consumers should be represented in the report.
Run the sp from a scheduled job as:
EXEC spProcessTraceFile@ServerName = 'MyProductionServerName',
@ReportDate = null, -- date the trace file was created on (default - current)
@TraceFilePath = '\\MyProcessingServerName\C$\Program Files\Microsoft SQL Server\MSSQL\LOG\',
@recipients = 'Manager1@MyCompany.com;Manager2@MyCompany.com'
The value specified in @TraceFilePath should match to @tracefile in ActivityTrace.ini.
Resulting report for top CPU consumers looks like:
Top Process02 CPU consumers for Feb 22, 2003:
SP | TimesExecuted | TotalCPU | AverageExecTime | MinCPU | MaxCPU |
---|---|---|---|---|---|
------------------------ | --------------- | ------------ | ------------------ | ------------- | -------- |
RemoveCoManager | 7 | 615531 | 87933 | 0 | 595062 |
UpdateProspect | 110 | 474517 | 4313 | 2328 | 29062 |
AddStandardTaskTime | 673 | 457651 | 680 | 594 | 829 |
TaskbyResource | 2480 | 130684 | 52 | 0 | 6656 |
GetAssetTypes | 5318 | 88720 | 16 | 0 | 78 |
SubmitExpenseById | 1583 | 63696 | 40 | 0 | 719 |
BillingRatesByBillingOffices | 110 | 63164 | 574 | 32 | 1312 |
SessionCleanUp | 1231 | 56099 | 45 | 0 | 19406 |
CheckSummaryTask | 230 | 16443 | 71 | 46 | 110 |
RollupSummaryTask | 207 | 15844 | 76 | 0 | 281 |
CreateBatchNumber | 2720 | 14146 | 5 | 0 | 32 |
RejectTime | 1345 | 13396 | 9 | 0 | 79 |
DeleteBillingRole | 12 | 12108 | 1009 | 578 | 1390 |
ProjectSummary | 143 | 10003 | 69 | 15 | 172 |
GetApprovedInvoices | 12 | 9767 | 813 | 718 | 1032 |
ProgressProject | 228 | 8322 | 36 | 0 | 94 |
AddSubProject | 280 | 7875 | 28 | 0 | 265 |
InsertExpense | 7 | 7422 | 1060 | 0 | 5906 |
LoadCustomer | 16 | 6953 | 434 | 312 | 688 |
PercentOfContractIncurred | 164 | 5790 | 35 | 15 | 47 |
GetTaxes | 8 | 5469 | 683 | 640 | 828 |
RolesFeatures | 6 | 5330 | 888 | 750 | 1016 |
GetWorkflowTypes | 246 | 4519 | 18 | 0 | 78 |
GetDraftInvoices | 250 | 4439 | 17 | 0 | 63 |
Consider locations of the codes and files as the following:
spBuildTrace | msdb on a production server |
spProcessTrace | DB Traces on a processing server |
fnExtractSPNameFromTextData | DB Traces on a processing server |
ActivityTrace.ini | C:\Program Files\Microsoft SQL Server\MSSQL\Binn\ |
on the production server |
Activity Graph
The Results Pane of Query Analyzer does not allow us to represent results graphically, but spActivityGraph (Listing 6) challenges this limitation. The stored procedure uses the trace table created by spProcessTrace. spActivityGraph shows how processes running on the SQL Server interfere with each other. Looking at the graph you can see peak-times, concurrency and how this or that process is taking longer than usually being surrounded by a tough company of other processes:
StartTime | Duration | Text | 11:3011:3911:4911:59 |
---|---|---|---|
11:38:42 | 12033 | InsertExpense | -- |
11:39:10 | 6220 | GetAssetTypes | -- |
11:40:00 | 122810 | GetRequestsToApprove | -------------- |
11:40:06 | 52826 | GetMonthlyRevenue | ------ |
11:40:11 | 30516 | GetApproverTimes | ---- |
11:40:16 | 30527 | GetApproverTimes | ---- |
11:40:17 | 30533 | GetApproverTimes | ---- |
11:40:25 | 30530 | PopulatePlanFact | ---- |
11:40:28 | 30543 | ProjectSummary | ---- |
11:40:28 | 30516 | LoadLeadByResource | ---- |
11:40:30 | 30513 | ProjectSummary | ---- |
11:40:36 | 11736 | SetLockAllTask | -- |
11:40:38 | 21623 | InvoiceByClient | -- |
11:40:42 | 103116 | PopulatePlanFact | ------------ |
11:40:44 | 15780 | GetDraftInvoices | -- |
11:40:49 | 10310 | InsertAd | -- |
11:40:50 | 9513 | ModifyCodeUpdatedExpense | -- |
11:40:51 | 8280 | DeleteBillingRole | -- |
11:40:59 | 60966 | ProjectSummary | -------- |
11:41:04 | 30516 | AutoEscalate | ---- |
11:41:07 | 30446 | GetLicenceUpdate | ---- |
11:41:21 | 5046 | GetImageBatch | - |
spActivityGraph has 6 input parameters:
@TraceTable | - | name of the trace table created in spProcessTrace |
@TraceStart | - | start time of the period |
@TraceEnd | - | end time of the period |
@SPNameLayoutLen | - | width of the column with SP names in the report. Default - 20. |
@DurationLayoutLen | - | width of the column with duration values in the report. Default - 6 |
@LayoutWidth | - | width of the report. Default - 115 symbols (wrapping would make the picture totally messed up). |
I would suggest to build graphs for 30 minutes intervals. If you increase the interval trying to cover the entire day or at least a half of the day in one graph, duration of one unit (shown by dash '-') will be also increased and even processes with duration more than 10 sec will be missed in the graph. What if nonetheless you would like to see the activity for the entire day? No problem: the stored procedure spShowActivityGraphByChunks (Listing 7) will give you the full day picture divided into 0.5-hour pieces. The only 2 mandatory input parameters for the stored procedure (@ServerName, @ReportDate) serve to identify a trace table to work with.
Conclusion
Scripts referenced in the article are available here.
Stored procedures showing instantaneous and overall performance reports give us a clear picture of user activity on production SQL Servers and help us to find ways to make the performance better.