SQLServerCentral Editorial

What Metrics Do You Collect?

One of the hot terms in software these days is observability. There are a few definitions (Splunk, RadixWeb), but essentially it is insight into how your software runs and performs, gained from metrics, logs, traces, etc. In DevOps, we do this with an eye toward improving performance and identifying the root cause of issues. The focus is slightly different from monitoring, where we often focus more on resources and health. We need both, but developers trying to improve software and its behavior for users often need observability. Infrastructure people responding to acute issues and looking to ensure we have the capacity, availability, and other x-bilities need monitoring.

Today I'm wondering if you collect a variety of types of metrics for your software that might tell you how your system is running. What things are important to you in order to better serve your clients? If you're a DBA/sysadmin, what is important to you? If you are a developer, are there different types of data you want?

Certainly, you might collect various resource measures (CPU, IO, reads, etc.), but there are many more things. There are logs, which could include the SQL Server error log, but I'd hope that you had a more in-depth way of measuring the activity on your system. Do you have custom xEvent traces running? Can you collect application logs easily when you're looking at issues? Do you spend time trying to solve chronic issues? Do you look for potential future problems?

Most software applications should include some sort of basic logging of major functions, but I would hope that there are various levels available. If problems are reported or noticed, can you increase the detail of logging? Can you correlate this with logs from different systems, such as the database? Can you get execution data that matches calls and track down the potential issues that users are experiencing?
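As a rough illustration of what adjustable detail and correlation might look like, here is a minimal Python sketch using the standard logging module. The logger name, the order example, and the correlation ID format are all hypothetical, and this is only one possible approach, not a recommendation for any particular stack.

```python
import io
import logging
import uuid


def build_logger(stream, level=logging.INFO):
    """Create an application logger whose level can be changed at runtime."""
    logger = logging.getLogger("orders_app")  # hypothetical application name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    # Every line carries a correlation ID so it can be matched to other logs.
    handler.setFormatter(
        logging.Formatter("%(levelname)s corr=%(correlation_id)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


def handle_request(logger, order_id, correlation_id=None):
    """Log one hypothetical request, tagging each line with the same ID."""
    correlation_id = correlation_id or uuid.uuid4().hex
    log = logging.LoggerAdapter(logger, {"correlation_id": correlation_id})
    log.info("processing order %s", order_id)
    # Only emitted once someone raises the detail level to DEBUG.
    log.debug("about to call database for order %s", order_id)
    return correlation_id


buffer = io.StringIO()
app_logger = build_logger(buffer)

# Normal operation: only the INFO line is written.
handle_request(app_logger, 42, correlation_id="abc123")

# A problem is reported, so increase the detail without redeploying.
app_logger.setLevel(logging.DEBUG)
handle_request(app_logger, 42, correlation_id="abc123")

output_lines = buffer.getvalue().splitlines()
print("\n".join(output_lines))
```

If the same ID is also recorded on the database side, however your system chooses to do that, the application lines and the database activity for one user call can be lined up afterward.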

I know that we have problems in applications. Some (many?) of these are data-related, which might be easy or hard to track down. That often depends on whether the data changes too quickly or is static enough for someone to investigate a report. Some errors are logical code errors, which might indicate a lack of testing early in the software process, but we ought to be able to determine this quickly from logs. If there are performance issues, and we have a lot of these, how easily can we verify a problem?

Let us know what types of metrics help you solve issues. I certainly think that you should have some sort of monitoring and observability system in place that helps you dive deep into the database, especially concerning execution plans at the time of the issue. The situation can change quickly inside a database, so capturing data regularly is important. If there is something you wish you had, let us know as well. Maybe someone else will have a neat solution for you.
