What Metrics Do You Collect?

  • Comments posted to this topic are about the item What Metrics Do You Collect?

  • With the benefit of hindsight, I didn't appreciate the wealth of metrics that SQL Server emits as part of its standard operation.  I'm not sure whether it did so in v6.5, but it certainly has from v7 onwards.  As a data engineer I've come to realise just how lucky I was as a DBA to have those metrics.  I think the major DB engines were massively ahead of their time in this respect.

    In the data pipelines and applications I write I think carefully about logs, metrics and traces.  They are invaluable for understanding how the system is performing and for diagnosing problems.  It's worth looking at the work of Charity Majors, whether her blog posts on observability or her book, Observability Engineering.

    My organisation uses OpenTelemetry SDKs and libraries in its apps and data pipelines.  This means that any tool (and there are many) that can ingest OpenTelemetry data can consume what our apps and pipelines emit.
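
    As a minimal sketch of what that looks like in code, the snippet below uses the OpenTelemetry Python SDK to register a meter and emit a counter.  The console exporter, meter name and attribute are illustrative; in practice you would export to whatever backend your tooling ingests from.  OpenTelemetry instrument names don't allow colons, so a metric like File:Found would be recorded under an OTel-friendly name such as file.found.

    ```python
    # Minimal sketch: emitting a counter with the OpenTelemetry Python SDK.
    # ConsoleMetricExporter is only for illustration; swap in your real exporter.
    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    meter = metrics.get_meter("file-pipeline")  # meter name is illustrative

    # "File:Found" from the example below, under an OTel-friendly name.
    files_found = meter.create_counter("file.found", description="Files discovered")
    files_found.add(1, {"pipeline": "daily-load"})  # attribute is illustrative
    ```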

    As an example, if I have a pipeline that loads a file, validates it, sends a message with the validation status and then ingests it into a DB, then my metrics will be counts for the following (a sketch of how these counters might be wired into the pipeline follows the list):

    • File:Found
    • File:Read
    • File:Validate
    • File:ValidateSuccess
    • File:ValidateFailure
    • File:ValidateFailurePercentage
    • Message:ValidationSend
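
    A sketch of wiring those counters into the pipeline stages is below, building on the OpenTelemetry setup above.  The read_file, validate and send_validation_message functions are hypothetical stand-ins for the real pipeline steps; the point is simply where each counter increment sits.  File:ValidateFailurePercentage isn't emitted here because it could just as easily be derived from the success and failure counts when the metrics are queried.

    ```python
    from opentelemetry import metrics

    meter = metrics.get_meter("file-pipeline")

    # Counters corresponding to the list above, under OTel-friendly names.
    file_found       = meter.create_counter("file.found")
    file_read        = meter.create_counter("file.read")
    file_validate    = meter.create_counter("file.validate")
    validate_success = meter.create_counter("file.validate.success")
    validate_failure = meter.create_counter("file.validate.failure")
    message_sent     = meter.create_counter("message.validation.send")

    def read_file(path):                    # hypothetical stand-in for the real loader
        with open(path, "rb") as f:
            return f.read()

    def validate(payload):                  # hypothetical stand-in for real validation
        return len(payload) > 0

    def send_validation_message(path, ok):  # hypothetical stand-in for the real messaging
        print(f"validation of {path}: {'ok' if ok else 'failed'}")

    def process(path):
        file_found.add(1)

        payload = read_file(path)
        file_read.add(1)

        file_validate.add(1)
        ok = validate(payload)
        (validate_success if ok else validate_failure).add(1)

        send_validation_message(path, ok)
        message_sent.add(1)
    ```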

    If you were to look at a graph of each of these, you would quickly establish what normal operation looks like, and oddities would therefore stick out like a sore thumb.  For example (a rough sketch of such checks follows this list), you would expect

    • File:Found and File:Read to show the same or very similar counts
    • File:Found (and File:Read) to show similar volumes over a time period (say 10 file events per hour)
    • ValidateSuccess over any period to be very close to File:Read
    • ValidateFailure and ValidateFailurePercentage to be near to zero.
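
    As a rough illustration of those expectations, the check below takes counts for a window (pulled from whatever backend stores the metrics) and flags the sorts of oddities described; the window size and thresholds are entirely made up.

    ```python
    def check_window(found, read, success, failure, expected_per_hour=10, hours=1):
        """Sanity checks over one window of counts; thresholds are illustrative."""
        warnings = []

        if read < found:                   # File:Read should track File:Found
            warnings.append(f"only {read} of {found} found files were read")

        expected = expected_per_hour * hours
        if not 0.5 * expected <= found <= 2 * expected:
            warnings.append(f"{found} files found, expected roughly {expected}")

        if success < 0.95 * read:          # ValidateSuccess should be close to File:Read
            warnings.append(f"only {success} of {read} read files validated")

        pct_failed = 100 * failure / read if read else 0.0
        if pct_failed > 5:                 # ValidateFailurePercentage should be near zero
            warnings.append(f"{pct_failed:.1f}% of files failed validation")

        return warnings

    # e.g. check_window(found=3, read=3, success=3, failure=0)
    #   -> ["3 files found, expected roughly 10"]
    ```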

    Our systems can be working as intended, but if we see File:Found events running a lot lower or higher than normal, it is something to investigate.  Whatever provides files to our system may be broken, or some change may have resulted in more files than expected being sent.

    As a real-world example, I know of a team who saw a massive spike in the number of messages their cloud infrastructure was expected to process.  They found out that someone had connected the wrong message queue to their ingestion pipeline.  Being able to spot that quickly prevented an immense cloud bill!
