What Metrics Do You Collect?

  • Comments posted to this topic are about the item What Metrics Do You Collect?

  • With the benefit of hindsight, I didn't appreciate the wealth of metrics that SQL Server emits as part of its standard operation.  I'm not sure whether it did so in v6.5, but it certainly has from v7 onwards.  As a data engineer I've come to realise just how lucky I was as a DBA to have those metrics.  I think the major DB engines were massively ahead of their time in this respect.

    In the data pipelines and applications I write I think carefully about logs, metrics and traces.  They are invaluable for understanding how the system is performing and for diagnosing problems.  It's worth looking at the work of Charity Majors, whether her blog posts on observability or her book, Observability Engineering.

    My organisation uses OpenTelemetry SDKs and libraries in its apps and data pipelines.  This means that any tool (and there are many) that can ingest OpenTelemetry data can consume what our apps and pipelines emit.
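
    As a minimal sketch of what that looks like in code, the snippet below uses the OpenTelemetry Python SDK to register a meter and emit a counter.  The console exporter, meter name and attribute are illustrative; in practice you would export to whatever backend your tooling ingests from.  OpenTelemetry instrument names don't allow colons, so a metric like File:Found would be recorded under an OTel-friendly name such as file.found.

    ```python
    # Minimal sketch: emitting a counter with the OpenTelemetry Python SDK.
    # ConsoleMetricExporter is only for illustration; swap in your real exporter.
    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    meter = metrics.get_meter("file-pipeline")  # meter name is illustrative

    # "File:Found" from the example below, under an OTel-friendly name.
    files_found = meter.create_counter("file.found", description="Files discovered")
    files_found.add(1, {"pipeline": "daily-load"})  # attribute is illustrative
    ```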

    As an example, if I have a pipeline that loads a file, validates it, sends a message with the validation status and then ingests it into a DB, then my metrics will be counts for the following (a sketch of how these counters might be wired into the pipeline follows the list):

    • File:Found
    • File:Read
    • File:Validate
    • File:ValidateSuccess
    • File:ValidateFailure
    • File:ValidateFailurePercentage
    • Message:ValidationSend
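
    A sketch of wiring those counters into the pipeline stages is below, building on the OpenTelemetry setup above.  The read_file, validate and send_validation_message functions are hypothetical stand-ins for the real pipeline steps; the point is simply where each counter increment sits.  File:ValidateFailurePercentage isn't emitted here because it could just as easily be derived from the success and failure counts when the metrics are queried.

    ```python
    from opentelemetry import metrics

    meter = metrics.get_meter("file-pipeline")

    # Counters corresponding to the list above, under OTel-friendly names.
    file_found       = meter.create_counter("file.found")
    file_read        = meter.create_counter("file.read")
    file_validate    = meter.create_counter("file.validate")
    validate_success = meter.create_counter("file.validate.success")
    validate_failure = meter.create_counter("file.validate.failure")
    message_sent     = meter.create_counter("message.validation.send")

    def read_file(path):                    # hypothetical stand-in for the real loader
        with open(path, "rb") as f:
            return f.read()

    def validate(payload):                  # hypothetical stand-in for real validation
        return len(payload) > 0

    def send_validation_message(path, ok):  # hypothetical stand-in for the real messaging
        print(f"validation of {path}: {'ok' if ok else 'failed'}")

    def process(path):
        file_found.add(1)

        payload = read_file(path)
        file_read.add(1)

        file_validate.add(1)
        ok = validate(payload)
        (validate_success if ok else validate_failure).add(1)

        send_validation_message(path, ok)
        message_sent.add(1)
    ```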

    If you were to look at a graph of each of these, you would quickly establish what normal operation looks like, and oddities would therefore stick out like a sore thumb.  For example (a rough sketch of such checks follows this list), you would expect

    • File:Found and File:Read to show the same or very similar counts
    • File:Found (and File:Read) to show similar volumes over a time period (say 10 file events per hour)
    • ValidateSuccess over any period to be very close to File:Read
    • ValidateFailure and ValidateFailurePercentage to be near to zero.
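
    As a rough illustration of those expectations, the check below takes counts for a window (pulled from whatever backend stores the metrics) and flags the sorts of oddities described; the window size and thresholds are entirely made up.

    ```python
    def check_window(found, read, success, failure, expected_per_hour=10, hours=1):
        """Sanity checks over one window of counts; thresholds are illustrative."""
        warnings = []

        if read < found:                   # File:Read should track File:Found
            warnings.append(f"only {read} of {found} found files were read")

        expected = expected_per_hour * hours
        if not 0.5 * expected <= found <= 2 * expected:
            warnings.append(f"{found} files found, expected roughly {expected}")

        if success < 0.95 * read:          # ValidateSuccess should be close to File:Read
            warnings.append(f"only {success} of {read} read files validated")

        pct_failed = 100 * failure / read if read else 0.0
        if pct_failed > 5:                 # ValidateFailurePercentage should be near zero
            warnings.append(f"{pct_failed:.1f}% of files failed validation")

        return warnings

    # e.g. check_window(found=3, read=3, success=3, failure=0)
    #   -> ["3 files found, expected roughly 10"]
    ```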

    Our systems can be working as intended, but if we see File:Found events running a lot lower or higher than normal, it is something to investigate.  Whatever provides files to our system may be broken, or some change may have resulted in more files than expected being sent.

    As a real-world example, I know of a team who saw a massive spike in the number of messages their cloud infrastructure was expected to process.  They found out that someone had connected the wrong message queue to their ingestion pipeline.  Being able to spot that quickly prevented an immense cloud bill!
