Send Metrics Not Logs

  • Comments posted to this topic are about the item Send Metrics Not Logs

  • At present we use cloud provider specific solutions for logging, metrics and traces.

    Our experience with logging led us to conclude that a lot more thought needs to go into logging, in particular:

    • Message content
    • Message structure
    • Log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
    • What is worth logging in the first place

    Too many log statements turned out to be the equivalent of debug.print statements used while building code, rather than genuinely useful messages designed for monitoring production code.

    We also found that exception handling was too generic: an all-encompassing try/catch rather than catches targeted at specific exceptions.
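    As a rough illustration of the difference, here is a minimal Python sketch of targeted catches. The parse_order function, the "orders" logger name and the payload fields are all hypothetical, chosen only to show the pattern: each expected failure gets its own handler and one concise message, and anything unexpected is left to propagate instead of being swallowed.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("orders")  # hypothetical logger name


def parse_order(raw: str) -> Optional[dict]:
    """Parse a JSON order payload, handling only the failures we expect."""
    try:
        order = json.loads(raw)
        return {"id": order["id"], "quantity": int(order["quantity"])}
    except json.JSONDecodeError as exc:
        # Malformed payload: one concise WARNING is enough.
        logger.warning("order payload is not valid JSON: %s", exc)
    except KeyError as exc:
        # A required field is absent.
        logger.warning("order payload is missing field %s", exc)
    except (TypeError, ValueError) as exc:
        # Wrong shape, or a quantity that cannot be converted to int.
        logger.warning("order payload has an invalid value: %s", exc)
    # Any other exception type is deliberately not caught here.
    return None
```

    A blanket "except Exception" in the same place would log every failure identically, which is how logs end up bloated and hard to search.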

    This leads to hugely bloated logs and makes diagnosing problems harder rather than easier.  We have also learned the importance of recording the trace_id in every log message so that logs and traces can be joined together.  Most of the time logs exist but we don't look at them.  It's when we need them that the lack of design, patterns and thought becomes apparent.
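    One way to get a trace id onto every log line, sketched with Python's standard logging module. This is an assumption about how it could be done, not a description of any particular tracing library: the current_trace_id variable, the "payments" logger name and the sample id are made up for the example, and in a real service the id would come from whatever tracing context is in use.

```python
import contextvars
import io
import logging

# Hypothetical holder for the trace id of the request being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace id onto every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("payments")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # made-up id
log.info("card authorised")
print(buffer.getvalue(), end="")
```

    With the id in a fixed position in every message, finding the log lines for a slow trace becomes a simple search on trace_id.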

    For metrics we found that we weren't measuring enough things, and that OTel (OpenTelemetry) was our preferred approach.  This makes it possible for our clients to choose any tool that can consume OTel events.  The metrics are essential for alarms and alerting, which tend to be of the "threshold x has been breached for y seconds" type.
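    The "threshold x has been breached for y seconds" rule can be sketched in a few lines of plain Python. This does not use any OpenTelemetry or alerting API; the function name, the sample data and the thresholds are invented purely to show the logic a monitoring tool would apply to a stream of (timestamp, value) samples.

```python
from typing import Iterable, Optional, Tuple


def first_alarm_time(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> Optional[float]:
    """Return the first timestamp at which the value has stayed above
    threshold for at least hold_seconds, or None if that never happens."""
    breach_started: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp  # start of a new breach
            if timestamp - breach_started >= hold_seconds:
                return timestamp
        else:
            breach_started = None  # breach ended, reset the clock
    return None


# Made-up CPU readings: (seconds since start, percent busy)
cpu_samples = [(0, 70), (10, 95), (20, 96), (30, 60),
               (40, 91), (50, 93), (60, 97), (70, 99)]
alarm_at = first_alarm_time(cpu_samples, threshold=90, hold_seconds=30)
```

    The short spike between 10 and 20 seconds does not raise an alarm because it resets at 30 seconds; only the sustained breach that starts at 40 seconds does.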

    Logs are what has happened.  Metrics can also tell you what is happening or is about to happen.

    Traces tell us the sequence and timing of events.  If you are using Chrome as a browser then you can see an example of what a trace looks like.

    • In the right-hand toolbar choose the menu icon (three vertical dots or bars)
    • Choose "More Tools"
    • Choose "Developer Tools"
    • In the resulting pane, choose Network
    • Load any web page

    You will get every call made from the web page with the sequence and duration of each event.  This can be VERY interesting.

    In the trace for our apps, if we see a particular event taking a long time then, as mentioned earlier, we know to look for that trace_id in the log messages to get more details of what has been going on.

    In terms of tooling we are interested in https://www.honeycomb.io/, which was co-founded by Charity Majors, who co-wrote Database Reliability Engineering.  She also blogs on observability.

  • I've known about observability for a while now, but still struggle to get my head around it. Especially how it is different from monitoring. I look forward to this series on observability.

    Rod

  • Thank you for this article.  I am struggling with a much, much smaller problem. I am pulling the log data from one SQL Server to a central database.  The monitored server is busy and I really don't want to add to the burden on that server.  I am beginning to realize that it would be best to keep the detailed logs on the monitored server and to only push back the metrics I need.

    One thought is that if you need the detailed log data, the user can go to the monitored server to read the details.  If trying to do more detailed analysis over a long period of time, then bring the details over to the central database off hours.  The data will be good enough for most purposes.  Then if you need to dig into details, go to the monitored server.

    Russel Loski, MCSE Business Intelligence, Data Platform

  • Doctor Who 2 wrote:

    I've known about observability for a while now, but still struggle to get my head around it. Especially how it is different from monitoring. I look forward to this series on observability.

    I've always thought it was the implementation of things that give you something to monitor.  There's probably a lot more to it than that.

     

  • David.Poole wrote:

    At present we use cloud provider specific solutions for logging, metrics and traces.

    ...

    In terms of tooling we are interested in https://www.honeycomb.io/, which was co-founded by Charity Majors, who co-wrote Database Reliability Engineering.  She also blogs on observability.

    Lol, I'm actually reading Observability Engineering, which she co-authored. It's OK, written a little too I-expect-you-to-know-some-of-this instead of leading us from logging/monitoring towards observability, but once you work through the first 4 chapters, it gets better.

  • Russel Loski wrote:

    Thank you for this article.  I am struggling with a much, much smaller problem. I am pulling the log data from one SQL Server to a central database.  The monitored server is busy and I really don't want to add to the burden on that server.  I am beginning to realize that it would be best to keep the detailed logs on the monitored server and to only push back the metrics I need.

    One thought is that if you need the detailed log data, the user can go to the monitored server to read the details.  If trying to do more detailed analysis over a long period of time, then bring the details over to the central database off hours.  The data will be good enough for most purposes.  Then if you need to dig into details, go to the monitored server.

    I think avoiding eating up resources by sending logs is good. Another interesting thing in observability is capturing the details as structured logs. We might think of this as putting a trace/XEvent path in a table; others use text, but formatted in a known csv/tsv/psv/etc. format. The idea is to trace all actions into a series of linked items that make up one event for a user action.  In the db world, we might have the call stack for a process, along with exec plans, to figure out what actually happened.

    Sending aggregates is good if you know what to send. If not, I'd send something, but keep details on the server for people to go look through.
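    As a small sketch of what "keep the details, send the aggregates" could look like, here is a Python example that rolls detailed rows up into one summary per minute. The row layout, the event names and the summarise function are assumptions made up for the illustration; in practice the detail would stay in a table or log on the monitored server and only the summary would be pushed to the central database.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical detailed log rows: (minute, event type, duration in ms)
detail_rows = [
    ("09:00", "query", 120),
    ("09:00", "query", 80),
    ("09:00", "error", 0),
    ("09:01", "query", 200),
]


def summarise(rows):
    """Reduce detailed rows to one small metrics record per minute."""
    durations = defaultdict(list)
    errors = defaultdict(int)
    for minute, event, duration_ms in rows:
        if event == "error":
            errors[minute] += 1
        else:
            durations[minute].append(duration_ms)
    minutes = sorted(set(durations) | set(errors))
    return [
        {
            "minute": m,
            "queries": len(durations[m]),
            "avg_ms": mean(durations[m]) if durations[m] else 0,
            "errors": errors[m],
        }
        for m in minutes
    ]


metrics = summarise(detail_rows)
```

    Four detail rows become two metric rows here; on a busy server the reduction would be far larger, which is the point of sending the summary rather than the logs.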

  • I'm surprised a typical Chick-fil-A would be generating a massive amount of telemetry and operational data, but in rural locations, it's probably constrained by lack of broadband internet.

    I suspect the live video feeds (not logs) contribute the most to bandwidth consumption.

     

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • If you watch the video, there are a lot of telemetry things. I don't think they send any video, but these stores might not have great bandwidth anyway.

  • Steve Jones - SSC Editor wrote:

    If you watch the video, there are a lot of telemetry things. I don't think they send any video, but these stores might not have great bandwidth anyway.

    POS transactional data is relatively small. But if they're like most fast food franchises, then I imagine they have webcams in the kitchen, stock room, etc., and back at corporate HQ there is a room full of monitors where they keep an eye on things from a quality control and security perspective. That video can be a challenge to squeeze through a DSL or satellite internet connection.

    https://www.securitymagazine.com/articles/95630-uses-for-smart-cameras-in-fast-casual-restaurants

     

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
