A Collection of Articles on Observability, with My Opinions.
I started learning and implementing Observability after I moved into a full-time DevOps engineer role. Since there are thousands of posts, articles, and whitepapers on this topic, I am trying to consolidate them here for later reference, or as a starting point for a newbie like me!
Everything in ITALICS along with BOLD reflects my own views, formed either by reading a blog, watching a talk, or from personal experience.
- What is Observability?
- Logs
- Metrics
- PULL vs PUSH of Metrics
- PUSH for Events or Metrics
- What metrics to collect?
- Logs vs Metrics
- BlackBox vs WhiteBox Monitoring
- References
Ability to fully understand your systems
GOAL: Gain visibility into the Application behavior
A must-watch if you are new to this topic: Keynote: ...What Does the Future Hold for Observability? - Tom Wilkie & Frederic Branczyk
- Logs
- Metrics
- Distributed Tracing
If I have all the 3 pillars in place, am I good? What do I do next?
Correlation among the pillars!
Observability Superpower: Correlation: https://www.openshift.com/blog/observability-superpower-correlation
Without correlation among the pillars, Observability doesn't make much sense!
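In practice, correlation usually means stamping every piece of telemetry for a request with a shared identifier, so you can jump from a log line to the matching trace or metric exemplar. A minimal sketch (the field names and handler are hypothetical, not from any of the linked articles):

```python
import json
import uuid

def handle_request(path):
    """Handle one request, tagging its telemetry with a shared trace ID."""
    # The same ID would also go on metric exemplars and trace spans,
    # which is what lets you pivot between the three pillars.
    trace_id = uuid.uuid4().hex
    log_line = json.dumps({
        "msg": "request handled",
        "path": path,
        "trace_id": trace_id,  # jump from this log line straight to the trace
    })
    return trace_id, log_line

trace_id, line = handle_request("/checkout")
assert trace_id in line  # the log line can be correlated back to the trace
```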
LIES MY PARENTS TOLD ME (ABOUT LOGS): https://www.honeycomb.io/blog/lies-my-parents-told-me-about-logs/
Logs are expensive
Too much operational cost
Logs and Metrics: https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
By far, the biggest drawback of logs is how operationally and monetarily expensive they are to process and store.
There’s ELK in the open-source space, but no one I know likes or wants to operate ELK.
Try to keep as much information in metrics instead of logs
The start of Metrics with Events: statsd https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
StatsD is a simple NodeJS daemon (and by “simple” I really mean simple — NodeJS makes event-based systems like this ridiculously easy to write) that listens for messages on a UDP port
In general, we tend to measure at three levels: network, machine, and application. Application metrics are usually the hardest, yet most important, of the three.
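The StatsD wire protocol the Etsy post describes is tiny: a client fires a UDP datagram of the form `name:value|type` (`|c` marks a counter) at the daemon and never waits for an acknowledgment. A minimal sketch in Python (the metric name and the default port 8125 are illustrative, though 8125 is StatsD's conventional port):

```python
import socket

def statsd_increment(metric, host="127.0.0.1", port=8125):
    """Fire-and-forget a StatsD counter increment over UDP.

    UDP means no ack and no connection setup: very cheap for the
    application, but packets can be silently dropped under load.
    """
    payload = f"{metric}:1|c".encode()  # 'name:value|type', '|c' = counter
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()
    return payload  # returned only so the sketch is easy to inspect

# e.g. count checkout requests at the application level
assert statsd_increment("shop.checkout.requests") == b"shop.checkout.requests:1|c"
```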
Collect only those metrics that are either graphed (e.g., in Grafana) or used in alerting rules.
https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push
Why do you pull rather than push?
Pulling over HTTP offers a number of advantages:
- You can run your monitoring on your laptop when developing changes.
- You can more easily tell if a target is down.
- You can manually go to a target and inspect its health with a web browser.
Overall, we believe that pulling is slightly better than pushing, but it should not be considered a major point when considering a monitoring system.
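The pull model the Prometheus FAQ describes boils down to the service exposing a plain-text `/metrics` page that anyone (a Prometheus server, or you with a browser) can fetch over HTTP. A minimal sketch of such an endpoint using only the standard library; the metric name is made up, but the `text/plain; version=0.0.4` content type is the Prometheus text exposition format:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 0  # incremented elsewhere by the application

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        # Prometheus text exposition format: 'metric_name value' per line.
        body = f"app_requests_total {REQUESTS_TOTAL}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Manually go to a target and inspect its health": any HTTP client works.
url = f"http://127.0.0.1:{server.server_port}/metrics"
page = urllib.request.urlopen(url).read().decode()
server.shutdown()
```

Because the server is just another HTTP endpoint, "is the target down?" reduces to "did the scrape fail?", which is the advantage the FAQ is pointing at.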
Twitter changed PULL to PUSH https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-ii.html
“Pull” vs “push” in metrics collection: At the time of our previous blog post, all our metrics were collected by “pulling” from our collection agents. We discovered two main issues:
- There is no easy way to differentiate service failures from collection agent failures. Service response timeouts and missed collection requests both manifest as empty time series.
- There is a lack of service quality insulation in our collection pipeline. It is very difficult to set an optimal collection time out for various services. A long collection time from one single service can cause a delay for other services that share the same collection agent.
In light of these issues, we switched our collection model from “pull” to “push” and increased our service isolation. Our collection agent on each host only collects metrics from services running on that specific host. Additionally, each collection agent sends separate collection status tracking metrics in addition to the metrics emitted by the services.
I personally feel PULL is better than PUSH when Service Discovery is in place!
https://www.robustperception.io/which-kind-of-push-events-or-metrics
The primary issue with pushing events is that the volume of data is proportional to the amount of processing your application is doing.
If you have twice the requests, you're going to have twice the events to handle. As you're communicating over the network to the process of collecting these events (and often across machines), this can get problematic in terms of both CPU usage and network traffic.
UDP: Pushing events is often done via UDP, which is unreliable, and it is expected you'll lose a few packets.
TCP: When there's too many events to process you must either have enough RAM to queue them up, or drop some events on the floor as UDP does. The data volume is the fundamental issue here, rather than the exact implementation.
Metrics by contrast have the same resource usage to push them no matter how busy your system is.
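That contrast is easy to make concrete: pushing one event per request costs network volume proportional to traffic, while an in-process counter flushed once per interval costs roughly the same bytes no matter how busy the system is. A toy sketch (metric names are made up):

```python
def event_push_bytes(n_requests):
    """Pushing one event per request: network volume grows with traffic."""
    # One StatsD-style datagram per request.
    return sum(len("shop.checkout:1|c") for _ in range(n_requests))

def metric_push_bytes(n_requests):
    """Aggregating in-process and pushing one counter per interval:
    the payload is roughly the same size however busy the system is."""
    counter = n_requests  # incremented locally, flushed once per interval
    return len(f"shop_checkout_total {counter}")

assert event_push_bytes(2000) == 2 * event_push_bytes(1000)    # linear in traffic
assert metric_push_bytes(2000) - metric_push_bytes(1000) <= 1  # ~constant
```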
- The USE method: how happy are your servers?
- The RED method: how happy are your customers?
- The four Golden Signals: RED + Saturation (how full your services are)
- http://www.brendangregg.com/usemethod.html
- https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- https://grafana.com/files/grafanacon_eu_2018/Tom_Wilkie_GrafanaCon_EU_2018.pdf
- The RED Method: How To Instrument Your Services [B] - Tom Wilkie, Kausal: https://www.youtube.com/watch?v=TJLpYXbnfQ4
I highly recommend watching the above-mentioned video
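The RED method boils down to recording, per service: Rate (requests), Errors, and Duration. One common way to apply it is a wrapper around each request handler; a minimal in-process sketch (the `RED` store and the `checkout` handler are hypothetical, not from Tom Wilkie's talk):

```python
import time
from collections import defaultdict

# In-process RED counters: Rate, Errors, Duration, keyed by endpoint.
RED = defaultdict(lambda: {"requests": 0, "errors": 0, "duration_s": 0.0})

def red_instrumented(name):
    """Wrap a handler so every call updates its RED metrics."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                RED[name]["errors"] += 1       # E: failed requests
                raise
            finally:
                RED[name]["requests"] += 1     # R: request rate numerator
                RED[name]["duration_s"] += time.perf_counter() - start  # D
        return inner
    return wrap

@red_instrumented("checkout")
def checkout(ok=True):
    if not ok:
        raise ValueError("payment failed")
    return "done"

checkout()
try:
    checkout(ok=False)
except ValueError:
    pass
```

In a real system these three values per endpoint would be exported as time series (e.g., scraped by Prometheus) rather than kept in a dict, but the shape of the data is the same.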
https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
By and large, the biggest advantage of metrics based monitoring over logs is the fact that unlike log generation and storage, metrics transfer and storage has a constant overhead. Unlike logs, the cost of metrics doesn’t increase in lockstep with user traffic or any other system activity that could result in a sharp uptick in Observability data.
What this means is that with metrics, an increase in traffic to an application will not incur a significant increase in disk utilization, processing complexity, speed of visualization, and operational costs the way logs do. Metrics storage increases with the number of time series being captured (when more hosts/containers are spun up, or when new services get added or when existing services are instrumented more), but unlike statsd clients that send a UDP packet every time a metric is recorded to the statsd daemon (resulting in a directly proportional increase in the number of metrics being submitted to statsd compared to the traffic being reported on!), client libraries of systems like Prometheus aggregate time-series samples in-process and submit them to the Prometheus server upon a successful scrape (which happens once every few seconds and can be configured).
Metrics, once collected, are also more malleable to mathematical and statistical transformations such as sampling, aggregation, summarization, and correlation, which makes them better suited for monitoring and profiling purposes. Metrics are also better suited to trigger alerts, since running queries against an in-memory time-series database is far more efficient than running a query against a distributed system like ELK and then aggregating the results before deciding if an alert needs to be triggered. Of course, systems like Facebook’s Scuba strictly query only in-memory data, but the operational overhead of running a Scuba-esque system, even if it were open-source, isn’t worth the trouble for most.
BlackBox monitoring tells you what is wrong!
WhiteBox monitoring tells you why it is wrong.
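A blackbox check probes the service from the outside, the way a user would, so it can only report whether the endpoint works, not why it fails; the "why" comes from whitebox signals (internal metrics and logs) like the RED counters above. A minimal blackbox probe sketch (the URL is whatever endpoint you care about):

```python
import urllib.request
import urllib.error

def blackbox_probe(url, timeout=5):
    """External HTTP probe: True if the endpoint answers with a 2xx.

    Deliberately knows nothing about the service's internals; a failure
    here says WHAT is broken, and whitebox telemetry explains WHY.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False
```

Tools like the Prometheus blackbox_exporter run essentially this kind of probe on a schedule and expose the result as a metric you can alert on.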
- Monitoring and Observability: https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
- Logs and Metrics: https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
- Black Box vs. White Box Monitoring: What You Need To Know: https://devops.com/black-box-vs-white-box-monitoring-what-you-need-to-know/