KTH/devops-course

Monitoring, tracing, observability in DevOps

monperrus opened this issue · 70 comments

See also icinga (thanks to @henriklb for the suggestion)

We've found Istio ( https://istio.io/ ) to be increasingly useful in this context. KubeSpy ( https://github.com/pulumi/kubespy ) is an excellent tool for troubleshooting and diagnosing Kubernetes deployments.

lsc commented

+1 for Prometheus

Sentry for Error Reporting. https://sentry.io/welcome/

See also Runtime application self-protection #18 (comment)

Analytics

Tools and Benchmarks for Automated Log Parsing.
http://arxiv.org/abs/1811.03509
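
To give a feel for what "log parsing" means here, a minimal sketch in Python (standard library only). It recovers a template from raw log lines by masking tokens that look variable. The regexes, the `<*>` placeholder and the function names are my own illustrative assumptions; the parsers benchmarked in the paper are considerably more sophisticated.

```python
import re
from collections import Counter

# Illustrative patterns for the variable parts of a log line (assumptions,
# not taken from the paper). Order matters: specific patterns come first.
_VARIABLE_PATTERNS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),  # IPv4 address, optional port
    re.compile(r"\b0x[0-9a-fA-F]+\b"),                     # hex literal
    re.compile(r"\b\d+\b"),                                # plain integer
]


def to_template(line: str) -> str:
    """Replace variable-looking tokens with the <*> placeholder."""
    for pattern in _VARIABLE_PATTERNS:
        line = pattern.sub("<*>", line)
    return line


def count_templates(lines):
    """Group log lines by their recovered template and count each group."""
    return Counter(to_template(line) for line in lines)


sample_logs = [
    "Connection from 10.0.0.1:8080 closed after 35 ms",
    "Connection from 192.168.1.7:443 closed after 120 ms",
    "Cache miss for key 0x1F",
]
template_counts = count_templates(sample_logs)
```

Both "Connection" lines collapse to the same template, which is what makes counting and anomaly detection on logs tractable.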

Does the Fault Reside in a Stack Trace? Assisting Crash Localization by Predicting Crashing Fault Residence
https://www.sciencedirect.com/science/article/pii/S0164121218302401

Having good dashboards is essential in DevOps, see Kibana, etc.

JVM Profiler Sending Metrics to Kafka (https://kafka.apache.org/), Console Output or Custom Reporter
https://github.com/uber-common/jvm-profiler

Time-series database to store monitoring data
https://en.wikipedia.org/wiki/Time_series_database
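
For readers new to the idea, a rough sketch of the two operations a time-series store for monitoring data typically needs: appending timestamped samples and querying a time range, optionally downsampled. `TinySeries` and its methods are hypothetical names for illustration; real time-series databases add compression, persistence, labels and much more.

```python
import bisect
from collections import defaultdict


class TinySeries:
    """In-memory, append-only series of (timestamp, value) samples (sketch)."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, timestamp, value):
        # Keeping samples ordered lets range queries use binary search.
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("samples must arrive in time order")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def range(self, start, end):
        """Return all samples with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))

    def downsample_mean(self, start, end, step):
        """Average the samples in [start, end] over buckets of width `step`."""
        buckets = defaultdict(list)
        for ts, value in self.range(start, end):
            bucket_start = start + ((ts - start) // step) * step
            buckets[bucket_start].append(value)
        return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


cpu = TinySeries()
for ts, value in [(0, 1.0), (5, 3.0), (10, 5.0), (15, 7.0)]:
    cpu.append(ts, value)
```

Downsampling is what keeps dashboards responsive when the raw series holds far more points than a chart can display.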

Prometheus - Monitoring system & time series database
https://prometheus.io/

Netflix Zuul is a gateway service that provides dynamic routing, monitoring, resiliency, security, and more.
https://github.com/Netflix/zuul

Sensu is a free and open source monitoring tool that handles cloud environments. Sensu allows you to monitor servers, services, application health, and business KPIs.
https://xebialabs.com/technology/sensu/

Provenance analysis tools

Framework for instruction-level tracing and analysis of program executions
http://static.usenix.org/event/vee06/full_papers/p154-bhansali.pdf

Dapper, a large-scale distributed systems tracing infrastructure at Google
http://research.google.com/pubs/pub36356.html

Humio: All of your data: logs, metrics, traces. Search, analyze and visualize instantly. Live system observability.
https://humio.com/

The OpenTracing project
https://opentracing.io/
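
A small sketch of the core data model shared by Dapper-style tracing systems: a trace is a tree of spans, each span carrying a trace id, its own id and its parent's id. This is not the OpenTracing API; `SketchTracer` and `Span` are hypothetical names, and the example only works in a single thread of a single process. Real tracers also propagate the context across process boundaries.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique per span
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float
    end: Optional[float] = None


class SketchTracer:
    """Single-threaded tracer that records parent/child span relations."""

    def __init__(self):
        self.finished: List[Span] = []
        self._stack: List[Span] = []  # currently active spans, innermost last

    @contextmanager
    def start_span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
            start=time.monotonic(),
        )
        self._stack.append(span)
        try:
            yield span
        finally:
            # Record the end time even if the traced code raised.
            span.end = time.monotonic()
            self._stack.pop()
            self.finished.append(span)


tracer = SketchTracer()
with tracer.start_span("handle_request") as root:
    with tracer.start_span("query_db") as child:
        pass
```

The child span finishes before its parent, so `tracer.finished` lists spans innermost first; the parent ids are what allow a backend to rebuild the tree.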

Papers:

  • Stardust: tracking activity in a distributed storage system 2006
  • X-trace: A pervasive network tracing framework 2007
  • Fay: extensible distributed tracing from kernels to clusters 2012
  • So, you want to trace your distributed system? Key design insights from years of practical experience 2014
  • Pivot tracing: Dynamic causal monitoring for distributed systems 2015

I cannot recommend Ben Sigelman enough

https://www.infoq.com/presentations/google-microservices

Ex-Google; he founded his company based on what he learned there.
A must-watch.

Honeycomb is a tool for introspecting and interrogating your production systems.
https://www.honeycomb.io/

LightStep answers questions and diagnoses anomalies at scale, spanning mobile, monoliths, and microservices
https://lightstep.com/

Article: New distributed tracing API completes the feedback loop
https://www.theserverside.com/feature/New-distributed-tracing-API-completes-the-feedback-loop

Flame graphs and perf-top for JVMs inside Docker containers
http://www.batey.info/docker-jvm-flamegraphs.html

Synthetic Kubernetes cluster monitoring with Kuberhealthy
https://opensource.com/article/19/4/kuberhealthy

Kiali project, observability for the Istio service mesh (thx @DokID)
https://github.com/kiali/kiali

OpenMetrics: transmitting metrics at scale
https://openmetrics.io/

Learning Chaos Engineering and Chaos toolkit on katacoda: https://www.katacoda.com/chaostoolkit

Contemporary Software Monitoring: A Systematic Literature Review
https://arxiv.org/abs/1912.05878

A curated list of Chaos Engineering resources.
https://github.com/dastergon/awesome-chaos-engineering/
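
As a toy illustration of the fault-injection idea behind chaos engineering, a sketch of a decorator that makes a call fail with a configurable probability. The names `inject_faults` and `InjectedFault` are hypothetical, and this has nothing to do with how Chaos Toolkit or any tool in that list works internally; those operate on infrastructure, not on function calls.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """Raised in place of the real call when a fault is injected."""


def inject_faults(failure_rate: float, rng: random.Random):
    """Decorator factory: fail each call with probability `failure_rate`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
            # and a rate of 0.0 never does.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def call_outcome(func) -> str:
    """Call `func` and report whether it succeeded or hit an injected fault."""
    try:
        func()
        return "ok"
    except InjectedFault:
        return "fault"


@inject_faults(failure_rate=1.0, rng=random.Random(0))
def always_failing_lookup():
    return "value"


@inject_faults(failure_rate=0.0, rng=random.Random(0))
def never_failing_lookup():
    return "value"
```

Passing a seeded `random.Random` keeps an experiment reproducible, which matters when you want to compare system behaviour before and after a fix.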

Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.

https://www.gartner.com/smarterwithgartner/the-io-leaders-guide-to-chaos-engineering/

Contemporary Software Monitoring: A Systematic Mapping Study.
http://arxiv.org/pdf/1912.05878

Cilium - eBPF-based Networking, Observability, and Security
Cilium's control plane is highly optimized, running in Kubernetes clusters of up to 5K nodes and 100K pods.
https://cilium.io/

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. Can be used for monitoring events. Can be bridged with MQTT.
https://aws.amazon.com/kinesis/data-streams/

Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics.

Can be used to feed Prometheus.

https://micrometer.io/

Prometheus client libraries (including both official ones and many third-party ones) can be found here: https://prometheus.io/docs/instrumenting/clientlibs/
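
To show what those client libraries ultimately produce, a hand-rolled sketch of a counter rendered in the Prometheus text exposition format (`# HELP`, `# TYPE`, then one sample per label set). `SketchCounter` is a hypothetical class written for illustration and is not the API of any official client library; it also skips label-value escaping, which a real client must handle.

```python
class SketchCounter:
    """Monotonically increasing counter with optional labels (sketch)."""

    def __init__(self, name: str, help_text: str):
        self.name = name
        self.help_text = help_text
        self._values = {}  # sorted tuple of (label, value) pairs -> float

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Render the counter in the text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


requests_total = SketchCounter("http_requests_total", "Total HTTP requests.")
requests_total.inc(method="get")
requests_total.inc(method="get")
requests_total.inc(method="post")
exposition = requests_total.render()
```

In practice you would serve this text on a `/metrics` endpoint and let Prometheus scrape it; the official libraries do that for you.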

Paper: "Enjoy your observability: an industrial survey of microservice tracing and analysis" http://link.springer.com/10.1007/s10664-021-10063-9

Sampler is a tool for shell command execution, visualization and alerting.
Configured with a simple YAML file.
https://sampler.dev/

Stagemonitor is a Java monitoring agent that tightly integrates with time series databases like Elasticsearch, Graphite and InfluxDB to analyze graphed metrics, and with Kibana to analyze requests and call stacks.

https://github.com/stagemonitor/stagemonitor

cc/ @gluckzhang

Zabbix is an open source monitoring solution for network monitoring and application monitoring of millions of metrics.
https://www.zabbix.com/

strace is a diagnostic, debugging and instructional userspace utility for Linux. It is used to monitor and tamper with interactions between processes and the Linux kernel, which include system calls, signal deliveries, and changes of process state.
https://strace.io/

Let's Trace It: Fine-Grained Serverless Benchmarking using Synchronous and Asynchronous Orchestrated Applications
https://arxiv.org/pdf/2205.07696.pdf

Open Tracing Tools: Overview and Critical Comparison
https://arxiv.org/pdf/2207.06875.pdf

Lessons Learned Building a Global Synthetic Monitoring System
Talk at SREcon
https://www.usenix.org/conference/srecon22apac/presentation/sidh

A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
https://github.com/upgundecha/howtheysre