Monitoring, tracing, observability in DevOps
monperrus opened this issue · 70 comments
- https://en.wikipedia.org/wiki/Tracing_(software)
- https://en.wikipedia.org/wiki/System_monitoring
- https://en.wikipedia.org/wiki/Network_monitoring
- https://en.wikipedia.org/wiki/Crash_reporter
- https://en.wikipedia.org/wiki/Application_performance_management
- https://en.wikipedia.org/wiki/Website_monitoring
- https://en.wikipedia.org/wiki/Provenance#Computer_science
- https://en.wikipedia.org/wiki/Log_analysis
We've found Istio (https://istio.io/) to be increasingly useful in this context. KubeSpy (https://github.com/pulumi/kubespy) is an excellent tool for troubleshooting and diagnosing Kubernetes deployments.
- Prometheus
- Sensu https://sensu.io/
- Zipkin
- the ELK stack.
+1 for Prometheus
Sentry for Error Reporting. https://sentry.io/welcome/
See also Runtime application self-protection #18 (comment)
Analytics
Tools and Benchmarks for Automated Log Parsing.
http://arxiv.org/abs/1811.03509
Does the Fault Reside in a Stack Trace? Assisting Crash Localization by Predicting Crashing Fault Residence
https://www.sciencedirect.com/science/article/pii/S0164121218302401
Having good dashboards is essential in DevOps, see Kibana, etc.
Made in Alibaba: https://github.com/alibaba/Sentinel
JVM Profiler Sending Metrics to Kafka (https://kafka.apache.org/), Console Output or Custom Reporter
https://github.com/uber-common/jvm-profiler
Time-series database to store monitoring data
https://en.wikipedia.org/wiki/Time_series_database
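To make the idea concrete, here is a minimal sketch of what a time-series store does with monitoring samples: it keeps timestamped values per series and answers range queries. The class and method names (`TinyTimeSeriesStore`, `append`, `query_range`, `average`) are hypothetical and only for illustration; real systems add compression, retention, and persistence on top.

```python
import bisect
from collections import defaultdict


class TinyTimeSeriesStore:
    """Hypothetical in-memory store: one sorted list of (timestamp, value) per series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, timestamp, value):
        # insort keeps samples ordered by timestamp even if they arrive out of order.
        bisect.insort(self._series[name], (timestamp, value))

    def query_range(self, name, start, end):
        """Return all samples with start <= timestamp <= end."""
        samples = self._series.get(name, [])
        lo = bisect.bisect_left(samples, (start, float("-inf")))
        hi = bisect.bisect_right(samples, (end, float("inf")))
        return samples[lo:hi]

    def average(self, name, start, end):
        """Average of the values in the range, or None if the range is empty."""
        values = [value for _, value in self.query_range(name, start, end)]
        return sum(values) / len(values) if values else None


store = TinyTimeSeriesStore()
store.append("cpu_percent", 30, 30.0)
store.append("cpu_percent", 10, 10.0)  # out-of-order arrival
store.append("cpu_percent", 20, 20.0)
print(store.query_range("cpu_percent", 10, 20))
print(store.average("cpu_percent", 0, 100))
```

Keeping each series sorted by timestamp is what makes range queries cheap, which is the access pattern dashboards and alerting rules rely on.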
Prometheus - Monitoring system & time series database
https://prometheus.io/
Netflix Zuul is a gateway service that provides dynamic routing, monitoring, resiliency, security, and more.
https://github.com/Netflix/zuul
OpenTracing
https://opentracing.io/
Sensu is a free and open source monitoring tool that handles cloud environments. Sensu allows you to monitor servers, services, application health, and business KPIs.
https://xebialabs.com/technology/sensu/
Provenance analysis tools
- SPADE : https://github.com/ashish-gehani/spade
- Camflow : http://camflow.org/
Framework for instruction-level tracing and analysis of program executions
http://static.usenix.org/event/vee06/full_papers/p154-bhansali.pdf
DevOps Metrics
https://queue.acm.org/detail.cfm?id=3182626
Dapper, a large-scale distributed systems tracing infrastructure at Google
http://research.google.com/pubs/pub36356.html
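For readers new to distributed tracing, the core data model in Dapper-style systems is a tree of spans that share a trace id, where each span records its parent. The sketch below illustrates that model only, under simplifying assumptions: it is single-process and single-threaded, and `ToyTracer` and `Span` are hypothetical names, not the API of Dapper, Zipkin, or OpenTracing.

```python
import contextlib
import time
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique per span
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float = 0.0
    end: float = 0.0


class ToyTracer:
    """Hypothetical tracer: nests spans using a simple in-process stack."""

    def __init__(self):
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextlib.contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        current = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
            start=time.monotonic(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            # Record the end time and report the span even if the body raised.
            current.end = time.monotonic()
            self._stack.pop()
            self.finished.append(current)


tracer = ToyTracer()
with tracer.span("http_request") as root:
    with tracer.span("db_query") as child:
        pass
print(child.parent_id == root.span_id, child.trace_id == root.trace_id)
```

A real tracer additionally propagates the trace id and parent span id across process boundaries (for example in request headers) and samples traces to limit overhead.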
Chaos Engineering & Observability
https://www.infoq.com/news/2019/03/chaos-engineering-observability
Humio: All of your data: logs, metrics, traces. Search, analyze and visualize instantly. Live system observability.
https://humio.com/
Papers:
- Stardust: tracking activity in a distributed storage system 2006
- X-trace: A pervasive network tracing framework 2007
- Fay: extensible distributed tracing from kernels to clusters 2012
- So, you want to trace your distributed system? Key design insights from years of practical experience 2014
- Pivot tracing: Dynamic causal monitoring for distributed systems 2015
I cannot recommend Ben Sigelman enough
https://www.infoq.com/presentations/google-microservices
Ex-Google; he founded his company based on what he learned there.
Must watch
Honeycomb is a tool for introspecting and interrogating your production systems.
https://www.honeycomb.io/
LightStep answers questions and diagnoses anomalies at scale, spanning mobile, monoliths, and microservices
https://lightstep.com/
Datadog: https://www.datadoghq.com/
Article: New distributed tracing API completes the feedback loop
https://www.theserverside.com/feature/New-distributed-tracing-API-completes-the-feedback-loop
Flame graphs and perf-top for JVMs inside Docker containers
http://www.batey.info/docker-jvm-flamegraphs.html
Synthetic Kubernetes cluster monitoring with Kuberhealthy
https://opensource.com/article/19/4/kuberhealthy
Course notes on monitoring: https://www.monperrus.net/martin/monitoring.pdf
Kiali project, observability for the Istio service mesh (thx @DokID)
https://github.com/kiali/kiali
transmitting metrics at scale
https://openmetrics.io/
Learning Chaos Engineering and Chaos toolkit on katacoda: https://www.katacoda.com/chaostoolkit
Contemporary Software Monitoring: A Systematic Literature Review
https://arxiv.org/abs/1912.05878
A curated list of Chaos Engineering resources.
https://github.com/dastergon/awesome-chaos-engineering/
Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.
https://www.gartner.com/smarterwithgartner/the-io-leaders-guide-to-chaos-engineering/
Contemporary Software Monitoring: A Systematic Mapping Study.
http://arxiv.org/pdf/1912.05878
Cilium - eBPF-based Networking, Observability, and Security
Cilium's control plane is highly optimized, running in Kubernetes clusters of up to 5K nodes and 100K pods.
https://cilium.io/
Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. Can be used for monitoring events. Can be bridged with MQTT.
https://aws.amazon.com/kinesis/data-streams/
Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics.
Can be used to feed Prometheus.
Prometheus client libraries (including both official ones and many third-party ones) can be found here: https://prometheus.io/docs/instrumenting/clientlibs/
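As a rough illustration of what such instrumentation produces, here is a sketch of a counter that renders itself in the Prometheus text exposition format (`# HELP`, `# TYPE`, then one sample per label set). `ToyCounter` is a hypothetical stand-in written with the standard library only; it is not the Micrometer or Prometheus client API, and it skips label-value escaping. For real services, use one of the client libraries linked above.

```python
class ToyCounter:
    """Hypothetical counter that can render Prometheus-style text output."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        # Maps a sorted tuple of (label, value) pairs to the running total.
        self._values = {}

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted((k, str(v)) for k, v in labels.items()))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            # Assumption: label values contain no quotes or backslashes,
            # so no escaping is done in this sketch.
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            selector = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{selector} {value}")
        return "\n".join(lines) + "\n"


requests_total = ToyCounter("http_requests_total", "Total HTTP requests.")
requests_total.inc(method="get")
requests_total.inc(method="get")
requests_total.inc(method="post")
print(requests_total.render(), end="")
```

In a typical setup the application serves this text on an HTTP endpoint, and the Prometheus server scrapes it at a fixed interval.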
Paper: "Enjoy your observability: an industrial survey of microservice tracing and analysis" http://link.springer.com/10.1007/s10664-021-10063-9
Sampler is a tool for shell commands execution, visualization and alerting.
Configured with a simple YAML file.
https://sampler.dev/
Stagemonitor is a Java monitoring agent that tightly integrates with time series databases like Elasticsearch, Graphite and InfluxDB to analyze graphed metrics and Kibana to analyze requests and call stacks
https://github.com/stagemonitor/stagemonitor
cc/ @gluckzhang
Trace Server Protocol
https://github.com/eclipse-cdt-cloud/trace-server-protocol
Zabbix is an open source monitoring solution for network monitoring and application monitoring of millions of metrics.
https://www.zabbix.com/
strace is a diagnostic, debugging and instructional userspace utility for Linux. It is used to monitor and tamper with interactions between processes and the Linux kernel, which include system calls, signal deliveries, and changes of process state.
https://strace.io/
Let's Trace It: Fine-Grained Serverless Benchmarking using Synchronous and Asynchronous Orchestrated Applications
https://arxiv.org/pdf/2205.07696.pdf
Open Tracing Tools: Overview and Critical Comparison
https://arxiv.org/pdf/2207.06875.pdf
Reliability Pillar - AWS Well-Architected Framework
https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reliability-pillar/wellarchitected-reliability-pillar.pdf
Lessons Learned Building a Global Synthetic Monitoring System
Talk at SREcon
https://www.usenix.org/conference/srecon22apac/presentation/sidh
Elastic Observability
https://www.elastic.co/observability
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
https://github.com/upgundecha/howtheysre
Observability with GitLab
https://opstrace.com/
https://about.gitlab.com/direction/monitor/observability/