Monitoring, tracing, observability in DevOps

Question

monperrus opened this issue 7 years ago · 70 comments

Analytics

Answer 1 · 2018-08-20T18:53:01.000Z

See also icinga (thanks to @henriklb for the suggestion)

Answer 2 · 2018-09-18T14:46:57.000Z

Log analysis @eclipse https://projects.eclipse.org/projects/tools.tracecompass

Answer 3 · 2018-10-11T08:29:51.000Z

We've found Istio (https://istio.io/) to be increasingly useful in this context. KubeSpy (https://github.com/pulumi/kubespy) is an excellent tool for troubleshooting and diagnosing Kubernetes deployments.

Answer 4 · 2018-10-11T09:41:36.000Z

Prometheus
Sensu https://sensu.io/
Zipkin
the ELK stack.

Answer 5 · 2018-10-11T09:58:00.000Z

+1 for Prometheus

Answer 6 · 2018-10-18T09:25:10.000Z

Sentry for Error Reporting. https://sentry.io/welcome/

Answer 7 · 2018-10-26T09:09:42.000Z

OpenZipkin
Jaeger https://github.com/jaegertracing/jaeger
https://medium.com/@rakyll/cpdd-critical-path-driven-development-6c2592fb8ea4

(from #16 (comment))

Answer 8 · 2018-11-05T14:10:12.000Z

See also Runtime application self-protection #18 (comment)

Answer 9 · 2018-11-12T20:54:37.000Z

Tools and Benchmarks for Automated Log Parsing.
http://arxiv.org/abs/1811.03509

Answer 10 · 2018-11-12T21:10:27.000Z

Does the Fault Reside in a Stack Trace? Assisting Crash Localization by Predicting Crashing Fault Residence
https://www.sciencedirect.com/science/article/pii/S0164121218302401

Answer 11 · 2018-12-10T13:35:51.000Z

Having good dashboards is essential in DevOps, see Kibana, etc.

Answer 12 · 2019-01-22T20:55:38.000Z

Made in Alibaba: https://github.com/alibaba/Sentinel

Answer 13 · 2019-02-22T10:49:51.000Z

JVM Profiler Sending Metrics to Kafka (https://kafka.apache.org/), Console Output or Custom Reporter
https://github.com/uber-common/jvm-profiler

Answer 14 · 2019-03-05T10:30:55.000Z

https://github.com/madflojo/automatron

Answer 15 · 2019-03-05T10:31:02.000Z

https://github.com/apache/incubator-skywalking

Answer 16 · 2019-03-05T10:39:37.000Z

Time-series database to store monitoring data
https://en.wikipedia.org/wiki/Time_series_database

Answer 17 · 2019-03-05T10:39:47.000Z

Prometheus - Monitoring system & time series database
https://prometheus.io/

Answer 18 · 2019-03-05T10:40:36.000Z

Netflix Zuul is a gateway service that provides dynamic routing, monitoring, resiliency, security, and more.
https://github.com/Netflix/zuul

Answer 19 · 2019-03-05T10:41:13.000Z

OpenTracing
https://opentracing.io/

Answer 20 · 2019-03-05T10:41:41.000Z

Nagios
https://en.wikipedia.org/wiki/Nagios

Answer 21 · 2019-03-05T10:42:30.000Z

Sensu is a free and open-source monitoring tool that handles cloud environments. It lets you monitor servers, services, application health, and business KPIs.
https://xebialabs.com/technology/sensu/

Answer 22 · 2019-03-05T10:49:19.000Z

Provenance analysis tools

SPADE : https://github.com/ashish-gehani/spade
Camflow : http://camflow.org/

Answer 23 · 2019-03-07T14:58:30.000Z

Framework for instruction-level tracing and analysis of program executions
http://static.usenix.org/event/vee06/full_papers/p154-bhansali.pdf

Answer 24 · 2019-03-22T09:11:35.000Z

DevOps Metrics
https://queue.acm.org/detail.cfm?id=3182626

Answer 25 · 2019-03-22T09:11:57.000Z

Dapper, a large-scale distributed systems tracing infrastructure at Google
http://research.google.com/pubs/pub36356.html

Answer 26 · 2019-03-29T07:35:58.000Z

Chaos Engineering & Observability
https://www.infoq.com/news/2019/03/chaos-engineering-observability

Answer 27 · 2019-03-29T07:36:37.000Z

Humio: All of your data: logs, metrics, traces. Search, analyze and visualize instantly. Live system observability.
https://humio.com/

Answer 28 · 2019-03-29T07:37:14.000Z

The OpenTracing project
https://opentracing.io/

Answer 29 · 2019-04-05T09:40:34.000Z

Papers:

Stardust: tracking activity in a distributed storage system 2006
X-trace: A pervasive network tracing framework 2007
Fay: extensible distributed tracing from kernels to clusters 2012
So, you want to trace your distributed system? Key design insights from years of practical experience 2014
Pivot tracing: Dynamic causal monitoring for distributed systems 2015

Answer 30 · 2019-04-05T15:28:04.000Z

I cannot recommend Ben Sigelman enough

https://www.infoq.com/presentations/google-microservices

Ex-Google; he founded his company based on what he learned there.
A must-watch.

Answer 31 · 2019-04-06T06:06:27.000Z

Honeycomb is a tool for introspecting and interrogating your production systems.
https://www.honeycomb.io/

Answer 32 · 2019-04-06T06:07:06.000Z

LightStep answers questions and diagnoses anomalies at scale, spanning mobile, monoliths, and microservices
https://lightstep.com/

Answer 33 · 2019-04-08T13:54:15.000Z

Datadog: https://www.datadoghq.com/

Answer 34 · 2019-04-08T14:05:16.000Z

Article: New distributed tracing API completes the feedback loop
https://www.theserverside.com/feature/New-distributed-tracing-API-completes-the-feedback-loop

Answer 35 · 2019-04-10T19:41:21.000Z

Flame graphs and perf-top for JVMs inside Docker containers
http://www.batey.info/docker-jvm-flamegraphs.html

Answer 36 · 2019-05-02T04:09:13.000Z

Synthetic Kubernetes cluster monitoring with Kuberhealthy
https://opensource.com/article/19/4/kuberhealthy

Answer 37 · 2019-05-02T16:33:43.000Z

Course notes on monitoring: https://www.monperrus.net/martin/monitoring.pdf

Answer 38 · 2019-05-22T17:16:11.000Z

Kiali project, observability for the Istio service mesh (thx @DokID)
https://github.com/kiali/kiali

Answer 39 · 2019-12-04T13:03:04.000Z

OpenMetrics: transmitting metrics at scale
https://openmetrics.io/

Answer 40 · 2019-12-05T07:46:25.000Z

Learning Chaos Engineering and Chaos toolkit on katacoda: https://www.katacoda.com/chaostoolkit

Answer 41 · 2019-12-15T17:18:18.000Z

Contemporary Software Monitoring: A Systematic Literature Review
https://arxiv.org/abs/1912.05878

Answer 42 · 2020-03-17T09:50:24.000Z

A curated list of Chaos Engineering resources.
https://github.com/dastergon/awesome-chaos-engineering/

Answer 43 · 2020-03-17T09:51:12.000Z

Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.

https://www.gartner.com/smarterwithgartner/the-io-leaders-guide-to-chaos-engineering/

Answer 44 · 2020-10-28T08:12:51.000Z

Contemporary Software Monitoring: A Systematic Mapping Study.
http://arxiv.org/pdf/1912.05878

Answer 45 · 2021-09-09T11:34:43.000Z

Cilium - eBPF-based Networking, Observability, and Security
Cilium's control plane is highly optimized, running in Kubernetes clusters of up to 5K nodes and 100K pods.
https://cilium.io/

Answer 46 · 2021-10-29T14:11:31.000Z

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. Can be used for monitoring events. Can be bridged with MQTT.
https://aws.amazon.com/kinesis/data-streams/

Answer 47 · 2021-10-29T14:15:45.000Z

Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics.

Can be used to feed Prometheus.

https://micrometer.io/
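
To make the "SLF4J, but for metrics" idea concrete, here is a minimal sketch of the facade pattern in Python. It is not Micrometer's API (Micrometer is a JVM library); every name below (`MetricsBackend`, `InMemoryBackend`, `LineBackend`, `Metrics`, `handle_request`) is hypothetical and only illustrates how application code can stay independent of the monitoring backend.

```python
from typing import Dict, List, Protocol


class MetricsBackend(Protocol):
    """Hypothetical backend interface: accepts a named numeric sample."""

    def record(self, name: str, value: float) -> None: ...


class InMemoryBackend:
    """Keeps running totals per metric name, e.g. for tests."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = {}

    def record(self, name: str, value: float) -> None:
        self.totals[name] = self.totals.get(name, 0.0) + value


class LineBackend:
    """Stores each sample as a 'name value' text line, as a push-style backend might."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def record(self, name: str, value: float) -> None:
        self.lines.append(f"{name} {value}")


class Metrics:
    """The facade: the only metrics type the application code depends on."""

    def __init__(self, backend: MetricsBackend) -> None:
        self._backend = backend

    def increment(self, name: str, amount: float = 1.0) -> None:
        self._backend.record(name, amount)


def handle_request(metrics: Metrics) -> None:
    # Application code never names a concrete backend.
    metrics.increment("requests_handled")
```

Swapping `InMemoryBackend` for `LineBackend` changes where the samples go without touching `handle_request`, which is the property the facade is meant to provide.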

Answer 48 · 2021-11-01T08:55:19.000Z

Prometheus client libraries (including both official ones and many third-party ones) can be found here: https://prometheus.io/docs/instrumenting/clientlibs/
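
As a rough illustration of what such a client library does, here is a standard-library-only sketch of a counter rendered in the Prometheus text exposition format (`# HELP`, `# TYPE`, then the sample line). This is not the official `prometheus_client` API: the `Counter` class, its `expose` method, and the metric name `http_requests_total` are assumptions made for this example, and a real library additionally handles labels, registries, and the HTTP endpoint.

```python
class Counter:
    """Hypothetical minimal counter; real client libraries offer much more."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        # Counters only ever go up, so negative increments are rejected.
        if amount < 0:
            raise ValueError("counters can only be incremented")
        self.value += amount

    def expose(self) -> str:
        # Render the text a Prometheus server would scrape for this metric.
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


requests_total = Counter("http_requests_total", "Total HTTP requests")
requests_total.inc()
requests_total.inc(2)
print(requests_total.expose(), end="")
```

In practice this text would be served from an HTTP endpoint (conventionally `/metrics`) that the Prometheus server scrapes at a configured interval.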

Answer 49 · 2021-12-02T21:56:09.000Z

Paper: "Enjoy your observability: an industrial survey of microservice tracing and analysis" http://link.springer.com/10.1007/s10664-021-10063-9

Answer 50 · 2022-04-04T12:29:03.000Z

FaaSter Troubleshooting - Evaluating Distributed Tracing Approaches for Serverless Applications

Answer 51 · 2022-04-14T15:26:19.000Z

Timeloops: System Call Policy Learning for Containerized Microservices.

Answer 52 · 2022-04-20T04:27:14.000Z

Sampler is a tool for shell command execution, visualization and alerting.
Configured with a simple YAML file.
https://sampler.dev/

Answer 53 · 2022-04-20T14:48:35.000Z

Stagemonitor is a Java monitoring agent that tightly integrates with time series databases like Elasticsearch, Graphite and InfluxDB to analyze graphed metrics and Kibana to analyze requests and call stacks

https://github.com/stagemonitor/stagemonitor

cc/ @gluckzhang

Answer 54 · 2022-04-27T10:16:07.000Z

Trace Server Protocol
https://github.com/eclipse-cdt-cloud/trace-server-protocol

Answer 55 · 2022-05-06T13:26:09.000Z

A sweet feature of Grafana:
https://grafana.com/blog/2021/07/30/how-to-use-grafana-and-prometheus-to-rickroll-your-friends-or-enemies/?src=li&mdm=social

Answer 56 · 2022-05-06T13:38:56.000Z

https://github.com/MacroPower/prometheus_video_renderer

Answer 57 · 2022-05-10T08:04:36.000Z

Zabbix: an open source monitoring solution for network and application monitoring of millions of metrics.
https://www.zabbix.com/

Answer 58 · 2022-05-16T13:43:04.000Z

strace is a diagnostic, debugging and instructional userspace utility for Linux. It is used to monitor and tamper with interactions between processes and the Linux kernel, which include system calls, signal deliveries, and changes of process state.
https://strace.io/

Answer 59 · 2022-05-17T08:48:49.000Z

Spring Metrics
https://docs.spring.io/spring-metrics/docs/current/public/prometheus

Answer 60 · 2022-05-21T08:15:43.000Z

Let's Trace It: Fine-Grained Serverless Benchmarking using Synchronous and Asynchronous Orchestrated Applications
https://arxiv.org/pdf/2205.07696.pdf

Answer 61 · 2022-08-13T10:04:00.000Z

Open Tracing Tools: Overview and Critical Comparison
https://arxiv.org/pdf/2207.06875.pdf

Answer 62 · 2023-03-02T08:12:26.000Z

Reliability Pillar - AWS Well-Architected Framework
https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reliability-pillar/wellarchitected-reliability-pillar.pdf

Answer 63 · 2023-04-14T16:11:41.000Z

Towards Solving the Challenge of Minimal Overhead Monitoring. (arXiv:2304.05688v1 [cs.SE])

Answer 64 · 2023-04-24T07:43:07.000Z

Lessons Learned Building a Global Synthetic Monitoring System
Talk at SREcon
https://www.usenix.org/conference/srecon22apac/presentation/sidh

Answer 65 · 2023-04-24T08:32:21.000Z

Elastic Observability
https://www.elastic.co/observability

Answer 66 · 2023-04-24T09:15:19.000Z

Dynatrace
https://www.dynatrace.com/solutions/application-monitoring/

Answer 67 · 2023-05-08T04:55:32.000Z

CausalRCA: Causal inference based precise fine-grained root cause localization for microservice applications (JSS 2023)

Answer 68 · 2023-05-21T09:28:32.000Z

A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
https://github.com/upgundecha/howtheysre

Answer 69 · 2023-05-22T04:16:39.000Z

Observability with GitLab
https://opstrace.com/
https://about.gitlab.com/direction/monitor/observability/