Data collection, visualization and analysis are essential to predict, detect and troubleshoot software failures.
Monitoring, alerting, distributed tracing and log aggregation tools collect data from:
- system metrics
- application metrics
- application tracing
- system and application logs
Metrics, application tracing and logging are usually handled with different tools:
- Telegraf, InfluxDB, Grafana and Kapacitor for monitoring and alerting (included anomaly detection) on metrics
- Zipkin/Jaeger for application tracing
- Splunk/Graylog for logs
This makes troubleshooting more difficult.
Design a solution for log aggregation and distributed tracing to centralize the management of log and trace information in a distributed system.
Logs record information about application events: errors, warnings and other meaningful operational events.
Traces record timing data and other information about function and service calls (Service A called endpoint X in Service B successfully and it took T ms).
- Be able to correlate application logs and traces
- Be able to create a map of service calls (direct and indirect) per user request
- Be able to add dashboards to visualize and cross information coming from logs and traces
- Prepare your system to be able to analyse data and predict failures
- Component diagram of the overall proposed solution, describing the technologies chosen for each logical block
- Interaction diagrams between the identified components
- Proof of concept of the proposed solution