dimagi/commcare-hq

[CEP] Support for Prometheus metrics


Abstract
CommCare currently supports sending metrics to Datadog. This proposal outlines the changes required to also expose metrics in a Prometheus-compatible format. Prometheus is an open-source monitoring solution which can be hosted alongside CommCare.

Motivation
The goal is to make it easier for organizations outside of Dimagi to run and support CommCare without depending on paid services. The immediate use case is the ICDS program: as part of the effort to hand over the operations of CommCare to the government, it is desirable to have a self-hosted monitoring solution.

Specification
Some important differences between Datadog and Prometheus:

| Function | Datadog | Prometheus |
| --- | --- | --- |
| Metric collection | Datadog is a push-based system. Agents run on hosts, collate metrics, and push them to the central Datadog API. Custom services can be instrumented to send metrics to StatsD, which is in turn queried by the Datadog agent and forwarded with the other host-level metrics. | Prometheus is primarily a pull-based system. The Prometheus server makes HTTP requests to configured endpoints, from which it scrapes metrics. Prometheus does support pushed metrics for certain use cases, but that is not the primary method of collecting metrics. |
| Metric definition | The Datadog client libraries allow metrics to be defined dynamically via the metric name and a dynamic list of tags. | The Prometheus client libraries require metrics to be defined globally. They also require the metric labels to be defined at creation. |

Instrumentation

Since the Prometheus client library has the more restrictive API, it is recommended that a compatible Python API be created for Datadog so that the two can be used interchangeably. The following example illustrates the potential usage:

# this may be declared at the file level
metric_blobs_added = get_metrics_provider().counter('commcare.blobs.added.count', 'Count of blobs added', tag_names=['type_code'])

metric_blobs_added.tag(type_code=1).inc()

The metrics provider may be interchanged between Datadog and Prometheus based on the configuration values in the system. This will work in a similar fashion to how the BlobDB currently works.
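
To make the idea of an interchangeable provider more concrete, here is a minimal sketch of how the two backends might sit behind one API. It is not the proposed implementation: the class names, the `record` hook, and the `setting` argument are all hypothetical, and neither backend talks to a real service. The Datadog-style provider only formats DogStatsD-style datagrams into a list, and the Prometheus-style provider only accumulates values in memory for a later scrape.

```python
from collections import defaultdict


class _BoundCounter:
    """A counter bound to one concrete set of tag values."""

    def __init__(self, provider, name, tags):
        self._provider = provider
        self._name = name
        self._tags = tags  # ordered tuple of (tag_name, tag_value) pairs

    def inc(self, amount=1):
        self._provider.record(self._name, self._tags, amount)


class Counter:
    """Counter whose tag names are fixed at creation, as Prometheus requires."""

    def __init__(self, provider, name, documentation, tag_names=()):
        self._provider = provider
        self.name = name
        self.documentation = documentation
        self.tag_names = tuple(tag_names)

    def tag(self, **tag_values):
        # Reject tags that were not declared when the counter was created.
        if set(tag_values) != set(self.tag_names):
            raise ValueError(f"{self.name} expects tags {self.tag_names}")
        tags = tuple((t, str(tag_values[t])) for t in self.tag_names)
        return _BoundCounter(self._provider, self.name, tags)


class BaseProvider:
    def counter(self, name, documentation, tag_names=()):
        return Counter(self, name, documentation, tag_names)

    def record(self, name, tags, amount):
        raise NotImplementedError


class DatadogStyleProvider(BaseProvider):
    """Push model: formats DogStatsD-style datagrams (kept in a list here)."""

    def __init__(self):
        self.sent = []

    def record(self, name, tags, amount):
        tag_str = ",".join(f"{k}:{v}" for k, v in tags)
        suffix = f"|#{tag_str}" if tag_str else ""
        self.sent.append(f"{name}:{amount}|c{suffix}")


class PrometheusStyleProvider(BaseProvider):
    """Pull model: accumulates values in memory until a scrape reads them."""

    def __init__(self):
        self.samples = defaultdict(float)

    def record(self, name, tags, amount):
        self.samples[(name, tags)] += amount


# Hypothetical registry keyed by a configuration value.
PROVIDERS = {"datadog": DatadogStyleProvider, "prometheus": PrometheusStyleProvider}


def get_metrics_provider(setting):
    # `setting` stands in for whatever configuration value selects the backend.
    return PROVIDERS[setting]()


# The calling code is identical for both backends.
for setting in ("datadog", "prometheus"):
    provider = get_metrics_provider(setting)
    metric_blobs_added = provider.counter(
        "commcare.blobs.added.count", "Count of blobs added", tag_names=["type_code"]
    )
    metric_blobs_added.tag(type_code=1).inc()
```

Declaring `tag_names` up front costs the Datadog side some flexibility, but it is what lets the same call sites work against a backend that needs its labels fixed at creation.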

Exposing metrics
The metrics provider for Datadog will continue to push metrics to a local StatsD instance.

Exposing metrics to Prometheus will require an additional HTTP endpoint that the Prometheus server can scrape. This endpoint can be secured by blocking access to it at the nginx proxy.
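
As a rough illustration of what such an endpoint would return, the sketch below renders a single counter in the Prometheus text exposition format. The function names and the shape of `samples` are assumptions made for this example. In practice the official Prometheus client library would normally generate this output, and the HTTP view that serves it is left out. Note that Prometheus metric names cannot contain dots, so a Datadog-style name such as `commcare.blobs.added.count` has to be mapped to something like `commcare_blobs_added_count`.

```python
import re


def prometheus_name(name):
    # Prometheus metric names may only contain [a-zA-Z0-9_:], so replace
    # anything else (for example the dots used in Datadog names).
    return re.sub(r"[^a-zA-Z0-9_:]", "_", name)


def escape_label_value(value):
    # Label values escape backslash, double quote, and newline.
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, documentation, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    This structure is hypothetical and chosen only for the example.
    """
    metric = prometheus_name(name)
    lines = [f"# HELP {metric} {documentation}", f"# TYPE {metric} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        braces = "{" + label_str + "}" if label_str else ""
        lines.append(f"{metric}{braces} {float(value)}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "commcare.blobs.added.count",
    "Count of blobs added",
    {(("type_code", "1"),): 2},
)
print(body)
# # HELP commcare_blobs_added_count Count of blobs added
# # TYPE commcare_blobs_added_count counter
# commcare_blobs_added_count{type_code="1"} 2.0
```

A scrape of the endpoint would return a body like this for every registered metric. Nothing in it is authenticated, which is why access is restricted at the proxy.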

Impact on users
None

Impact on hosting
This should not affect any existing hosting setup, but it will give hosters an alternative monitoring solution.

Backwards compatibility
Backwards compatibility with current metrics will be maintained.

Release Timeline
End of Q2 2020.

Open questions and issues
None