/monitoring-uptime

Uptime monitoring configs and description.

Pools uptime monitoring system

Uptime monitoring system contains a two component: Telegraf as agent and Prometheus as server. Agents are placed on servers in three locations: Europe, Asia, USA. Configuratiion of agents contain input plugins for every monitoring pool. Two plugins are used:

  • http_response for site and api endpoints probe;
  • net_response for stratum endpoint tcp probe. For one kind of each metric all input plugins contains a additional tags. This made a grouping metrics by labels and make a common endpoint status. The output uses the worst status of endpoint.

Metrics

I'm showing only a important fields of each metric

  • http_response_content_length{host="proxy-eu",location="cn",pool="btc.com",type="api"}
  • http_response_http_response_code{host="proxy-eu",location="cn",pool="btc.com",type="api"}
  • http_response_response_string_match{host="proxy-eu",location="cn",pool="btc.com",type="api"}
  • http_response_response_time{host="proxy-eu",location="cn",pool="btc.com",type="api"}
  • http_response_result_code{host="proxy-eu",location="cn",pool="btc.com",type="api"}

In consolidated graph metrics using the last metric - http_response_result_code. Her value is equal 0 when isn't access problems, and > 0 in all other cases.

  • net_response_response_time{host="proxy-eu",location="br",pool="nicehash.com",port="3334",server="sha256.br.nicehash.com",type="stratum"}
  • net_response_result_code{host="proxy-eu",location="br",pool="nicehash.com",port="3334",server="sha256.br.nicehash.com",type="stratum"}

In consolidated graph metrics using the last metric - net_response_result_code. Her behaviour similar a http_response_result_code.

Computed metrics

Site and api metrics

In graph using consolidated metrics. Consolidation doing by Prometheus server during a query metrics. These are the requests:

max(http_response_result_code{pool=~".+"}) by (pool, location, type)

We taking a max value of metric from all locations for viewing worstly value.

sum(count_over_time(http_response_result_code{pool =~ ".+", result="success"}[1d])) without (host, instance, job, method, result_type, server, status_code) / ignoring(result) group_left sum(count_over_time(http_response_result_code{pool =~ ".+"}[1d])) without (host, instance, job, method, result, result_type, server, status_code) * 100

Calculating a uptime percent by last 24 hours.

sum(count_over_time(http_response_result_code{pool =~ ".+", result="success"}[1d] offset 50000s)) without (host, instance, job, method, result_type, server, status_code) / ignoring(result) group_left sum(count_over_time(http_response_result_code{pool =~ ".+"}[1d] offset 50000s)) without (host, instance, job, method, result, result_type, server, status_code) * 100

Calculating a uptime percent from day start. Start position set as seconds offset in two places of query.

Stratum metrics

Stratum metrics computed similar a http metrics.

max(net_response_result_code{pool=~".+"}) by (pool, location, type)

sum(count_over_time(net_response_result_code{type="stratum",result="success"}[1d])) without (port, server, host, instance, protocol, job, type) / ignoring(result) group_left sum(count_over_time(net_response_result_code{type="stratum"}[1d])) without (port, server, host, instance, protocol, job, type, result) * 100

sum(count_over_time(net_response_result_code{type="stratum",result="success"}[1d] offset 50000s)) without (port, server, host, instance, protocol, job, type) / ignoring(result) group_left sum(count_over_time(net_response_result_code{type="stratum"}[1d] offset 50000s)) without (port, server, host, instance, protocol, job, type, result) * 100