DarkRiftNetworking/DarkRift

Prometheus stops receiving metrics from darkrift after some time

Shii2 opened this issue · 5 comments

Shii2 commented

Quick Description

After the server has been running for 5-15 minutes, Prometheus starts complaining about invalid timestamps and stops updating the metrics.

Explanation

It seems that if no one joins the server for a while, DarkRift stops updating the timestamps in the exported metrics data, which eventually prevents Prometheus from ingesting it. To fix that, I added a direct GetTimestamp() call to the WriteMetric calls at lines 217, 228 and 238 of the PrometheusEndpoint file (DR 2.9.1 Pro).
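For illustration, the symptom in the scraped /metrics output looks roughly like this (the metric name below is made up; DarkRift's actual metric names and values will differ). In the Prometheus text exposition format a sample line may end with an optional timestamp in milliseconds since the Unix epoch, and if that trailing value stops advancing, Prometheus eventually rejects the sample as too old:

  # HELP darkrift_clients_connected Number of connected clients (hypothetical name)
  # TYPE darkrift_clients_connected gauge
  darkrift_clients_connected 1 1622742420000

The change described above would keep that trailing value moving forward by stamping the current time on each write.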

I just tried to reproduce this, but I don't get the errors you're talking about here. Can you attach them?

From what I understand about Prometheus, the timestamps aren't supposed to be the time at which you scrape but the time at which the metric was last updated, e.g. the time a client last connected/disconnected, so I don't think your solution here is correct.

Prometheus does stop showing a metric if it hasn't been updated in 5 minutes. I can't really find anything on what you're supposed to do if your data is sparser than that; perhaps other client libraries periodically update the timestamp (which doesn't seem likely, as then you could never enter a "no data" state), or perhaps DR should just emit it more frequently.

Shii2 commented

I was getting errors like this in the Prometheus console every few seconds:
Error on ingesting samples that are too old or are too far into the future" num_dropped=3

Shii2 commented

To reproduce this I used DarkRift 2.9.1 Pro and prometheus-2.25.0.windows-amd64.
In the Prometheus config file prometheus.yml I added the following lines to the end:

  - job_name:       'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9796']
        labels:
          group: 'canary'

In the DarkRift config I have the following:

  <metrics enablePerMessageMetrics="false">
    <metricsWriter type="PrometheusEndpoint">
      <settings
        host="localhost"
        path="/metrics"
        port="9796" />
    </metricsWriter>
  </metrics>

Launch the DarkRift server and the Prometheus server, connect to the DarkRift server with at least one client, and leave them running. After some time, about 5-20 minutes, the Prometheus server console should start spamming errors like this: Error on ingesting samples that are too old or are too far into the future" num_dropped=3
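For anyone wanting to confirm the stale timestamps directly, here is a rough sketch (not part of DarkRift; it just assumes the endpoint configured above, http://localhost:9796/metrics) that scrapes the page twice and reports samples whose trailing timestamp does not advance:

# stale_timestamp_check.py - sketch: scrape the metrics endpoint twice and
# report samples whose trailing timestamp did not change between scrapes.
import time
import urllib.request

URL = "http://localhost:9796/metrics"  # matches the DarkRift config above

def sample_timestamps(url):
    body = urllib.request.urlopen(url).read().decode("utf-8")
    stamps = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        parts = line.split()
        # Exposition format: <name>{labels} <value> [<timestamp in ms>]
        if len(parts) >= 3:
            stamps[parts[0]] = parts[-1]
    return stamps

first = sample_timestamps(URL)
time.sleep(10)
second = sample_timestamps(URL)
for name, ts in first.items():
    if second.get(name) == ts:
        print(f"{name}: timestamp did not advance ({ts})")

If the same millisecond value keeps coming back scrape after scrape, Prometheus will start dropping those samples once they fall too far behind its own clock.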

Shii2 commented

Still the same issue in DarkRift Pro 2.10.0; it started about 58 minutes after launching the server.
Using prometheus-2.27.1.windows-amd64 installed on the same machine as the DarkRift server.

level=info ts=2021-06-03T17:47:00.800Z caller=main.go:957 msg="Loading configuration file" filename=prometheus.yml
level=info ts=2021-06-03T17:47:00.803Z caller=main.go:988 msg="Completed loading of configuration file" filename=prometheus.yml totalDuration=2.9934ms remote_storage=0s web_handler=0s query_engine=0s scrape=1.9951ms scrape_sd=0s notify=0s notify_sd=0s rules=0s
level=info ts=2021-06-03T17:47:00.803Z caller=main.go:775 msg="Server is ready to receive web requests."
level=warn ts=2021-06-03T18:45:11.645Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:16.645Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:21.646Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:26.645Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:31.645Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:36.645Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:41.646Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:46.645Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:51.645Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-06-03T18:45:56.645Z caller=scrape.go:1473 component="scrape manager" scrape_pool=node target=http://localhost:9796/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3

Prometheus config:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name:       'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9796']
        labels:
          group: 'canary'

OK, I still couldn't reproduce this, but I did some reading and this definitely looks like a mistake in the DR Prometheus implementation!

According to this group, most applications shouldn't be attaching a timestamp at all, and the scrape time should be sufficient.
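For reference, not attaching a timestamp just means exposing the bare sample line (hypothetical metric name again); Prometheus then records the value at its own scrape time:

  darkrift_clients_connected 1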

The timestamps have been removed in the next version!