timescale/prometheus-postgresql-adapter

prometheus_postgresql_adapter doesn't stop writing in HA

virasatkhan opened this issue · 6 comments

I am running 2 identical prometheus containers talking to their respective prometheus.yml files.

adapter1 is running a timescale/prometheus-postgresql-adapter:latest \ on a port (:xxxx) with lockid=1 and locktimeout=25s
adapter2 is running a timescale/prometheus-postgresql-adapter:latest \ on a port (:yyyy)

prometheus1 is configured to read/write to adapter1
prometheus2 is configured to read/write to adapter2

Both the prometheus instances have scraping intervals of 20s and that is the reason I kept the locktimeout for adapters as 25s

To maintain HA, both adapters are writing to DB based on lock acquired.

Let's say at this given point 'adapter1' has the lock and activley writing to DB. While adapter2 is paused as it cannot become a leader.

Problem I am facing is.. when I turn off prometheus1.. i am expecting the adapter2 to pick the lock and start writing. But to my surprise, adapter1 continues to write (as per my docker logs ) command.

Is there anything I am missing?

NOTE: My prometheus instances (has 20s scraping interval) and are federated to scrap specific job out of master prometheus (which has scraping interval of 1m). Hardly matters right, I am anyways going to scrap duplicate into my instance. But the question is why is my adapter2 not able to acquire lock and still shows below message.

level=debug ts=2019-05-02T15:45:03.960924634Z caller=log.go:21 msg="Scheduled election is paused. Instance can't become a leader until scheduled election is resumed (Prometheus comes up again)"

Hey @virasatkhan,

what I find confusing here is that you shut down prometheus1 but adapter1 keeps writing. I wonder what is it writing if it gets no data from prometheus? Is there maybe another prometheus instance configured to write to the same adapter?

The log you've posted from adapter2 is basically telling that it can't become a leader because the adapter is not getting any data from promethues - so apparently prometheus2 is down somehow.

Would be great if you can share logs from both adapters from the moment you shutdown prometheus1.

Let me try and re-create this condition to comment on that in much more details and get you some logs too...
A Quick Question:: Will my pg_adapter be writing some WAL's ? The reason I ask is, if due to some reason, by postgreSQL DB is down, I was expecting my pg_adapter to write some WAL files on local node where it is running and then continue writing to DB when the DB is up. However, recently when my DB was down, I see huge dent in my grafana data that I am using for prometheus to read from that DB. That dent in the continuety was throughout the time my DB was down, so basically I have no metric data for that period when the DB was down.

Hi,

The version of adapter we're using is the latest one, and we are facing the same problem.

We set up two adapters in two machines, and according the reference link: https://blog.timescale.com/prometheus-ha-postgresql-8de68d19b6f5/, we set advisory-lock-id to the same value of 1321, and leader timeout to 16s, while the scraping interval is 15s.

The strange thing is that, we started up two prometheus and one each pointed to one of the two adapters, and, the logs showing up that the all two adapters can write metrics into PG. Here are the screenshots for log files:

屏幕快照 2019-08-09 下午5 21 41

屏幕快照 2019-08-09 下午5 21 50

As I understand it, only one adapter accepts write quests, and there other not. So, any comments here for this condition?

Hey @niksajakovljevic

Same issue i'm seeing.It is really easy to replicate.Just configure 2 adapters with the same leader-election-pg-advisory-lock-id and point to the same Postgresql database.When you start it up it will work as expected you will see one node writes samples and other does not.
Now just restart your database.Both of the nodes start to write at the same time.

Somehow adpater became leader altho there was database connection issue.

Aug 22 14:20:29 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-22T14:20:29.175Z"}
Aug 22 14:20:29 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-22T14:20:29.755Z"}
Aug 22 14:20:29 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-22T14:20:29.755Z"}
Aug 22 14:20:29 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-22T14:20:29.755Z"}
Aug 22 14:20:30 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-22T14:20:30.328Z"}
Aug 22 14:20:30 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-22T14:20:30.328Z"}
Aug 22 14:20:30 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-22T14:20:30.328Z"}
Aug 22 14:20:30 xxxxxxxxxxx prometheus: ts=2019-08-22T14:20:30.736Z caller=dedupe.go:111 component=remote level=info queue=0:http://localhost:8089/write msg="Remote storage resharding" from=1 to=3
Aug 22 14:20:30 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-22T14:20:30.891Z"}
Aug 22 14:20:30 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-22T14:20:30.891Z"}
Aug 22 14:20:31 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:31","groupId":"10","level":"info","msg":"Instance became a leader","ts":"2019-08-22T14:20:31.466Z"}
Aug 22 14:20:32 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Error on Begin when writing samples","ts":"2019-08-22T14:20:32.042Z"}
Aug 22 14:20:32 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:35","err":"dial tcp xxxxxxxx:5432: connect: connection refused","level":"warn","msg":"Error sending samples to remote storage","num_samples":10,"storage":"PostgreSQL","ts":"2019-08-22T14:20:32.042Z"}
Aug 22 14:20:32 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-22T14:20:32.042Z"}
Aug 22 14:20:32 xxxxxxxxxxx prometheus-postgresql-adapter: {"caller":"log.go:39","err":"dial tcp xxxxxxxx:5432: connect: connection refused","level":"error","msg":"Error on Begin when writing samples","ts":"2019-08-22T14:20:32.692Z"}

Maybe this implementation gives idea how to properly handle database outages etc.
https://github.com/Klarrio/pglock