Pushgateway Read timed out
Drewster727 opened this issue · 2 comments
I'm experiencing an odd issue where my Spark workers will randomly begin reporting that they cannot connect to my pushgateway.
2020-04-13 11:41:55 ERROR ScheduledReporter:184 - Exception thrown from Reporter#report. Exception was suppressed.
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:315)
at io.prometheus.client.exporter.PushGateway.pushAdd(PushGateway.java:182)
at com.banzaicloud.spark.metrics.sink.PrometheusSink$Reporter.report(PrometheusSink.scala:98)
at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:242)
at com.codahale.metrics.ScheduledReporter.lambda$start$0(ScheduledReporter.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have verified the pushgateway is up and running and I can connect to it without issue.
However, I did notice that my pushgateway's memory usage keeps climbing. The only thing sending metrics to it is my Spark workers, via this package/library.
I thought perhaps it could be a pushgateway issue, but then I found this issue on the pushgateway repo:
prometheus/pushgateway#340
That seems to indicate that whatever is pushing metrics into the gateway (this lib) is not disposing of its connections properly?
Any assistance would be greatly appreciated. The error is not blocking my workers, but it is very annoying: it spams the logs and causes instability in the pushgateway.
jars+versions
collector-0.12.0.jar
metrics-core-4.1.2.jar
simpleclient-0.8.1.jar
simpleclient_common-0.8.1.jar
simpleclient_dropwizard-0.8.1.jar
simpleclient_pushgateway-0.8.1.jar
snakeyaml-1.16.jar
spark-metrics_2.11-2.3-3.0.1.jar
Thanks!
spark-metrics pushes metrics to Pushgateway using the pushgateway client library: https://github.com/banzaicloud/spark-metrics/blob/2.3-3.0.1/src/main/scala/com/banzaicloud/spark/metrics/sink/PrometheusSink.scala#L98 --> https://github.com/prometheus/client_java/blob/parent-0.8.1/simpleclient_pushgateway/src/main/java/io/prometheus/client/exporter/PushGateway.java#L181
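For context, the call the sink makes boils down to the standard client_java push API. A minimal sketch of that path (hypothetical metric, job and address names; this is not the sink's actual code) looks like this:

```java
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

public class PushExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Pushgateway address; replace with your own host:port.
        PushGateway pg = new PushGateway("pushgateway.example.com:9091");

        CollectorRegistry registry = new CollectorRegistry();
        Gauge duration = Gauge.build()
                .name("my_batch_duration_seconds")
                .help("Duration of the batch in seconds.")
                .register(registry);
        duration.set(42.0);

        // pushAdd() issues a single HTTP request; internally the client opens an
        // HttpURLConnection and disconnects it in a finally block, so each report
        // is a short-lived connection rather than a leaked one.
        pg.pushAdd(registry, "my_spark_job");
    }
}
```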
If there were a connection leak it would have to be in the pushgateway client lib; however, looking at the source code, the client lib always disconnects before returning: https://github.com/prometheus/client_java/blob/parent-0.8.1/simpleclient_pushgateway/src/main/java/io/prometheus/client/exporter/PushGateway.java#L328
The increased memory usage of your Pushgateway instance might be caused by the pitfalls described in https://www.robustperception.io/common-pitfalls-when-using-the-pushgateway, which can be avoided by using custom group keys (#46) that do not include the instance field.
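At the client level, a grouping key without the instance label looks like the sketch below (hypothetical label and job names; the sink exposes this through its own configuration, see #46):

```java
import java.util.Collections;
import java.util.Map;
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.exporter.PushGateway;

public class GroupingKeyExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical address; registry would be populated as in the previous snippet.
        PushGateway pg = new PushGateway("pushgateway.example.com:9091");
        CollectorRegistry registry = new CollectorRegistry();

        // Group by a stable key such as the Spark app name instead of the
        // per-executor instance, so executor restarts do not keep creating
        // new metric groups that pile up in the Pushgateway.
        Map<String, String> groupingKey = Collections.singletonMap("app_name", "my-spark-app");

        pg.pushAdd(registry, "my_spark_job", groupingKey);
    }
}
```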
Not sure what was causing this, but disabling consistency checks per prometheus/pushgateway#340 resolved my issue...
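For anyone landing here later: that means starting the Pushgateway with its consistency check turned off, roughly like the line below (flag name taken from the discussion in prometheus/pushgateway#340 and assumed to match your version; check pushgateway --help to confirm):

```
pushgateway --push.disable-consistency-check
```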