banzaicloud/spark-metrics

Pushgateway Read timed out

Drewster727 opened this issue · 2 comments

I'm experiencing an odd issue where my spark workers will randomly begin reporting that they cannot connect to my pushgateway.

2020-04-13 11:41:55 ERROR ScheduledReporter:184 - Exception thrown from Reporter#report. Exception was suppressed.
java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:315)
	at io.prometheus.client.exporter.PushGateway.pushAdd(PushGateway.java:182)
	at com.banzaicloud.spark.metrics.sink.PrometheusSink$Reporter.report(PrometheusSink.scala:98)
	at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:242)
	at com.codahale.metrics.ScheduledReporter.lambda$start$0(ScheduledReporter.java:182)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I have verified the pushgateway is up and running and I can connect to it without issue.
However, I did notice that my pushgateway's memory usage keeps piling up. The only thing sending metrics to it is my spark workers, via this package/library.

I thought perhaps it could be a pushgateway issue, but then I found this issue on the pushgateway repo:
prometheus/pushgateway#340

That seems to indicate that whatever is pushing metrics into the gateway (this lib) is not disposing of the connection properly?

Any assistance would be greatly appreciated. The error is not blocking my workers, but it is very annoying: it spams the logs and destabilizes the pushgateway.

jars + versions:

collector-0.12.0.jar
metrics-core-4.1.2.jar
simpleclient-0.8.1.jar
simpleclient_common-0.8.1.jar
simpleclient_dropwizard-0.8.1.jar
simpleclient_pushgateway-0.8.1.jar
snakeyaml-1.16.jar
spark-metrics_2.11-2.3-3.0.1.jar

Thanks!

spark-metrics pushes metrics to Pushgateway using the pushgateway client library: https://github.com/banzaicloud/spark-metrics/blob/2.3-3.0.1/src/main/scala/com/banzaicloud/spark/metrics/sink/PrometheusSink.scala#L98 --> https://github.com/prometheus/client_java/blob/parent-0.8.1/simpleclient_pushgateway/src/main/java/io/prometheus/client/exporter/PushGateway.java#L181

If there is a connection leak it must be in the pushgateway client lib; however, looking at the source code, the client lib always disconnects the connection before returning: https://github.com/prometheus/client_java/blob/parent-0.8.1/simpleclient_pushgateway/src/main/java/io/prometheus/client/exporter/PushGateway.java#L328
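
To make the push path concrete, roughly the following happens on each report: the Dropwizard metrics are bridged into a Prometheus CollectorRegistry and pushed with pushAdd, and the client library tears down the HTTP connection itself. This is a simplified sketch with illustrative names (PushSketch, pushMetrics), not the sink's actual code:

import java.util

import com.codahale.metrics.MetricRegistry
import io.prometheus.client.CollectorRegistry
import io.prometheus.client.dropwizard.DropwizardExports
import io.prometheus.client.exporter.PushGateway

// Simplified sketch of the push path -- illustrative names, not the sink's actual code.
object PushSketch {
  def pushMetrics(metricRegistry: MetricRegistry,
                  pushGatewayAddress: String,           // e.g. "pushgateway:9091"
                  job: String,
                  groupingKey: util.Map[String, String]): Unit = {
    // Bridge the Dropwizard (codahale) metrics registry into a Prometheus registry.
    val collectorRegistry = new CollectorRegistry()
    collectorRegistry.register(new DropwizardExports(metricRegistry))

    // pushAdd() opens an HttpURLConnection, sends the metrics and, per the client
    // source linked above, disconnects the connection before returning -- so the
    // connection is not left open on the client side.
    val pushGateway = new PushGateway(pushGatewayAddress)
    pushGateway.pushAdd(collectorRegistry, job, groupingKey)
  }
}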

The increased memory usage of your Pushgateway instance might be caused by one of the pitfalls described in https://www.robustperception.io/common-pitfalls-when-using-the-pushgateway, which can be avoided by using custom group keys that do not include the instance field: #46.
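
To illustrate why the instance field matters: every distinct grouping key creates its own metric group, and the Pushgateway keeps each group in memory until it is explicitly deleted, so a key containing a per-worker value grows without bound while a stable key is simply overwritten on each push. A hypothetical sketch reusing the pushMetrics helper from above (the key names here are examples only, not the sink's configuration):

import scala.collection.JavaConverters._
import com.codahale.metrics.MetricRegistry

object GroupingKeySketch {
  def main(args: Array[String]): Unit = {
    // A grouping key containing a value that is unique per worker (or per push)
    // creates a brand new metric group on the Pushgateway every time it changes,
    // and the Pushgateway keeps every group in memory until it is deleted:
    val perInstanceKey = Map("instance" -> java.net.InetAddress.getLocalHost.getHostName).asJava

    // A stable grouping key without the instance field keeps the number of groups
    // bounded; repeated pushes just overwrite the same group:
    val stableKey = Map("role" -> "executor", "app" -> "my-spark-app").asJava

    // Reusing the hypothetical helper from the sketch above:
    PushSketch.pushMetrics(new MetricRegistry(), "pushgateway:9091", "spark", stableKey)
  }
}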

cc @sancyx @baluchicken

Not sure what was causing this, but disabling consistency checks per prometheus/pushgateway#340 resolved my issue...