banzaicloud/spark-metrics

Parsing error

azelezni opened this issue · 6 comments

I'm unable to send any metrics to the Prometheus Pushgateway; I'm getting the following error:

2019-01-16 08:16:41 INFO  PrometheusSink:54 - metricsNamespace=None, sparkAppName=None, sparkAppId=None, executorId=None
2019-01-16 08:16:41 INFO  PrometheusSink:54 - role=shuffle, job=shuffle
2019-01-16 08:16:41 INFO  PushGatewayWithTimestamp:217 - Sending metrics data to 'http://fkpr-prometheus-pushgateway.fkpr:9091/metrics/job/shuffle/role/shuffle'
2019-01-16 08:16:41 INFO  PushGatewayWithTimestamp:247 - Error response from http://fkpr-prometheus-pushgateway.fkpr:9091/metrics/job/shuffle/role/shuffle
2019-01-16 08:16:41 INFO  PushGatewayWithTimestamp:250 - text format parsing error in line 244: second HELP line for metric name "HiveExternalCatalog_fileCacheHits"
2019-01-16 08:16:41 ERROR PushGatewayWithTimestamp:255 - Sending metrics failed due to: 
java.io.IOException: Response code from http://fkpr-prometheus-pushgateway.fkpr:9091/metrics/job/shuffle/role/shuffle was 400
	at com.banzaicloud.metrics.prometheus.client.exporter.PushGatewayWithTimestamp.doRequest(PushGatewayWithTimestamp.java:252)
	at com.banzaicloud.metrics.prometheus.client.exporter.PushGatewayWithTimestamp.pushAdd(PushGatewayWithTimestamp.java:168)
	at com.banzaicloud.spark.metrics.sink.PrometheusSink$Reporter.report(PrometheusSink.scala:122)
	at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162)
	at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
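For context, the Pushgateway returns 400 when the pushed payload introduces the same metric name twice. A minimal illustration (not the actual payload, which the log truncates) of the text exposition format that triggers this, using the metric name from the log above:

```
# HELP HiveExternalCatalog_fileCacheHits (help text)
# TYPE HiveExternalCatalog_fileCacheHits counter
HiveExternalCatalog_fileCacheHits 0.0
# HELP HiveExternalCatalog_fileCacheHits (help text)   <- second HELP line for the
# TYPE HiveExternalCatalog_fileCacheHits counter          same name: parse error, 400
HiveExternalCatalog_fileCacheHits 0.0
```

This is typically a symptom of two collectors (or two processes sharing one job/role label set) contributing the same metric to a single push.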

Hi @azelezni, can you share the metrics-name-capture-regex and metrics-name-replacement settings of your PrometheusSink config?

What version of spark-metrics library are you using?

Hi @stoader, I'm using Spark 2.3.2 with the latest spark-metrics release.
I was able to work around this using the following metrics.properties:

*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=fkpr-prometheus-pushgateway.fkpr:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
*.sink.prometheus.pushgateway-enable-timestamp=false
*.sink.prometheus.enable-dropwizard-collector=true
*.sink.prometheus.enable-jmx-collector=false
master.sink.prometheus.metrics-name-capture-regex=(.*)
master.sink.prometheus.metrics-name-replacement=master_$1
worker.sink.prometheus.metrics-name-capture-regex=(.*)
worker.sink.prometheus.metrics-name-replacement=worker_$1
executor.sink.prometheus.metrics-name-capture-regex=(.*)
executor.sink.prometheus.metrics-name-replacement=executor_$1
driver.sink.prometheus.metrics-name-capture-regex=(.*)
driver.sink.prometheus.metrics-name-replacement=driver_$1
applications.sink.prometheus.metrics-name-capture-regex=(.*)
applications.sink.prometheus.metrics-name-replacement=app_$1
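For readers wondering why this config avoids the collision: each instance prefixes its metric names, so e.g. the master and a worker no longer push identical names. The capture-regex / replacement pair behaves like an ordinary Java regex substitution; a minimal sketch (class and method names are illustrative, not the sink's API):

```java
// Sketch of the per-instance renaming done by the
// metrics-name-capture-regex / metrics-name-replacement pair:
// the metric name is matched against the regex and rewritten
// with the replacement pattern ($1 = first capture group).
public class MetricRename {
    static String rename(String name, String captureRegex, String replacement) {
        // replaceFirst, because replaceAll with "(.*)" would also
        // rewrite the trailing zero-length match and duplicate the prefix
        return name.replaceFirst(captureRegex, replacement);
    }

    public static void main(String[] args) {
        // With capture-regex=(.*) and replacement=executor_$1:
        System.out.println(rename("jvm_heap_used", "(.*)", "executor_$1"));
        // -> executor_jvm_heap_used
    }
}
```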

@azelezni can you provide the metrics.properties that reproduces the issue?

Note that the exception above was published from the shuffle service:

2019-01-16 08:16:41 INFO  PrometheusSink:54 - metricsNamespace=None, sparkAppName=None, sparkAppId=None, executorId=None
2019-01-16 08:16:41 INFO  PrometheusSink:54 - role=shuffle, job=shuffle

Are you running the external shuffle service as well?

If not, then the reason the metrics are reported as coming from shuffle is that spark-metrics is currently written for Spark jobs where metrics are published from the driver, executors, and shuffle service; it is not prepared for standalone Spark deployments (see https://github.com/banzaicloud/spark-metrics/blob/2.3-2.0.4/src/main/scala/com/banzaicloud/spark/metrics/sink/PrometheusSink.scala#L79).
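For anyone hitting the same symptom in standalone mode, the role classification is roughly of this shape (a hedged sketch inferred from the log output above, not the library's exact code; see the linked PrometheusSink.scala for the real logic):

```java
// Rough sketch: spark.executor.id is "driver" on the driver and a
// number on executors; anything else (including the unset executorId
// of a standalone master/worker, logged above as executorId=None)
// falls through to the "shuffle" role.
public class RoleGuess {
    static String role(String executorId) {
        if ("driver".equals(executorId)) return "driver";
        if (executorId != null && executorId.matches("\\d+")) return "executor";
        return "shuffle"; // standalone master/worker land here too
    }

    public static void main(String[] args) {
        System.out.println(role(null)); // -> shuffle
    }
}
```

This is why every standalone process pushes under the same job/role labels and the pushed payloads collide, producing the duplicate-HELP parse error.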

The following metrics.properties causes the error:

*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=fkpr-prometheus-pushgateway.fkpr:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
*.sink.prometheus.pushgateway-enable-timestamp=false
*.sink.prometheus.enable-dropwizard-collector=true
*.sink.prometheus.enable-jmx-collector=false

Yes, I'm running Spark standalone; with the regex replacement in place it's good enough for my needs.

Can you re-run with debug log level enabled? That would log the payload being sent to the pushgateway.

Closing this issue. Please reopen if it surfaces again.