banzaicloud/spark-metrics

Caused by: java.lang.ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink

nemo83 opened this issue · 18 comments

Hello,

first of all, thanks for putting this lib together. I'm a fan of Spark and Prometheus, but there was nothing to bridge these two worlds, and you guys did an amazing job.

I started using your lib a few weeks ago and everything was working fine, until a few days ago a new structured streaming job on EMR 5.14 failed to launch executors.

The error:

18/06/08 12:46:47 ERROR MetricsSystem: Sink class com.banzaicloud.spark.metrics.sink.PrometheusSink cannot be instantiated
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1854)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:235)
        at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
        at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
        at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
        at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:364)
        at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:200)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:228)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)

The ARGS[] of the aws emr create step:

    --repositories, https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases, \
    --packages, '"com.banzaicloud:spark-metrics_2.11:2.3-1.0.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0"', \
    --conf, spark.driver.extraJavaOptions="${CONFIG_FILE}", \
    --conf, spark.metrics.conf="/tmp/prometheus-sink.conf", \
    --conf, spark.metrics.namespace="interleaving-stream", \
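
For reference, the metrics config referenced above (`/tmp/prometheus-sink.conf`) would look roughly like the following sketch; the Pushgateway address and protocol below are placeholders, not the values from this job:

```
# Minimal PrometheusSink metrics config sketch; address/protocol are placeholders
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=pushgateway.example.com:9091
```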

And the libs in build.sbt:

```
libraryDependencies ++= Seq(
  "io.prometheus" % "simpleclient" % "0.3.0",
  "io.prometheus" % "simpleclient_dropwizard" % "0.3.0",
  "io.prometheus" % "simpleclient_pushgateway" % "0.3.0",
  "io.dropwizard.metrics" % "metrics-core" % "3.1.2"
)
```

Do you guys have any clue how it's possible that the container terminates with the class not found?

Note that pushing to the Prometheus Pushgateway works fine.

Hi @nemo83, is this failing consistently?

We use GitHub here as a Maven repository to hold the spark-metrics jars (see https://github.com/banzaicloud/spark-metrics/tree/master/maven-repo/releases/com/banzaicloud/spark-metrics_2.11/2.3-1.0.0).

If the class-not-found exception is not consistent, then probably there was some temporary issue with downloading the jars from GitHub.

If it's a consistent failure, then I suspect there is some classpath issue. Can you list all the jars that are on the class path of the executors?
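
One generic way to check this on a worker node (a sketch, not EMR-specific) is to inspect the executor JVM's command line, since its `-cp` argument lists every jar visible to the system classloader at startup:

```
# On a node that runs an executor: split the process command line on ':' and keep jar entries
ps -ef | grep CoarseGrainedExecutorBackend | tr ':' '\n' | grep '\.jar'
```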

Hi @nemo83, do you still experience this issue?

@rmunoz527 can you give it another try? It might have been an intermittent GitHub access issue.

```
$ curl -v -L raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases
*   Trying 151.101.0.133...
* TCP_NODELAY set
* Connected to raw.github.com (151.101.0.133) port 80 (#0)
> GET /banzaicloud/spark-metrics/master/maven-repo/releases HTTP/1.1
> Host: raw.github.com
> User-Agent: curl/7.55.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Server: Varnish
< Retry-After: 0
< Location: https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases
< Content-Length: 0
< Accept-Ranges: bytes
< Date: Thu, 04 Oct 2018 19:56:35 GMT
< Via: 1.1 varnish
< Connection: close
< X-Served-By: cache-lax8634-LAX
< X-Cache: HIT
< X-Cache-Hits: 0
<
* Closing connection 0
* Issue another request to this URL: 'https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases'
*   Trying 151.101.128.133...
* TCP_NODELAY set
* Connected to raw.github.com (151.101.128.133) port 443 (#1)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@strength
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
    CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=www.github.com
*  start date: Mar 23 00:00:00 2017 GMT
*  expire date: May 13 12:00:00 2020 GMT
*  subjectAltName: host "raw.github.com" matched cert's "*.github.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
*  SSL certificate verify ok.
> GET /banzaicloud/spark-metrics/master/maven-repo/releases HTTP/1.1
> Host: raw.github.com
> User-Agent: curl/7.55.1
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: https://raw.githubusercontent.com/banzaicloud/spark-metrics/master/maven-repo/releases
< Content-Length: 0
< Accept-Ranges: bytes
< Date: Thu, 04 Oct 2018 19:56:35 GMT
< Via: 1.1 varnish
< Age: 0
< Connection: keep-alive
< X-Served-By: cache-bur17531-BUR
< X-Cache: MISS
< X-Cache-Hits: 0
< Vary: Accept-Encoding
< X-Fastly-Request-ID: d2f61d0beb7d5113f92668acff72d72a66da1678
<
* Connection #1 to host raw.github.com left intact
* Issue another request to this URL: 'https://raw.githubusercontent.com/banzaicloud/spark-metrics/master/maven-repo/releases'
*   Trying 151.101.64.133...
* TCP_NODELAY set
* Connected to raw.githubusercontent.com (151.101.64.133) port 443 (#2)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@strength
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
    CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=www.github.com
*  start date: Mar 23 00:00:00 2017 GMT
*  expire date: May 13 12:00:00 2020 GMT
*  subjectAltName: host "raw.githubusercontent.com" matched cert's "*.githubusercontent.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
*  SSL certificate verify ok.
> GET /banzaicloud/spark-metrics/master/maven-repo/releases HTTP/1.1
> Host: raw.githubusercontent.com
> User-Agent: curl/7.55.1
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
< Strict-Transport-Security: max-age=31536000
< X-Content-Type-Options: nosniff
< X-Frame-Options: deny
< X-XSS-Protection: 1; mode=block
< X-GitHub-Request-Id: CE66:5382:1D1B70F:1EA6722:5BB67073
< Content-Length: 15
< Accept-Ranges: bytes
< Date: Thu, 04 Oct 2018 19:56:35 GMT
< Via: 1.1 varnish
< Connection: keep-alive
< X-Served-By: cache-lax8631-LAX
< X-Cache: MISS
< X-Cache-Hits: 0
< X-Timer: S1538682995.485670,VS0,VE27
< Vary: Authorization,Accept-Encoding
< Access-Control-Allow-Origin: *
< X-Fastly-Request-ID: 1ef72ba8a56a7b6d56b540881339aa80cceb943f
< Expires: Thu, 04 Oct 2018 20:01:35 GMT
< Source-Age: 0
<
404: Not Found
* Connection #2 to host raw.githubusercontent.com left intact
```

@rmunoz527 can you check whether downloading the jar directly with `curl -v -L https://raw.githubusercontent.com/banzaicloud/spark-metrics/master/maven-repo/releases/com/banzaicloud/spark-metrics_2.11/2.3-1.1.0/spark-metrics_2.11-2.3-1.1.0.jar -O /var/tmp` results in the same 404 Not Found error?

So that works:
```
$ curl -v -L https://raw.githubusercontent.com/banzaicloud/spark-metrics/master/maven-repo/releases/com/banzaicloud/spark-metrics_2.11/2.3-1.1.0/spark-metrics_2.11-2.3-1.1.0.jar -O
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 151.101.0.133...
* TCP_NODELAY set
* Connected to raw.githubusercontent.com (151.101.0.133) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@strength
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
    CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [108 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [3182 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=www.github.com
*  start date: Mar 23 00:00:00 2017 GMT
*  expire date: May 13 12:00:00 2020 GMT
*  subjectAltName: host "raw.githubusercontent.com" matched cert's "*.githubusercontent.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
*  SSL certificate verify ok.
} [5 bytes data]
> GET /banzaicloud/spark-metrics/master/maven-repo/releases/com/banzaicloud/spark-metrics_2.11/2.3-1.1.0/spark-metrics_2.11-2.3-1.1.0.jar HTTP/1.1
> Host: raw.githubusercontent.com
> User-Agent: curl/7.55.1
> Accept: */*
>
{ [5 bytes data]
< HTTP/1.1 200 OK
< Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
< Strict-Transport-Security: max-age=31536000
< X-Content-Type-Options: nosniff
< X-Frame-Options: deny
< X-XSS-Protection: 1; mode=block
< ETag: "8f98f1a4dd154f67bf7334d763cfaa5dc7b20c57"
< Content-Type: application/octet-stream
< Cache-Control: max-age=300
< X-Geo-Block-List:
< X-GitHub-Request-Id: 4F12:5384:63E2795:68F86BC:5BB67603
< Content-Length: 57715
< Accept-Ranges: bytes
< Date: Thu, 04 Oct 2018 20:21:07 GMT
< Via: 1.1 varnish
< Connection: keep-alive
< X-Served-By: cache-lax8634-LAX
< X-Cache: HIT
< X-Cache-Hits: 1
< X-Timer: S1538684467.166186,VS0,VE1
< Vary: Authorization,Accept-Encoding
< Access-Control-Allow-Origin: *
< X-Fastly-Request-ID: 21ec3c1db574146053c907c33faf7ec2a7a5a2ee
< Expires: Thu, 04 Oct 2018 20:26:07 GMT
< Source-Age: 48
<
{ [1919 bytes data]
100 57715  100 57715    0     0  57715      0  0:00:01 --:--:--  0:00:01  368k
* Connection #0 to host raw.githubusercontent.com left intact
```

What's strange is that the driver seems to be fine using the provided configuration settings; it's just the executors that have issues launching because the jar is not available.

@rmunoz527 can you describe how your executors try to pull the spark-metrics jars? Are you running Spark on Kubernetes?

@stoader I am running Spark on YARN. I worked around the issue by downloading the jars locally and setting spark.executor.extraClassPath=$JARS. Off-topic question, but I see the executor metrics have the executor id in the metric name. Was wondering if that is by design; it becomes a problem when trying to graph all executor metrics for a component. An example would be:

    application_1534579456045_1197_1_executor_filesystem_file_largeRead_ops
    application_1534579456045_1197_2_executor_filesystem_file_largeRead_ops
    application_1534579456045_1197_3_executor_filesystem_file_largeRead_ops
    application_1534579456045_1197_4_executor_filesystem_file_largeRead_ops

@rmunoz527 the names of the metrics are generated inside Spark, so this is out of the scope of the sink. However, you can alter the metric names either using [metrics-name-capture-regex / metrics-name-replacement](https://github.com/banzaicloud/spark-metrics/blob/master/PrometheusSink.md#how-to-enable-prometheussink-in-spark) or Prometheus relabelling.
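
For the example names above, a sketch along these lines should collapse the per-executor prefix (the regex and replacement values here are illustrative and untested):

```
# Illustrative only: rewrite application_<appId>_<executorId>_executor_... to executor_...
*.sink.prometheus.metrics-name-capture-regex=application_[0-9]+_[0-9]+_[0-9]+_(.*)
*.sink.prometheus.metrics-name-replacement=$1
```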

Regarding executors, it is expected that the sink jar is available on the machine where the executor runs, and that its location is specified for the executor through spark.executor.extraClassPath (judging by the stack trace above, the executor's metrics system starts during executor startup, before application jars fetched via --packages are on the classpath). YARN might be capable of downloading the jars upfront before the executor is started; however, I'm not familiar with how spark-submit works with YARN.
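
A sketch of that wiring (the jar path below is a placeholder for wherever the jar, along with its Prometheus client dependencies, was copied on the worker nodes):

```
spark-submit \
  --conf spark.executor.extraClassPath=/var/tmp/spark-metrics_2.11-2.3-1.1.0.jar \
  --conf spark.metrics.conf=/tmp/prometheus-sink.conf \
  ...
```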

Same issue here. All classes are packaged into a single jar, and I had no problems with that until today.
I have checked that com.banzaicloud.spark.metrics.sink.PrometheusSink is present in the jar file, but Spark cannot find it when the application runs on EMR.

@nikita-clearscale if all classes are packaged into a single jar and it was working for you before, this sounds like an issue related to the environment. Can you repro this in your local env? Also, do you have logs that show what the classloader does?

@stoader thanks for the quick reply.

> Can you repro this in your local env?

No, I can't, because EMR has a different loading mechanism.

> Also, do you have logs that show what the classloader does?

Could you please share a link, or advise on how to do that with EMR on AWS?

tisy commented

Hi, I'm having the same issue. Everything is in the fat jar, but I'm getting the following error:

Uncaught exception: java.lang.ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
        at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
        at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
        at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
        at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
        at org.apache.spark.deploy.yarn.ApplicationMaster.createAllocator(ApplicationMaster.scala:454)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:481)
        at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
        at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)

@nikita-clearscale I'm not familiar with EMR; we are running Spark on Kubernetes. On EMR, can you specify options for the spark-submit command? If yes, can you try adding `--conf "spark.executor.extraJavaOptions=-verbose:class" --conf "spark.driver.extraJavaOptions=-verbose:class"` to see if it outputs more verbose logs that may give a hint on what's going on behind the scenes?
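
Spelled out as a spark-submit fragment, that would be something like the sketch below; with `-verbose:class` the JVM logs a `[Loaded <class> from <source>]` line for every class it loads, so the container logs should show whether (and from where) the sink class ever gets loaded:

```
spark-submit \
  --conf "spark.driver.extraJavaOptions=-verbose:class" \
  --conf "spark.executor.extraJavaOptions=-verbose:class" \
  ...
```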

@nikita-clearscale can you check whether this works for you: #28 (comment)