ClassNotFoundException: com.audienceproject.spark.dynamodb.datasource.DynamoWriterFactory
Scala version: 2.12.10 (hence the spark-dynamodb_2.12 artifact)
I submit the job to an AWS EMR on EKS cluster as follows:
aws emr-containers start-job-run \
--virtual-cluster-id xxx \
--name spark-pi \
--execution-role-arn arn:aws:iam::xxx:role/xxx \
--release-label emr-6.2.0-latest \
--job-driver '{
"sparkSubmitJobDriver": {
"entryPoint": "s3://xxx/spark-scripts/xxx-spark.py",
"entryPointArguments" : ["s3://xxx/data/xxx"],
"sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --packages com.audienceproject:spark-dynamodb_2.12:1.1.2"
}
}'
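For reference, I believe the same dependency can also be declared via the spark.jars.packages conf instead of the --packages flag (as far as I understand, both go through the same Ivy resolution, so I'm not sure it changes how jars are distributed to executors):

```shell
# Same sparkSubmitParameters, but with the connector declared through
# spark.jars.packages instead of --packages (equivalent as far as I can tell)
"sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --conf spark.jars.packages=com.audienceproject:spark-dynamodb_2.12:1.1.2"
```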
This results in the following exception:
21/07/30 17:33:24 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.137.191, executor 2): java.lang.ClassNotFoundException: com.audienceproject.spark.dynamodb.datasource.DynamoWriterFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
When I get a shell into the driver pod's container, I see com.audienceproject_spark-dynamodb_2.12-1.1.2.jar in .ivy2, and if I pull the spark-kubernetes-driver logs for that pod, I see that the dependency was resolved:
:: resolution report :: resolve 2886ms :: artifacts dl 497ms
:: modules in use:
...
com.audienceproject#spark-dynamodb_2.12;1.1.2 from central in [default]
...
:: evicted modules:
...
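I haven't yet confirmed whether the jar actually makes it onto the executor pods, which is where the deserialization fails. Something like the following is what I'd check next (namespace and pod name are placeholders):

```shell
# Look for the connector jar on one of the executor pods
# (namespace and pod name below are placeholders)
kubectl exec -n my-namespace my-executor-pod -- \
  sh -c 'find / -name "*spark-dynamodb*" -type f 2>/dev/null'
```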
It looks like someone had a similar issue here: https://githubmemory.com/repo/audienceproject/spark-dynamodb/issues/45, but the start-job-run request is the only place I specify the dependency:
"sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --packages com.audienceproject:spark-dynamodb_2.12:1.1.2"
Do I have to install the package anywhere else?
Someone else hit a similar issue caused by depending on unreleased Maven artifacts, but I am using 2.12:1.1.2, which is released: https://githubmemory.com/repo/audienceproject/spark-dynamodb/issues/47
Any ideas on how to solve this?