audienceproject/spark-dynamodb

ClassNotFoundException: com.audienceproject.spark.dynamodb.datasource.DynamoWriterFactory


Scala Version: 2.12.10 (EMR release emr-6.2.0)

I submit the job to an AWS EMR on EKS virtual cluster (emr-containers) as follows:

aws emr-containers start-job-run \
--virtual-cluster-id xxx \
--name spark-pi \
--execution-role-arn arn:aws:iam::xxx:role/xxx \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://xxx/spark-scripts/xxx-spark.py",
        "entryPointArguments" : ["s3://xxx/data/xxx"],
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --packages com.audienceproject:spark-dynamodb_2.12:1.1.2"
        }
    }'

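For context, the entry-point script writes a DataFrame out through the spark-dynamodb connector, roughly like the sketch below (the table name and input format are placeholders, not the real script):

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("xxx-spark").getOrCreate()

    # Placeholder: read the path passed via entryPointArguments
    # (the real script's input format may differ).
    df = spark.read.parquet(sys.argv[1])

    # This write is what exercises DynamoWriterFactory on the executors;
    # "SomeTable" is a made-up table name.
    df.write.option("tableName", "SomeTable").format("dynamodb").save()
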
This results in the following exception:

21/07/30 17:33:24 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.137.191, executor 2): java.lang.ClassNotFoundException: com.audienceproject.spark.dynamodb.datasource.DynamoWriterFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
	at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

When I get a shell into the driver pod's container, I can see com.audienceproject_spark-dynamodb_2.12-1.1.2.jar in .ivy2, and the spark-kubernetes-driver logs for that pod show that the dependency was resolved:

:: resolution report :: resolve 2886ms :: artifacts dl 497ms
	:: modules in use:
	...
	com.audienceproject#spark-dynamodb_2.12;1.1.2 from central in [default]
        ...
	:: evicted modules:
        ...

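If it helps narrow this down, I can also print the jars the SparkContext has registered from inside the driver, to check whether the ivy-resolved spark-dynamodb jar is ever handed to Spark for shipping to the executors (just a diagnostic sketch, assuming the usual "spark" session):

    # Driver-side diagnostic (sketch): which jars does Spark think it will
    # serve to the executors? If the resolved spark-dynamodb jar is not
    # listed here, the executors never receive it.
    print(spark.sparkContext.getConf().get("spark.jars", "<not set>"))
    print(spark.sparkContext._jsc.sc().listJars().mkString("\n"))
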
It looks like someone had a similar issue: https://githubmemory.com/repo/audienceproject/spark-dynamodb/issues/45, but the start-job-run request is the only place where I specify the dependency:

        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --packages com.audienceproject:spark-dynamodb_2.12:1.1.2"

Do I have to install the package anywhere else?
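For example, would I also need to stage the connector jar in S3 and pass it with spark.jars, something like the line below? (The jar path is made up, and I have not tried this.)

        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1 --conf spark.jars=s3://xxx/jars/spark-dynamodb_2.12-1.1.2.jar --packages com.audienceproject:spark-dynamodb_2.12:1.1.2"
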

Someone else also hit a similar error because they depended on an unreleased Maven version, but I am using 2.12:1.1.2, which is a released artifact: https://githubmemory.com/repo/audienceproject/spark-dynamodb/issues/47

Any ideas on how to solve this?