YotpoLtd/metorikku

How to run on AWS EMR?


Hello,

I'm trying to run Metorikku on AWS EMR as a Spark application. I used a bootstrap action to copy the necessary files to each node, but I'm having trouble coming up with the right command.

This is what I have:

spark-submit --deploy-mode cluster --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --jars /home/some_folder/hoodie-spark-bundle-0.4.7.jar --class com.yotpo.metorikku.Metorikku s3://some_bucket/metorikku.jar -c /home/some_folder/config.yaml

It looks like it doesn't recognize the -c option. Also, it can't find hoodie-spark-bundle-0.4.7.jar in the directory I copied the file to. I think it is confusing -c with --conf. Is the above the correct way to run it on EMR?

Thank you

@Rap70r You need to copy the metorikku jar to your driver (in your case you're using deploy-mode cluster, so you need to copy the jar to all workers using a bootstrap script, see the sketch below); I don't think you can use a remote jar that sits somewhere on S3. What we do is copy the config and job YAML files to the cluster, along with the metorikku jar.
The same goes for the hoodie jar: if you're using deploy-mode cluster, you need to make sure the jar is available on all nodes.
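For example, a minimal bootstrap-action sketch along these lines (the bucket and target paths are just the placeholders from your command above, so adjust them to your setup):

#!/bin/bash
# Hypothetical EMR bootstrap action: copy the Metorikku jar, the hoodie bundle,
# and the config YAML from S3 onto every node when the cluster is provisioned.
set -euo pipefail
aws s3 cp s3://some_bucket/metorikku.jar /home/some_folder/metorikku.jar
aws s3 cp s3://some_bucket/hoodie-spark-bundle-0.4.7.jar /home/some_folder/hoodie-spark-bundle-0.4.7.jar
aws s3 cp s3://some_bucket/config.yaml /home/some_folder/config.yaml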

Hi RonBarabash,

Thank you for getting back to me.
I see. Ok. I'll try to follow that pattern.
By the way, the EMR cluster must have Spark 2.2, right? Does the jar work with Spark 2.3?

Thank you

Hi RonBarabash,

I copied all the files to the EMR box. I tried running the command below with Spark 2.3.0:

spark-submit --deploy-mode client --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --jars /mnt1/hoodie_spark_bundle.jar --class com.yotpo.metorikku.Metorikku /mnt1/metorikku.jar -c /mnt1/config.yaml

I got the following error:

SparkContext: Error initializing SparkContext.
javax.xml.parsers.FactoryConfigurationError: Provider for class javax.xml.parsers.DocumentBuilderFactory cannot be created
java.lang.RuntimeException: Provider for class javax.xml.parsers.DocumentBuilderFactory cannot be created
java.util.ServiceConfigurationError: javax.xml.parsers.DocumentBuilderFactory: Provider org.apache.xerces.jaxp.DocumentBuilderFactoryImpl not found

Have you seen this before?

Thank you

Yes. Just add --packages xerces:xercesImpl:2.8.0
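With the submit command from above, that would look something like this (same local paths as before):

spark-submit --deploy-mode client --packages xerces:xercesImpl:2.8.0 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --jars /mnt1/hoodie_spark_bundle.jar --class com.yotpo.metorikku.Metorikku /mnt1/metorikku.jar -c /mnt1/config.yaml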

> By the way, the EMR cluster must have Spark 2.2, right? Does the jar work with Spark 2.3?

Nope, all Spark versions above 2 are supported. We're running on 2.4.4.

Hi lyogev,

Thank you for getting back to me.
I resolved that issue by adding the package you suggested.
I was wondering if --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 is also required, because I was getting this error:

java.lang.ClassNotFoundException: kafka.DefaultSource

After including both with --packages xerces:xercesImpl:2.8.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 it got past that point, but I ran into something else down the road that I'm trying to resolve:

java.lang.ClassNotFoundException: io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient

I'm not sure why it throws that error. I have the schema registry running on localhost.
Have you seen that before?

Thank you.

Check out our spark-submit command here for all the packages needed when running with the schema registry:

- SUBMIT_COMMAND=spark-submit --jars http://central.maven.org/maven2/com/uber/hoodie/hoodie-spark-bundle/0.4.7/hoodie-spark-bundle-0.4.7.jar --repositories http://packages.confluent.io/maven/ --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3,org.apache.kafka:kafka_2.11:2.2.0,org.apache.spark:spark-avro_2.11:2.4.3,io.confluent:kafka-avro-serializer:5.1.2 --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://hive:9083 --conf spark.sql.warehouse.dir=/warehouse --class com.yotpo.metorikku.Metorikku metorikku.jar -c examples/kafka/kafka_example_cdc.yaml
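Broken across lines for readability (note that thrift://hive:9083 and /warehouse are specific to our environment, so you'd adapt those for EMR; io.confluent:kafka-avro-serializer is the package that transitively pulls in the schema registry client classes):

spark-submit \
  --jars http://central.maven.org/maven2/com/uber/hoodie/hoodie-spark-bundle/0.4.7/hoodie-spark-bundle-0.4.7.jar \
  --repositories http://packages.confluent.io/maven/ \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3,org.apache.kafka:kafka_2.11:2.2.0,org.apache.spark:spark-avro_2.11:2.4.3,io.confluent:kafka-avro-serializer:5.1.2 \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://hive:9083 \
  --conf spark.sql.warehouse.dir=/warehouse \
  --class com.yotpo.metorikku.Metorikku metorikku.jar \
  -c examples/kafka/kafka_example_cdc.yaml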