YotpoLtd/metorikku

Error when running metorikku-standalone.jar: DataNucleus not found


Hello,

I'm running metorikku-standalone.jar with hoodie-spark-bundle-0.4.7.jar locally and I'm getting the following error:

NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.

Can you please help?

Thank you

Hi,
Can you send the full command you are using here, and the job file?

Thanks

Hi lyogev,

Thank you for getting back to me.
The metric file I'm using is exactly like this one:
metric.yaml: https://raw.githubusercontent.com/YotpoLtd/metorikku/master/examples/kafka/kafka2hudi_cdc.yaml

And the config file is exactly like this:
config.yaml: https://raw.githubusercontent.com/YotpoLtd/metorikku/master/examples/kafka/kafka_example_cdc.yaml

The only difference in the config file is that I replaced the kafka, schema-registry, and hive hosts with localhost:

  • localhost:9092
  • localhost:8081
  • localhost:10000

And here is the full command:
java -Dspark.master=local[*] -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -cp hadoop-aws-2.7.5.jar -cp aws-java-sdk-1.7.4.jar -cp hoodie-spark-bundle-0.4.7.jar -cp metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml

Thank you

Can you add this jar to the classpath as well?
https://repo1.maven.org/maven2/org/datanucleus/datanucleus-core/3.2.10/datanucleus-core-3.2.10.jar

It's not packaged with the assembled Spark distribution.
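Also note that java does not combine repeated -cp flags, only the last one takes effect, so all the jars have to be joined into a single colon-separated classpath. Roughly (assuming the jars, including the DataNucleus one, sit in the working directory):

java -Dspark.master=local[*] -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -cp datanucleus-core-3.2.10.jar:hadoop-aws-2.7.5.jar:aws-java-sdk-1.7.4.jar:hoodie-spark-bundle-0.4.7.jar:metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml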

Hi lyogev,

After adding a couple of DataNucleus libraries, that error went away. I can see data written to the output now. Thank you.
However, I'm getting a crash at a later point.
The error is: ERROR MicroBatchExecution: Query
Failed to get update last commit time synced
NoSuchObjectException(message:default.hoodie_test table not found)

Have you seen this error before? I'm trying to figure it out.
Also, I'm not passing hoodie-spark-bundle-0.4.7.jar on the classpath. I believe it is included in the standalone jar file.

Command:
java -Dspark.master=local[*] -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -cp datanucleus-core-3.2.10.jar:datanucleus-api-jdo-3.2.8.jar:datanucleus-rdbms-3.2.9.jar:metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml

Thank you

Regarding hoodie: correct, it's not needed on the classpath, it's already bundled (in the standalone version).
Regarding the error, what you are missing here is the Hive configuration. Hoodie is heavily coupled to Hive: you need Spark to be able to write to Hive, and you need hoodie to write to Hive.
So first, your Spark command needs to be aware of Hive:
-Dspark.hadoop.hive.metastore.uris=thrift://hive:9083 -Dspark.sql.catalogImplementation=hive

Then in the job file make sure hoodie output is configured with:
hiveJDBCURL: jdbc:hive2://hive:10000

As you can see, Spark communicates directly with the Hive metastore, while hoodie communicates with the Hive server.
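So in the job (config) file, the hoodie output section would look roughly like this sketch (hiveJDBCURL is the key mentioned above; the surrounding nesting and the dir value are assumptions that should be checked against the kafka_example_cdc.yaml example, and localhost matches the setup described earlier in the thread):

output:
  hudi:
    dir: /tmp/metorikku/output
    hiveJDBCURL: jdbc:hive2://localhost:10000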

I think you can skip the Hive sync simply by omitting tableName from the hoodie output in the metric file, but I never tested this.
If that doesn't work, omit tableName and add:

extraOptions:
  hoodie.table.name: test
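In that case the hoodie output in the metric file would look roughly like this sketch, with tableName dropped (dataFrameName, path, keyColumn and timeColumn are placeholder values, and the option names should be checked against the kafka2hudi_cdc.yaml example):

output:
- dataFrameName: myDataFrame
  outputType: Hudi
  outputOptions:
    path: output/hoodie_test
    keyColumn: id
    timeColumn: ts
    extraOptions:
      hoodie.table.name: test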

Hi lyogev,

That was it. I simply added those 2 extra configs to the command and it worked.

Command:
java -Dspark.master=local[*] -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.hadoop.hive.metastore.uris=thrift://localhost:9083 -Dspark.sql.catalogImplementation=hive -cp datanucleus-core-3.2.10.jar:datanucleus-api-jdo-3.2.8.jar:datanucleus-rdbms-3.2.9.jar:metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml

I knew it had something to do with that config value. I was modifying hive.metastore.uris in hive-site.xml all day, but it never worked. After seeing your suggestion, I added it to the command and it worked. I'm not sure why it wasn't picking up the value from hive-site.xml.

Thank you so much.

To use hive-site.xml files etc., you need the regular Metorikku version (not the standalone) with spark-submit; it then picks up files from the Spark configuration folders.
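For example, assuming hive-site.xml is placed under Spark's conf directory, the regular jar would be launched roughly like this (the jar file name is a placeholder for whatever the Metorikku release artifact is called):

cp hive-site.xml $SPARK_HOME/conf/
spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c config.yaml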