YotpoLtd/metorikku

Error when running metorikku-standalone.jar: DataNucleus not found


Hello,

I'm running metorikku-standalone.jar with hoodie-spark-bundle-0.4.7.jar locally and I'm getting the following error:

NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.

Can you please help?

Thank you

Hi,
Can you send the full command you are using here, and the job file?

Thanks

Hi lyogev,

Thank you for getting back to me.
The metric file I'm using is exactly like this one:
metric.yaml: https://raw.githubusercontent.com/YotpoLtd/metorikku/master/examples/kafka/kafka2hudi_cdc.yaml

And the config file is exactly like this:
config.yaml: https://raw.githubusercontent.com/YotpoLtd/metorikku/master/examples/kafka/kafka_example_cdc.yaml

The only difference in the config file is that I replaced the kafka, schema-registry, and hive hosts with localhost:

  • localhost:9092
  • localhost:8081
  • localhost:10000

And here is the full command:
java -Dspark.master=local[*] -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -cp hadoop-aws-2.7.5.jar -cp aws-java-sdk-1.7.4.jar -cp hoodie-spark-bundle-0.4.7.jar -cp metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml

Thank you

Can you add this jar to the classpath as well?
https://repo1.maven.org/maven2/org/datanucleus/datanucleus-core/3.2.10/datanucleus-core-3.2.10.jar

It's not packaged with the assembled Spark distribution.
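Also note that java does not combine repeated -cp flags, only the last one takes effect, so all the jars have to be joined into a single colon-separated classpath. Roughly (assuming the jars, including the DataNucleus one, sit in the working directory):

java -Dspark.master=local[*] -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -cp datanucleus-core-3.2.10.jar:hadoop-aws-2.7.5.jar:aws-java-sdk-1.7.4.jar:hoodie-spark-bundle-0.4.7.jar:metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml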

Hi lyogev,

After adding a couple of DataNucleus libraries, that error went away. I can see data written to the output now. Thank you.
However, I'm getting a crash at a later point.
The error is: ERROR MicroBatchExecution: Query
Failed to get update last commit time synced
NoSuchObjectException(message:default.hoodie_test table not found)

Have you seen this error before? I'm trying to figure it out.
Also, I'm not passing hoodie-spark-bundle-0.4.7.jar on the classpath. I believe it is included in the standalone jar file.

Command:
java -Dspark.master=local[*] -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -cp datanucleus-core-3.2.10.jar:datanucleus-api-jdo-3.2.8.jar:datanucleus-rdbms-3.2.9.jar:metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml

Thank you

Regarding hoodie: correct, it's not needed on the classpath, it's already bundled (in the standalone version).
Regarding the error, what you are missing here is the Hive configuration. Hoodie is heavily coupled to Hive: you need Spark to be able to write to Hive, and you need hoodie to write to Hive.
So first, your Spark command needs to be aware of Hive:
-Dspark.hadoop.hive.metastore.uris=thrift://hive:9083 -Dspark.sql.catalogImplementation=hive

Then in the job file make sure hoodie output is configured with:
hiveJDBCURL: jdbc:hive2://hive:10000

As you can see, Spark communicates directly with the Hive metastore, while hoodie communicates with the Hive server.
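So in the job (config) file, the hoodie output section would look roughly like this sketch (hiveJDBCURL is the key mentioned above; the surrounding nesting and the dir value are assumptions that should be checked against the kafka_example_cdc.yaml example, and localhost matches the setup described earlier in the thread):

output:
  hudi:
    dir: /tmp/metorikku/output
    hiveJDBCURL: jdbc:hive2://localhost:10000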

I think you can skip the Hive sync simply by omitting tableName from the hoodie output in the metric file, but I never tested this.
If that doesn't work, omit tableName and add:

extraOptions:
  hoodie.table.name: test
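In that case the hoodie output in the metric file would look roughly like this sketch, with tableName dropped (dataFrameName, path, keyColumn and timeColumn are placeholder values, and the option names should be checked against the kafka2hudi_cdc.yaml example):

output:
- dataFrameName: myDataFrame
  outputType: Hudi
  outputOptions:
    path: output/hoodie_test
    keyColumn: id
    timeColumn: ts
    extraOptions:
      hoodie.table.name: test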

Hi lyogev,

That was it. I simply added those 2 extra configs to the command and it worked.

Command:
java -Dspark.master=local[*] -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.hadoop.hive.metastore.uris=thrift://localhost:9083 -Dspark.sql.catalogImplementation=hive -cp datanucleus-core-3.2.10.jar:datanucleus-api-jdo-3.2.8.jar:datanucleus-rdbms-3.2.9.jar:metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml

I knew it had something to do with that config value. I was modifying hive.metastore.uris in hive-site.xml all day, but it never worked. After seeing your suggestion, I added it to the command and it worked. I'm not sure why it wasn't picking up the value from hive-site.xml.

Thank you so much.

To use hive-site.xml files etc., you need the regular Metorikku version (not the standalone) with spark-submit; it then picks up files from the Spark configuration folders.
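For example, assuming hive-site.xml is placed under Spark's conf directory, the regular jar would be launched roughly like this (the jar file name is a placeholder for whatever the Metorikku release artifact is called):

cp hive-site.xml $SPARK_HOME/conf/
spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c config.yaml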