awesome-kyuubi/hadoop-testing

Add Hudi component into hadoop-testing

Mixing data lake table formats seems to be problematic, i.e. the extended Catalyst rules and SQL grammars conflict with each other.
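
For example, the conflicts tend to surface once several sets of extensions are loaded into one session. A sketch (the Iceberg and Delta coordinates and class names are the usual ones for Spark 3.4, assumed here, not taken from this thread):

# hypothetical: Hudi + Iceberg + Delta extensions in a single session
spark-sql \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1,org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3,io.delta:delta-core_2.12:2.4.0 \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension'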

Hi Mr Blue, wdyt about putting the pre-downloaded, table-format-specific dependencies into the default Ivy cache (spark.jars.ivy)?

So when users want to play with the Lakehouse suite, they can just run commands like the ones below (copied from the official Hudi Spark guide):

# for spark shell:
spark-shell --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'

# for spark sql:
spark-sql --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'

This avoids re-downloading from the remote Maven repository and would not contaminate other components.
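
A minimal sketch of how the image build could warm the default Ivy cache (a hypothetical Dockerfile step; the /opt/spark path and the SELECT 1 smoke query are assumptions):

# hypothetical Dockerfile step: resolve the bundle into ~/.ivy2 at image build time
RUN /opt/spark/bin/spark-sql \
      --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
      -e 'SELECT 1'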

Maybe some place like /opt/hudi/xxx.jar? Then the user could run spark-sql --jars /opt/hudi/xxx.jar even offline.
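
For example, an image-build step like this (a sketch; the Maven Central URL just follows the standard repository layout):

# hypothetical Dockerfile step: pre-download the bundle to a fixed local path
RUN mkdir -p /opt/hudi && \
    curl -fSL -o /opt/hudi/hudi-spark3.4-bundle_2.12-0.14.1.jar \
      https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.1/hudi-spark3.4-bundle_2.12-0.14.1.jar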

sounds good~

One more thing: Hudi depends on HFile as the internal format of its metadata table (for MOR tables). HFile comes from hbase-server (version 2.4.9), but hbase-server depends on Hadoop 2.x, so integrating it with Hadoop 3.x throws a NoSuchMethodError.

I had to rebuild HBase with the hadoop-3 profile, then rebuild hudi-spark-bundle.

More information is here: https://hudi.apache.org/docs/troubleshooting#how-can-i-resolve-the-nosuchmethoderror-from-hbase-when-using-hudi-with-metadata-table-on-hdfs
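
Roughly, the rebuild looks like this (a sketch only; see the linked doc for the authoritative steps, and note that -Dhadoop.profile=3.0 is HBase's switch for building against Hadoop 3):

# rebuild HBase 2.4.9 against Hadoop 3 and install it to the local Maven repo
git clone --branch rel/2.4.9 --depth 1 https://github.com/apache/hbase.git
cd hbase && mvn clean install -DskipTests -Dhadoop.profile=3.0 && cd ..
# rebuild the Hudi Spark bundle so it packages the Hadoop-3-built HBase jars
git clone --branch release-0.14.1 --depth 1 https://github.com/apache/hudi.git
cd hudi && mvn clean package -DskipTests -pl packaging/hudi-spark-bundle -am -Dspark3.4 -Dscala-2.12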

So providing an out-of-the-box bundle would benefit more users.

Since it's not included in the classpath by default, bundling it is fine.