
HBase Spark Demo

Jimvin opened this issue

Loading data into HBase is not trivial. We want the demo to show how this can be done and to provide some guidance and best practice.

Aims

  • Load data row by row (NiFi)
  • Batch processing CSV files (MapReduce)
  • Direct load of HFiles (sketched after this list)
  • Test HBase Spark connector (stackabletech/stackablectl#71)
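
Of these, the direct HFile load is the least self-explanatory, so here is a minimal sketch of the idea in Java Spark (untested; the input path, output path, column family and CSV layout are placeholders, not the demo's actual values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class CsvToHFiles {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("csv-to-hfiles").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    Configuration conf = HBaseConfiguration.create();

    byte[] cf = Bytes.toBytes("cf");     // placeholder column family
    byte[] col = Bytes.toBytes("value"); // placeholder qualifier

    // HFiles must be written in row-key order, hence the sortByKey().
    JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = jsc
        .textFile("hdfs:///data/tripdata.csv") // placeholder input
        .mapToPair(line -> {
          String[] fields = line.split(",");
          byte[] rowKey = Bytes.toBytes(fields[0]);
          return new Tuple2<>(new ImmutableBytesWritable(rowKey),
              new KeyValue(rowKey, cf, col, Bytes.toBytes(fields[1])));
        })
        .sortByKey();

    cells.saveAsNewAPIHadoopFile("hdfs:///tmp/hfiles", // placeholder output
        ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);
  }
}

A real bulk load would additionally use HFileOutputFormat2.configureIncrementalLoad so the output partitioning matches the table's region boundaries, and then hand the HFiles to HBase with the completebulkload tool.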

Tasks

  • Load data into HDFS from S3
  • Parse CSV and create HFiles
  • Load incremental HFiles into HBase
  • Load a streaming data source into HBase
  • Stackable cluster configuration
  • Verify the data is there (sanity check) using HBase shell
  • Create Phoenix view over table (see the JDBC sketch after this list)
  • Configure Phoenix as a data source in Superset
  • Create a visualisation using Phoenix JDBC and Superset
  • Query HBase using Spark HBase connector
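
For the Phoenix tasks, a minimal JDBC sketch might look like the following (the ZooKeeper quorum, view name and column mapping are assumptions for illustration, not the demo's actual values):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSanityCheck {
  public static void main(String[] args) throws Exception {
    // Phoenix thick-driver URL: jdbc:phoenix:<zookeeper-quorum>:<port>
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zookeeper:2181");
         Statement stmt = conn.createStatement()) {
      // Map a Phoenix view over the existing HBase table (column mapping assumed).
      stmt.execute("CREATE VIEW IF NOT EXISTS \"cycling-tripdata\" "
          + "(pk VARCHAR PRIMARY KEY, \"cf\".\"value\" VARCHAR)");
      // Sanity check: count the rows through the view.
      try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM \"cycling-tripdata\"")) {
        while (rs.next()) {
          System.out.println("rows: " + rs.getLong(1));
        }
      }
    }
  }
}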

Learning Points and Challenges

  • Where do DistCp and the HBase bulk load run, given there is no YARN cluster?
  • Are these jobs scalable?
  • Can we build near-real-time dashboards in Grafana and see updates instantly?
  • Stress testing
  • Test HBase region management - can we watch this in real time as part of a demo?

Choose your Java version first. As of October 2022, the connector only compiles and passes its tests with Java 8. However, we depend on Java 11 in our images.

mvn -Dspark.version=3.3.0 -Dscala.version=2.12.14 -Dhadoop-three.version=3.3.2 -Dscala.binary.version=2.12 -Dhbase.version=2.4.12 -DrecompileMode=all clean package

The built .jar files can be found in Nexus.

This article shows how to access HBase using the Spark shell: https://kontext.tech/article/628/spark-connect-to-hbase
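
Roughly the same access from application code, using the hbase-spark connector's data source rather than the shell, might look like this in Java (a sketch only; the table name and column mapping are made up for illustration, and hbase-site.xml is assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.spark.HBaseContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConnectorRead {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hbase-connector-read").getOrCreate();
    Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath

    // The connector expects an HBaseContext to be created before the data source is used.
    new HBaseContext(spark.sparkContext(), conf, null);

    Dataset<Row> df = spark.read()
        .format("org.apache.hadoop.hbase.spark")
        .option("hbase.table", "cycling-tripdata")
        .option("hbase.columns.mapping", "id STRING :key, name STRING cf:name")
        .load();
    df.show();
  }
}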

Hi @Jimvin,
in case you want to continue the hbase-spark-connector test during my holiday, you will find the status quo on branch 87 in stackablectl.

After updating the hbase-connectors repo, my Maven build fails:

[INFO] --- gmaven-plugin:1.5:execute (default) @ hbase-spark ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache HBase - Spark 1.0.1-SNAPSHOT:
[INFO]
[INFO] Apache HBase - Spark ............................... SUCCESS [  3.120 s]
[INFO] Apache HBase - Spark Protocol ...................... SUCCESS [  3.778 s]
[INFO] Apache HBase - Spark Protocol (Shaded) ............. SUCCESS [  1.922 s]
[INFO] Apache HBase - Spark Connector ..................... FAILURE [  4.405 s]
[INFO] Apache HBase - Spark Integration Tests ............. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  13.905 s
[INFO] Finished at: 2022-09-27T22:34:54+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.gmaven:gmaven-plugin:1.5:execute (default) on project hbase-spark: Execution default of goal org.codehaus.gmaven:gmaven-plugin:1.5:execute failed: An API incompatibility was encountered while executing org.codehaus.gmaven:gmaven-plugin:1.5:execute: java.lang.ExceptionInInitializerError: null
[ERROR] -----------------------------------------------------
[ERROR] realm =    plugin>org.codehaus.gmaven:gmaven-plugin:1.5
[ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
[ERROR] urls[0] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/gmaven-plugin/1.5/gmaven-plugin-1.5.jar
[ERROR] urls[1] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/runtime/gmaven-runtime-api/1.5/gmaven-runtime-api-1.5.jar
[ERROR] urls[2] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/feature/gmaven-feature-api/1.5/gmaven-feature-api-1.5.jar
[ERROR] urls[3] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/runtime/gmaven-runtime-loader/1.5/gmaven-runtime-loader-1.5.jar
[ERROR] urls[4] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/feature/gmaven-feature-support/1.5/gmaven-feature-support-1.5.jar
[ERROR] urls[5] = file:/Users/Simon/.m2/repository/org/codehaus/gmaven/runtime/gmaven-runtime-support/1.5/gmaven-runtime-support-1.5.jar
[ERROR] urls[6] = file:/Users/Simon/.m2/repository/org/sonatype/gshell/gshell-io/2.4/gshell-io-2.4.jar
[ERROR] urls[7] = file:/Users/Simon/.m2/repository/org/codehaus/plexus/plexus-utils/3.0/plexus-utils-3.0.jar
[ERROR] urls[8] = file:/Users/Simon/.m2/repository/com/thoughtworks/qdox/qdox/1.12/qdox-1.12.jar
[ERROR] urls[9] = file:/Users/Simon/.m2/repository/org/apache/maven/shared/file-management/1.2.1/file-management-1.2.1.jar
[ERROR] urls[10] = file:/Users/Simon/.m2/repository/org/apache/maven/shared/maven-shared-io/1.1/maven-shared-io-1.1.jar
[ERROR] urls[11] = file:/Users/Simon/.m2/repository/org/apache/xbean/xbean-reflect/3.4/xbean-reflect-3.4.jar
[ERROR] urls[12] = file:/Users/Simon/.m2/repository/log4j/log4j/1.2.12/log4j-1.2.12.jar
[ERROR] urls[13] = file:/Users/Simon/.m2/repository/commons-logging/commons-logging-api/1.1/commons-logging-api-1.1.jar
[ERROR] urls[14] = file:/Users/Simon/.m2/repository/com/google/collections/google-collections/1.0/google-collections-1.0.jar
[ERROR] urls[15] = file:/Users/Simon/.m2/repository/org/apache/maven/reporting/maven-reporting-impl/2.0.4.1/maven-reporting-impl-2.0.4.1.jar
[ERROR] urls[16] = file:/Users/Simon/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.1/plexus-interpolation-1.1.jar
[ERROR] urls[17] = file:/Users/Simon/.m2/repository/commons-validator/commons-validator/1.2.0/commons-validator-1.2.0.jar
[ERROR] urls[18] = file:/Users/Simon/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar
[ERROR] urls[19] = file:/Users/Simon/.m2/repository/commons-digester/commons-digester/1.6/commons-digester-1.6.jar
[ERROR] urls[20] = file:/Users/Simon/.m2/repository/commons-logging/commons-logging/1.0.4/commons-logging-1.0.4.jar
[ERROR] urls[21] = file:/Users/Simon/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar
[ERROR] urls[22] = file:/Users/Simon/.m2/repository/xml-apis/xml-apis/1.0.b2/xml-apis-1.0.b2.jar
[ERROR] urls[23] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-core/1.0-alpha-10/doxia-core-1.0-alpha-10.jar
[ERROR] urls[24] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.0-alpha-10/doxia-sink-api-1.0-alpha-10.jar
[ERROR] urls[25] = file:/Users/Simon/.m2/repository/org/apache/maven/reporting/maven-reporting-api/2.0.4/maven-reporting-api-2.0.4.jar
[ERROR] urls[26] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-site-renderer/1.0-alpha-10/doxia-site-renderer-1.0-alpha-10.jar
[ERROR] urls[27] = file:/Users/Simon/.m2/repository/org/codehaus/plexus/plexus-i18n/1.0-beta-7/plexus-i18n-1.0-beta-7.jar
[ERROR] urls[28] = file:/Users/Simon/.m2/repository/org/codehaus/plexus/plexus-velocity/1.1.7/plexus-velocity-1.1.7.jar
[ERROR] urls[29] = file:/Users/Simon/.m2/repository/org/apache/velocity/velocity/1.5/velocity-1.5.jar
[ERROR] urls[30] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-decoration-model/1.0-alpha-10/doxia-decoration-model-1.0-alpha-10.jar
[ERROR] urls[31] = file:/Users/Simon/.m2/repository/commons-collections/commons-collections/3.2/commons-collections-3.2.jar
[ERROR] urls[32] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-module-apt/1.0-alpha-10/doxia-module-apt-1.0-alpha-10.jar
[ERROR] urls[33] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-module-fml/1.0-alpha-10/doxia-module-fml-1.0-alpha-10.jar
[ERROR] urls[34] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-module-xdoc/1.0-alpha-10/doxia-module-xdoc-1.0-alpha-10.jar
[ERROR] urls[35] = file:/Users/Simon/.m2/repository/org/apache/maven/doxia/doxia-module-xhtml/1.0-alpha-10/doxia-module-xhtml-1.0-alpha-10.jar
[ERROR] urls[36] = file:/Users/Simon/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar
[ERROR] urls[37] = file:/Users/Simon/.m2/repository/org/sonatype/gossip/gossip/1.2/gossip-1.2.jar
[ERROR] Number of foreign imports: 1
[ERROR] import: Entry[import  from realm ClassRealm[project>org.apache.hbase.connectors:spark:1.0.1-SNAPSHOT, parent: ClassRealm[maven.api, parent: null]]]

When executing the spark-k8s application I currently receive an error. It looks like a platform mismatch (an x86-64 image emulated on ARM):

++ id -u
+ myuid=1000
++ id -g
+ mygid=0
+ set +e
++ getent passwd 1000
+ uidentry=stackable:x:1000:1000::/stackable:/bin/bash
+ set -e
+ '[' -z stackable:x:1000:1000::/stackable:/bin/bash ']'
+ '[' -z /usr/lib/jvm/jre-11 ']'
+ SPARK_CLASSPATH=':/stackable/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/stackable/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /stackable/spark/bin/spark-submit --conf spark.driver.bindAddress=10.244.1.38 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class tech.stackable.demo.spark local:////Users/Simon/Repo/stackable/stackablectl/demos/hbase-hdfs-load-cycling-data/sparkHbaseAccess/target/sparkHbaseAccess-1.0-SNAPSHOT.jar --hbaseSite /arguments/hbase-site.xml --tableName cycling-tripdata
qemu-x86_64: Could not open '/lib64/ld-linux-x86-64.so.2': No such file or directory

This ticket is on hold.
We need a strategy to get the hbase-spark-connector working with Java 11.
The current status is saved on this branch.

  • Build on top of the hbase-hdfs-cycling-demo
  • A Job copying a .jar from our Nexus to S3 (MinIO)
  • Creating Secrets for access
  • A Java Spark application that simply scans an HBase table (a rough sketch follows this list)
  • Mounting the HBase config into Spark
  • TODO: Build with Java 11 https://github.com/apache/hbase-connectors/tree/master/spark
  • TODO: Publish hbase-spark-connector.jar to our Nexus and add it as a dependency to the pom of the Java project
  • TODO: Test whether hbase-spark-connector.jar needs to be distributed to the HBase region servers or whether it is not needed (see the configuration notes in the repo)
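
For orientation, the table scan itself is plain HBase client code; below is a rough sketch of what such an application might do (the real code lives on the branch above; the config path and table name here mirror the spark-submit arguments seen earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanHBaseTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.addResource(new Path("/arguments/hbase-site.xml")); // the mounted HBase config
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("cycling-tripdata"));
         ResultScanner scanner = table.getScanner(new Scan())) {
      for (Result result : scanner) {
        System.out.println(Bytes.toString(result.getRow())); // print each row key
      }
    }
  }
}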