Prototypes a Spark-HBase-Database app without Spark read-write connectors. Specifically, this app uses no Spark connector to read from or write to HBase or a JDBC store. While Spark 2.4 ships with a JDBC read-write connector, that connector does not support updates. Hence the challenge of building a truly clusterable Spark app without Spark read-write connectors.
Spark task serialization issues are a challenge, to put it mildly. Earlier versions of this app relied too heavily on a task closure accessing external HBase and H2 proxies, which Spark must then serialize along with the task. The current implementation first creates temporary HBase and H2 proxies before any Spark session exists. Only after all pre-Spark session work has completed is a Spark session created. A Dataset is then created from a sequence of pre-scanned HBase row keys. Finally, new HBase and H2 proxies are created within the Spark task closure, with the intention that all HBase and H2 code executes on a Spark worker node. Only pre-Spark session code should execute on the driver.
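
The serialization problem described above can be reproduced without Spark, since Spark ships task closures using Java serialization. The following is a minimal sketch, assuming Scala 2.12 or later (where function literals are serializable). `Proxy` and `ClosureCheck` are hypothetical names for illustration only; they are not classes from this app.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for an HBase or H2 proxy; deliberately NOT Serializable.
class Proxy(val name: String) {
  def lookup(key: String): String = s"$name:$key"
}

object ClosureCheck {
  // Attempts to Java-serialize a value, as Spark does with a task closure.
  def isSerializable(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(value) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Captures a proxy built outside the closure (the earlier, problematic style).
  def capturing(proxy: Proxy): String => String =
    key => proxy.lookup(key)

  // Captures only a String and builds the proxy inside the closure.
  def selfContained(name: String): String => String =
    key => new Proxy(name).lookup(key)

  def main(args: Array[String]): Unit = {
    val driverProxy = new Proxy("hbase")
    println(s"capturing closure serializable: ${isSerializable(capturing(driverProxy))}")
    println(s"self-contained closure serializable: ${isSerializable(selfContained("hbase"))}")
  }
}
```

Under those assumptions the first check is expected to report `false` and the second `true`, which is the reason the proxies are created inside the task closure rather than captured from the driver.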
- Create HBase and H2 proxies.
- Create HBase key-value table.
- Put key-value pairs into HBase key-value table.
- Scan HBase key-value table for all row keys.
- Create H2 key-value table.
- HBase and H2 proxies are destroyed by GC.
- Create HBase and H2 proxies.
- Create Spark session.
- Create Dataset from sequence of HBase row keys.
- For each row key, get the JSON value via the HBase client.
- Convert the JSON value to a Scala object.
- Insert Scala object into key-value H2 table.
- Update Scala object in H2 key-value table.
- HBase and H2 proxies are destroyed by GC.
- Spark session is closed.
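
The worker-side steps above might be sketched as follows. This is not the app's actual code: the table name `kv`, column family `cf`, qualifier `value`, H2 column names `id` and `json`, the H2 URL and credentials, and the sample row keys are all assumptions, and the JSON-to-Scala-object conversion and the H2 update step are omitted. It assumes the Spark SQL, HBase client, and H2 driver jars are on the classpath, and that the HBase and H2 tables already exist.

```scala
import java.sql.DriverManager

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object WorkerSideSketch {
  // Runs once per partition on a worker. Both connections are opened here,
  // so nothing non-serializable is captured from the driver.
  val processPartition: Iterator[String] => Unit = rowKeys => {
    val hbase = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = hbase.getTable(TableName.valueOf("kv")) // assumed table name
    // Assumed H2 server URL and default credentials.
    val h2 = DriverManager.getConnection("jdbc:h2:tcp://localhost/~/kv", "sa", "")
    val insert = h2.prepareStatement("insert into kv (id, json) values (?, ?)")
    try {
      rowKeys.foreach { rowKey =>
        val result = table.get(new Get(Bytes.toBytes(rowKey)))
        // Assumed column family "cf" and qualifier "value".
        val json = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value")))
        insert.setString(1, rowKey)
        insert.setString(2, json)
        insert.executeUpdate()
      }
    } finally {
      insert.close()
      h2.close()
      table.close()
      hbase.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Stands in for the row keys scanned before the Spark session was created.
    val rowKeys = Seq("1", "2", "3")
    val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._
    try spark.createDataset(rowKeys).foreachPartition(processPartition)
    finally spark.stop()
  }
}
```

Opening the connections per partition rather than per row keeps connection setup cost down. Passing a value explicitly typed as `Iterator[String] => Unit` also sidesteps the overload ambiguity that `foreachPartition` can hit with a bare lambda on Scala 2.12.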
Normally I would use Homebrew to install, start and stop HBase. In this case, however, I strongly recommend following this guide: http://hbase.apache.org/book.html#quickstart
If useful, consider adding $HBASE_HOME/bin to your exported $PATH, for example: export PATH=$PATH:$HBASE_HOME/bin
- hbase/bin$ ./hbase shell
- hbase/bin$ ./start-hbase.sh
- sbt clean compile run
- hbase/bin$ ./stop-hbase.sh
- HBase: http://localhost:16010/master-status
- Spark: http://localhost:4040
- Control-C
- ./target/app.log