Apache spark code to extract data from a JDBC relational database to HDFS
using MySql with some opendata example datasets CDH 5.8(local install on my laptop following https://github.com/krisgeus/ansible_local_cdh_hadoop
Inspired by blog of joe fallon: http://blog.joefallon.net/2013/10/install-mysql-on-mac-osx-using-homebrew/ brew install mysql mysql.server restart
DONT!!!mysql_secure_installation
DONT!!!unset TMPDIR
DONT!!!mysql_install_db --verbose --user=whoami
--basedir="$(brew --prefix mysql)" --datadir=/usr/local/var/mysql --tmpdir=/tmp
mysql -u root
create database sourcedb;
create user 'sourcedb'@'localhost' identified by 'sourcedb';
grant all on sourcedb.* to 'sourcedb'@'localhost';
exit
mysql -u sourcedb -D sourcedb -p
https://www.ntu.edu.sg/home/ehchua/programming/sql/SampleDatabases.html
download and extract sakila.tar.gz
git clone https://github.com/datacharmer/test_db.git
cd test-db
mysql -u root < employees.sql
mysql -u root
SOURCE /Users/kgeusebroek/dev/xebia/godatadriven/projects/sourcedbload/sampledb/sakila-db/sakila-schema.sql
SOURCE /Users/kgeusebroek/dev/xebia/godatadriven/projects/sourcedbload/sampledb/sakila-db/sakila-data.sql
grant all on sakila.* to 'sourcedb'@'localhost';
grant all on employees.* to 'sourcedb'@'localhost';
exit
In the notebooks directory we have a zeppelin notebook with the experimentation code
The following settings are needed to make this work:
<property>
<name>zeppelin.notebook.dir</name>
<value>${git clone dir of rdbms-to-hdfs}/notebooks/</value>
<description>path or URI for notebook persist</description>
</property>
Inside the notebook some specific spark interperter settings are mentioned. this mainly consists of adding the needed dependencies for making the jdbc connection.
Since we use mysql we use the mysql:mysql-connector-java:6.0.3
## Running the steps
### Step1a Full load of tables to parquet files on HDFS
spark-submit --class nl.krisgeus.jdbc.JdbcMain target/scala-2.10/rdbms-to-hdfs-assembly-0.1-SNAPSHOT.jar -s 0 -c employees -o /tmp/sourcedb
or
spark-submit --class nl.krisgeus.jdbc.JdbcMain target/scala-2.10/rdbms-to-hdfs-assembly-0.1-SNAPSHOT.jar --step 0 --config employees --output /tmp/sourcedb