Just some simple Spark code built using the demo infrastructure and process.

Implementation of a file-based Change Data Capture (CDC) flow. The flow can be run in local mode with default arguments by running the class `uk.co.pragmasoft.experiments.bigdata.spark.cdc.SampleCDC`, or submitted to the Spark cluster with the command line:

```
spark-submit \
  --master <master> \
  --class uk.co.pragmasoft.experiments.bigdata.spark.cdc.SampleCDC \
  spark-experiments_2.10-1.0.jar \
  <root path to CDC files>
```
where `<root path to CDC files>` can be something like `file:/Users/stefano/projects/spark-experiments/data/cdc` if you submit to a local cluster, or an HDFS path if you submit to an HDFS-enabled installation. The CDC files directory is assumed to contain two files, `fullData.csv` and `newData.csv`. The output is generated into the `out` sub-folder.
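The heart of such a file-based CDC flow is a keyed comparison of the two snapshots: keys only in `newData.csv` are inserts, keys only in `fullData.csv` are deletes, and keys present in both with a different payload are updates. Below is a minimal sketch of that comparison; the CSV layout (key in the first column) and the output encoding are assumptions for illustration, not the actual format used by `SampleCDC`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal file-based CDC sketch: full outer join of the old and new snapshots.
object CdcSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cdc-sketch").setMaster("local[*]"))
    val rootPath = args.headOption.getOrElse("./data/cdc")

    // Assumed CSV layout: first column is the record key, the rest is the payload
    def keyed(path: String) = sc.textFile(path).map { line =>
      val cols = line.split(",", 2)
      (cols(0), if (cols.length > 1) cols(1) else "")
    }

    val full    = keyed(s"$rootPath/fullData.csv")
    val updated = keyed(s"$rootPath/newData.csv")

    // The full outer join classifies every key as insert, delete, update or unchanged
    val changes = full.fullOuterJoin(updated).flatMap {
      case (key, (None, Some(n)))              => Some(s"INSERT,$key,$n")
      case (key, (Some(_), None))              => Some(s"DELETE,$key")
      case (key, (Some(o), Some(n))) if o != n => Some(s"UPDATE,$key,$n")
      case _                                   => None // unchanged record
    }

    changes.saveAsTextFile(s"$rootPath/out")
    sc.stop()
  }
}
```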
Application of the CDC flow with one of the sources being an Oracle table. The main class is `uk.co.pragmasoft.experiments.bigdata.spark.dbimport.CustomerCDCDataBaseImporter` and it takes the following parameters (a sketch of the resulting JDBC read follows the list):

- `rootPath`: the root path to the CDC files; the program expects to find the `fullData.csv` file there. Defaults to `./data/cdc`.
- `dbServerConnection`: the connection to the DB, in the format `(user/pwd)@//host(:port)`. Defaults to `system/oracle@//192.168.59.103:49161`.
- `dbName`: the SID of the DB to connect to. Defaults to `xe`.
- `local`: if specified, causes the Spark context to be built in local mode.
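As a hedged illustration of what such an import involves, the sketch below parses a connection string in the `(user/pwd)@//host(:port)` format, builds an Oracle thin-driver JDBC URL, and reads a table as a DataFrame with the Spark 1.x JDBC data source. The `CUSTOMER` table name and the parsing details are assumptions for the example, not the actual implementation of the importer:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: turn the documented parameters into a JDBC read of an Oracle table.
object OracleImportSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("oracle-import-sketch"))
    val sqlContext = new SQLContext(sc)

    // Parameter defaults taken from the list above
    val dbServerConnection = "system/oracle@//192.168.59.103:49161"
    val dbName = "xe"

    // Split "user/pwd@//host(:port)" into credentials and host
    val Array(credentials, hostPort) = dbServerConnection.split("@//")
    val Array(user, password)        = credentials.split("/")

    // Oracle thin-driver URL pointing at the XE service
    val url = s"jdbc:oracle:thin:$user/$password@//$hostPort/$dbName"

    // Load the CUSTOMER table (the table name is an assumption for this sketch);
    // requires ojdbc6.jar on the driver and executor classpaths
    val customers = sqlContext.read.format("jdbc")
      .options(Map(
        "url"     -> url,
        "dbtable" -> "CUSTOMER",
        "driver"  -> "oracle.jdbc.OracleDriver"))
      .load()

    customers.show()
    sc.stop()
  }
}
```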
To test the application you can follow the procedure below. The execution assumes an Oracle instance is available; I'm testing it against an Oracle Docker instance, using `alexeiled/docker-oracle-xe-11g:latest`.
You can install the image with:

```
docker pull alexeiled/docker-oracle-xe-11g
```

and launch it with:

```
docker run -d -p 49160:22 -p 49161:1521 -p 49162:8080 \
  -v /Users/stefano/projects/spark-experiments/data:/test/data \
  -v /Users/stefano/projects/spark-experiments/src/test/resources:/test/cfg \
  alexeiled/docker-oracle-xe-11g
```
Alternatively, you can use the script `src/test/scripts/start-db-docker.sh`.
The file `src/main/resources/OracleDB.sql` contains the definition of the Customer table and generates some sample data. Execute it with a SQL client connected to your DB.
I'm using SQLDeveloper with the following connection configuration:

- Server: 192.168.59.103 (I'm running Docker with boot2docker, so you can get the IP with the command `boot2docker ip`)
- Port: 49161
- SID: xe
- Username/pwd: system/oracle
You can also create the table using the command line:

```
docker exec -i -t <docker-id> \
  bash -c 'export ORACLE_HOME=/u01/app/oracle/product/11.2.0/xe; export ORACLE_SID=XE; echo exit | /u01/app/oracle/product/11.2.0/xe/bin/sqlplus system/oracle @/test/cfg/OracleDB.sql'
```
Alternatively, you can use the script `src/test/scripts/init-db-docker.sh`, and stop the instance with `src/test/scripts/stop-db-docker.sh`.
You need to add the Oracle driver to your Spark installation, which means that every executor needs to have the Oracle driver `ojdbc6.jar` in its classpath. If you are using a standalone cluster, a good way to do that is to edit the `spark-defaults.conf` file, adding the following two lines:
```
spark.executor.extraClassPath=/Users/stefano/projects/spark-experiments/extra-libs/ojdbc6.jar
spark.driver.extraClassPath=/Users/stefano/projects/spark-experiments/extra-libs/ojdbc6.jar
```
You can then submit the assembly jar to the cluster (in this example, to a standalone cluster running on my laptop):

```
spark-submit \
  --master spark://galarragas.local:7077 \
  --class uk.co.pragmasoft.experiments.bigdata.spark.dbimport.CustomerCDCDataBaseImporter \
  SparkExperiments-assembly-1.0.jar \
  --rootPath file:/Users/stefano/projects/spark-experiments/data/cdc
```
Remember to build the jar with `sbt assembly` and not just with `sbt package`.