Setup scripts and documentation to integrate Spark into the Cobald/Tardis system
- Clone this repository including the submodules:

      git clone --recursive https://github.com/stwunsch/cobald-tardis-spark
- Install the required software. The `install.sh` script installs the required Python and Java software:

      cd cobald-tardis-spark/
      ./install.sh
- Set the configuration. Have a look at the `config.sh` file, set the correct values, and then run the `configure.sh` script:

      ./configure.sh
- Adapt the configuration in `hadoop-config/yarn-site.xml`: set `yarn.nodemanager.resource.cpu-vcores` to at least 2 and `yarn.nodemanager.resource.memory-mb` to at least 2500.
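The two properties above are set as `<property>` entries in `hadoop-config/yarn-site.xml`. A minimal sketch with the stated minimum values (any other properties already present in the file stay as they are):

```xml
<configuration>
  <!-- CPU cores the nodemanager offers to YARN; at least 2 -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
  </property>
  <!-- Memory in MB the nodemanager offers to YARN; at least 2500 -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2500</value>
  </property>
</configuration>
```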
- Go to the machine which should act as the master (aka the ResourceManager in YARN) and run:

      ./run-resourcemanager.sh

- Go to the machine which should act as the worker (aka the NodeManager in YARN) and run:

      ./run-nodemanager.sh
- Run the test script:

      ./test-spark.sh