Simple Maven archetype to bootstrap a Java-based Apache Spark project.
To create your project:
mvn archetype:generate \
-DgroupId=your.group \
-DartifactId=your_project \
-DarchetypeGroupId=fr.uha.ensisa.ff \
-DarchetypeArtifactId=spark-simple-archetype \
-DarchetypeVersion=0.0.8 \
-DinteractiveMode=false
To compile your project: mvn clean verify
To run locally (once compiled): java -jar target/your_project-*.jar
If you are running JDK ≥ 16: java --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.util.concurrent=ALL-UNNAMED --add-opens java.base/java.net=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED -jar target/your_project-*.jar
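For reference, a minimal entry point runnable this way could look like the following sketch (hypothetical code, not necessarily what the archetype generates; your.group and your_project are the placeholders used above):

package your.group;

import org.apache.spark.sql.SparkSession;

public class Main {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM on all available cores,
        // which is what makes a plain `java -jar` launch work
        // without any cluster.
        SparkSession spark = SparkSession.builder()
                .appName("your_project")
                .master("local[*]")
                .getOrCreate();

        // Trivial job to check that everything is wired up.
        long count = spark.range(1, 1000).count();
        System.out.println("count = " + count);

        spark.stop();
    }
}

Note that a master set in the builder takes precedence over the one passed to spark-submit, so for cluster runs you would typically omit the master(...) call (see the cluster-oriented sketch further below).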
You need to run commands in a bash shell (Windows users: use Git Bash).
First install Vagrant and VirtualBox.
The following fires up a 3-node cluster (the node count can be changed via the NODES environment variable, e.g. NODES=5 vagrant up).
git clone --depth=1 https://github.com/fondemen/docker-swarm-vagrant.git
cd docker-swarm-vagrant
vagrant up
until vagrant ssh node01 -c 'etcdctl cluster-health' ; do sleep 1; done
SWARM=on vagrant up
You might need to select an Ethernet interface to connect to; usually, the first one is best.
To reuse a previously started-then-stopped cluster:
cd docker-swarm-vagrant
vagrant up
If the above fails, you might need to run vagrant destroy -f and retry.
Please check the docker-swarm-vagrant project for more details on setting up the swarm cluster.
The easiest way is to install the scp plugin: vagrant plugin install vagrant-scp
Compile your project as explained above.
for m in $(vagrant status --machine-readable | grep ',state,' | cut -d, -f2); do vagrant scp ../your_project/target/your_project-*.jar $m:~/ana.jar; done
Not ready for production!
If you use the Vagrant approach, log in to the first node: vagrant ssh
docker network create -d overlay spark-nw
docker service create --name spark-master --replicas 1 --network spark-nw -e INIT_DAEMON_STEP=setup_spark -p 8080:8080 bde2020/spark-master:3.1.1-hadoop3.2-java11
docker service create --name spark-worker --mode global --network spark-nw -e "SPARK_MASTER=spark://spark-master:7077" --limit-memory 1280M bde2020/spark-worker:3.1.1-hadoop3.2-java11
The master (and only the master) is visible at http://192.168.59.99:8080. To also see worker activity, use a proxy such as Squid:
docker service create --name squid --replicas 1 --network spark-nw -p 3128:3128 chrisdaish/squid
Then configure your browser to use an HTTP proxy server with host 192.168.59.99 and port 3128.
If you use the Vagrant approach, you must have uploaded your Spark program and logged in to node01 using vagrant ssh.
We assume here that your Spark analysis is available at ./ana.jar.
Set the main class in the following command (replace [your.group.Main] with your application's actual main class):
docker service create --name ana -p 4040:4040 --network spark-nw --restart-condition none -e ENABLE_INIT_DAEMON=false -e SPARK_APPLICATION_JAR_NAME=ana -e SPARK_APPLICATION_JAR_LOCATION=/app/ana.jar -e SPARK_APPLICATION_MAIN_CLASS=[your.group.Main] --mount source=$(pwd)/ana.jar,target=/app/ana.jar,type=bind bde2020/spark-submit:3.1.1-hadoop3.2-java11
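As an illustration of what the class referenced by SPARK_APPLICATION_MAIN_CLASS might look like, here is a cluster-friendly variant of the earlier sketch (again hypothetical; adapt it to your actual analysis). It leaves the master unset so that the spark-submit wrapper in the bde2020/spark-submit image decides where the job runs:

package your.group;

import org.apache.spark.sql.SparkSession;

public class Main {
    public static void main(String[] args) {
        // No .master(...) call here: spark-submit supplies the master
        // (spark://spark-master:7077 in the setup above).
        SparkSession spark = SparkSession.builder()
                .appName("ana")
                .getOrCreate();

        // Trivial job standing in for your real analysis.
        long count = spark.range(1, 1_000_000).count();
        System.out.println("count = " + count);

        spark.stop();
    }
}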
You can follow your application's driver logs: docker service logs -f ana
Once finished, delete the ana service: docker service rm ana
Exit node01 (exit command).
Once you are done, destroy the cluster: vagrant destroy -f