Run Spark in Docker containers. For more details, check out the wiki.
```sh
docker run --net=compose_default --rm -it -p 4040:4040 --name=spark-shell \
    mangalaman93/spark-shell:2.1.0 \
    --master mesos://zk://$HOST_IP:2181,$HOST_IP:2182,$HOST_IP:2183/mesos \
    --conf spark.mesos.executor.docker.image=mangalaman93/executor:2.1.0 \
    --conf spark.mesos.executor.home=/opt/spark \
    --conf spark.ui.port=4040
```
```scala
// load the README bundled inside the executor image
val textFile = sc.textFile("/opt/spark/README.md")
// find the largest number of words on any single line
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
```
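The `--conf` flags passed to spark-shell above map one-to-one to ordinary Spark properties. As an illustration only (not code from this repository), a minimal sketch of the same Mesos/Docker settings applied programmatically from a driver, assuming a placeholder app name and writing `<HOST_IP>` for the host's IP:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone driver; the configuration values below are copied from
// the spark-shell invocation above, with <HOST_IP> standing in for the host's IP.
val spark = SparkSession.builder()
  .appName("mesos-docker-example") // assumed name, not from the repo
  .master("mesos://zk://<HOST_IP>:2181,<HOST_IP>:2182,<HOST_IP>:2183/mesos")
  .config("spark.mesos.executor.docker.image", "mangalaman93/executor:2.1.0")
  .config("spark.mesos.executor.home", "/opt/spark")
  .config("spark.ui.port", "4040")
  .getOrCreate()

val sc = spark.sparkContext // the same entry point the shell exposes as `sc`
```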
```sh
docker run --net=compose_default --rm -it -p 4040:4040 -p 4042:4042 --name=dspark \
    -e DSPARK_PORT=4042 \
    -e DSPARK_LISTENING_IP=0.0.0.0 \
    -e SPARK_MASTER=mesos://zk://$HOST_IP:2181,$HOST_IP:2182,$HOST_IP:2183/mesos \
    -e spark_mesos_executor_docker_image=mangalaman93/executor:2.1.0 \
    -e spark_mesos_executor_home=/opt/spark \
    -e spark_local_dir=/opt/spark/tmp \
    -e spark_ui_port=4040 \
    -e spark_mesos_executor_docker_volumes=/opt/run/tmp:/opt/spark/tmp \
    -e spark_jars=/opt/dspark/dspark.jar \
    mangalaman93/dspark:0.1.0
```
```sh
docker run --rm -it -e HOST_IP=$HOST_IP --net=compose_default mangalaman93/hdfs:2.7.3 bash
hdfs dfs -fs=hdfs://$HOST_IP:9000 -put /anaconda-post.log /inputfile
curl -X POST -G http://$HOST_IP:4042/job \
    --data-urlencode "input=hdfs://$HOST_IP:9000/inputfile" \
    --data-urlencode "output=hdfs://$HOST_IP:9000/outputfile"
```
```sh
docker run --rm -it -e HOST_IP=$HOST_IP --net=compose_default mangalaman93/hdfs bash
hdfs dfs -fs=hdfs://$HOST_IP:9000 -ls /outputfile
hdfs dfs -fs=hdfs://$HOST_IP:9000 -cat /outputfile/part-00000
```
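The POST to `/job` above only passes an HDFS input path and an HDFS output path; the actual job logic ships pre-built in `/opt/dspark/dspark.jar` and is not shown here. Purely as an illustration, assuming a simple word count that writes `part-*` files like the one read back above, a job with that input/output contract could look like:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only -- the real job lives in dspark.jar, whose contents are
// not part of this README. It takes an input path and an output path, mirroring
// the parameters sent to the /job endpoint.
object ExampleJob {
  def main(args: Array[String]): Unit = {
    val Array(input, output) = args // e.g. hdfs://<HOST_IP>:9000/inputfile and /outputfile

    val spark = SparkSession.builder().appName("example-job").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile(input)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map { case (word, count) => s"$word\t$count" }
      .saveAsTextFile(output) // produces /outputfile/part-00000, part-00001, ...

    spark.stop()
  }
}
```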
- Avoid running the `shuffle` service in host network mode
- Avoid running the `namenode` and `datanode` services in host network mode
- Set up an HA HDFS cluster with at least 3 datanodes
- Spark-submit etc.
- Multinode setup
- Hostname lookup issue with Mesos
- Mount the Spark tmp directory as a volume only when necessary
- Use different private and public IPs (not just `HOST_IP`)