- Create Spark application from notebook
- Create PySpark job from local shell
- Use spark-submit to submit a Python job
- Use spark-submit to run an example Java Spark application
- Run Event2S3 pipeline on the docker cluster
- Build the fat jar with
mvn package
- Submit the application
spark-submit --master spark://localhost:7077 --executor-memory 512M target/sparkwordcount-0.1.0-SNAPSHOT.jar
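The same submit flow works for a Python job. The word-count logic inside the example jar can be sketched in plain Python (no cluster needed; names are illustrative) to show the flatMap → map → reduceByKey shape Spark runs:

```python
from collections import Counter

def word_count(lines):
    """Plain-Python sketch of Spark's classic word count:
    flatMap(split line into words) -> map(word, 1) -> reduceByKey(add)."""
    words = (word for line in lines for word in line.split())  # flatMap
    return Counter(words)  # map + reduceByKey collapsed into one step

counts = word_count(["to be or", "not to be"])
# counts["to"] == 2, counts["be"] == 2, counts["or"] == 1
```

In real PySpark the same shape would be `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`.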
- How to view application detail UI: http://localhost:4041/jobs/
- Spark master allocates 0 cores to application:
Hints: https://stackoverflow.com/questions/51318426/spark-standalone-application-gets-0-cores
Solution: set executor memory to 512m instead of the default 1G: pyspark --master spark://localhost:7077 --executor-memory 512M
- Exception: Python in worker has different version 3.7 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Solution: use python3.7 for driver program
- Install python3.7 if not installed already
pyenv install 3.7.3
- Create virtual env with python 3.7.3
pipenv --python 3.7.3 shell
- launch pyspark in this virtual env
pyspark --master spark://localhost:7077 --executor-memory 512M
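If switching the shell's Python via pyenv/pipenv is inconvenient, the same fix can be applied by pointing PySpark at a matching interpreter through the two environment variables the error message names (the interpreter path below is a placeholder, not a path from this setup):

```python
import os

# Point driver and workers at the same interpreter so PySpark's
# "different minor versions" check passes. Path is hypothetical;
# use the interpreter that matches your workers (here, a 3.7).
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/bin/python3.7"
```

These must be set before the SparkContext is created (e.g. in the shell profile or at the top of the driver script).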
- How to setup a standalone Spark3 cluster.
- How to connect local PySparkShell to standalone cluster.
- Spark Session vs Spark Context. (SparkSession is the higher-level abstraction for working with structured datasets; SparkContext is the lower-level API that deals directly with RDDs.)
- The Python versions of the driver and executor programs must match down to the minor version (e.g. a 3.8 driver with 3.7 workers fails), otherwise the executor throws an exception.
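The version check PySpark performs compares the major.minor pair (patch level may differ), which can be sketched as:

```python
def versions_compatible(driver, worker):
    """Sketch of PySpark's check: driver and worker Pythons must share
    the same (major, minor) pair; the patch component is ignored."""
    return driver[:2] == worker[:2]

versions_compatible((3, 7, 3), (3, 7, 9))  # True: both are 3.7
versions_compatible((3, 8, 0), (3, 7, 3))  # False: 3.8 driver vs 3.7 worker
```

The second case is exactly the 3.8-driver/3.7-worker mismatch from the exception above.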
- More workers can be added by executing the following command from a worker node:
bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out
- Setup Spark3 docker cluster: https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html
- Difference between DataFrame/Dataset vs RDD: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
- Use python3 for pyspark on macOS: https://stackoverflow.com/questions/30279783/apache-spark-how-to-use-pyspark-with-python-3