TODO

  • Create a Spark application from a notebook
  • Create a PySpark job from the local shell
  • Use spark-submit to submit a Python job
  • Use spark-submit to run an example Java Spark application
  • Run the Event2S3 pipeline on the Docker cluster

Submit a Java Pipeline to the Spark Standalone Cluster

  1. Build a fat JAR: mvn package
  2. Submit the application: spark-submit --master spark://localhost:7077 --executor-memory 512M target/sparkwordcount-0.1.0-SNAPSHOT.jar
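
If the JAR manifest does not declare a main class, spark-submit also needs --class; the class name below is an assumption based on the artifact name, not taken from the project:

spark-submit --class com.example.SparkWordCount --master spark://localhost:7077 --executor-memory 512M target/sparkwordcount-0.1.0-SNAPSHOT.jar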

Issues

  • The Spark master allocates 0 cores to the application.

Hint: https://stackoverflow.com/questions/51318426/spark-standalone-application-gets-0-cores

Solution: the master offers 0 cores when no worker has enough free memory for the requested executors, so set the executor memory to 512M instead of the default 1G: pyspark --master spark://localhost:7077 --executor-memory 512M
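
The same fix can be applied programmatically when building the session; a minimal PySpark sketch, assuming the standalone master at localhost:7077 and a placeholder app name:

from pyspark.sql import SparkSession

# Request 512m executors so the standalone master can place them on
# workers that have less than the default 1G of free memory.
spark = (SparkSession.builder
         .master("spark://localhost:7077")
         .appName("low-memory-app")  # placeholder name, not from the notes
         .config("spark.executor.memory", "512m")
         .getOrCreate())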

  • Exception: Python in worker has different version 3.7 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Solution: use Python 3.7 for the driver program.

  1. Install Python 3.7 if it is not already installed: pyenv install 3.7.3
  2. Create a virtual env with Python 3.7.3: pipenv --python 3.7.3 shell
  3. Launch pyspark inside this virtual env: pyspark --master spark://localhost:7077 --executor-memory 512M
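
Alternatively, pin the interpreter through the environment variables named in the exception before launching the shell; a sketch assuming python3.7 is on the PATH:

export PYSPARK_PYTHON=python3.7         # interpreter used by the executors
export PYSPARK_DRIVER_PYTHON=python3.7  # interpreter used by the driver
pyspark --master spark://localhost:7077 --executor-memory 512M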

Learning

  • How to set up a standalone Spark 3 cluster.
  • How to connect a local PySpark shell to the standalone cluster.
  • Spark session vs Spark context: SparkSession is the higher-level abstraction for working with structured datasets (DataFrames), while SparkContext is the lower-level API that deals directly with RDDs (see the sketch after this list).
  • The Python versions of the driver and the executors must match down to the minor version (e.g., both 3.7); otherwise the executor throws the exception above.
  • More workers can be added by executing the following command on each worker node:

bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out
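
To make the session-vs-context distinction concrete, a minimal PySpark sketch (master URL, app name, and sample data are assumptions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://localhost:7077")
         .appName("session-vs-context")
         .getOrCreate())

# SparkSession: high-level structured API, operates on DataFrames
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

# SparkContext: low-level API, operates directly on RDDs
sc = spark.sparkContext
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(rdd.reduceByKey(lambda x, y: x + y).collect())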

References

  • Set up a Spark 3 Docker cluster: https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html
  • Difference between DataFrame/Dataset and RDD: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  • Use Python 3 for pyspark on macOS: https://stackoverflow.com/questions/30279783/apache-spark-how-to-use-pyspark-with-python-3