- Visit the Apache Spark downloads page.
- Select the following options:
  - Choose a Spark release: 2.2.x or later (I'll be using 2.2.1)
  - Choose a package type: Pre-built for Apache Hadoop 2.7 and later
  - Download Spark: spark-2.2.1-bin-hadoop2.7.tgz
  Download the compressed tar file to your local machine.
- After downloading the compressed file, extract it to your desired location:
$ tar -xvzf spark-2.2.1-bin-hadoop2.7.tgz -C ~/tools/
The tarball contains a top-level directory, so this creates ~/tools/spark-2.2.1-bin-hadoop2.7.
- Setting up the environment for Spark: to set the environment variables, add the following lines to your ~/.bashrc:
export SPARK_HOME=/Users/nipunsadvilkar/tools/spark-2.2.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
Make sure you change the SPARK_HOME path to match where your Spark files are located. Then reload your ~/.bashrc file using:
$ source ~/.bashrc
- That's all! Spark is now set up. Try running the pyspark command to use Spark from Python.
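As a quick sanity check, you can run a small job directly in the pyspark shell. This is just a minimal sketch; it relies on the Spark 2.x pyspark shell pre-creating a SparkContext named sc for you:
# `sc` is the SparkContext that the pyspark shell creates automatically.
rdd = sc.parallelize(range(100))
print(rdd.sum())  # sums 0 through 99, so this should print 4950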
There are two methods to use PySpark with Jupyter Notebook:
- Configure the PySpark driver. Update the PySpark driver environment variables by adding these lines to your ~/.bashrc (or ~/.zshrc) file:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Restart your terminal and launch PySpark again:
$ pyspark
Now this command should start a Jupyter Notebook in your web browser.
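Once the notebook opens, create a new notebook and try a small example. This is an illustrative sketch rather than part of the setup: in Spark 2.x the driver pre-creates a SparkSession named spark, and the column names and values below are arbitrary:
# `spark` (a SparkSession) is pre-created by the pyspark driver,
# so the notebook needs no imports or initialization.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()  # displays the two rows as a small table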
- Using the findspark module. The findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too. To install findspark:
$ pip install findspark
Whether you are in a Jupyter notebook or a plain Python script, all you need to do to use Spark is add the following lines to your code:
import findspark
findspark.init()
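Putting it together, here is a minimal standalone script. The app name findspark-demo is a placeholder of my choosing; findspark.init() with no arguments locates Spark through the SPARK_HOME variable set earlier:
import findspark
findspark.init()  # locates Spark via the SPARK_HOME environment variable

# pyspark becomes importable only after findspark.init() has run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("findspark-demo").getOrCreate()
print(spark.range(10).count())  # should print 10
spark.stop()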