A short course on the new, experimental features of Apache Spark 2.0 by The Data Incubator and O'Reilly Strata. You can purchase the accompanying videos here on the O'Reilly website.
To run this tutorial, you need Apache Spark and Jupyter. Install them as follows:
- Download and install Apache Spark 2.0.0 by following the instructions here. You may first have to install Hadoop.
- Install Jupyter
pip install jupyter
To run the interactive code cells, create an Apache Toree kernel:
jupyter toree install --spark_opts='--master=local[2] --executor-memory 4g --driver-memory 4g' \
--kernel_name=apache_toree --interpreters=PySpark,SparkR,Scala,SQL --spark_home=$SPARK_HOME
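The kernel command above relies on `$SPARK_HOME` pointing at your Spark installation. As a rough sanity check before creating the kernel, a small Python sketch such as the following can confirm that the variable is set and that the usual launcher scripts are present. The function name `spark_home_problems` is made up for this illustration, and the check assumes a standard Spark binary distribution layout with launchers under `bin/`.

```python
import os
from pathlib import Path


def spark_home_problems(spark_home):
    """Return a list of problems found with a SPARK_HOME value (empty if none)."""
    if not spark_home:
        return ["SPARK_HOME is not set"]
    home = Path(spark_home)
    if not home.is_dir():
        return [f"SPARK_HOME does not point to a directory: {home}"]
    problems = []
    # Assumption: a standard Spark distribution ships these launchers under bin/.
    for launcher in ("spark-shell", "spark-submit"):
        candidate = home / "bin" / launcher
        if not candidate.is_file():
            problems.append(f"missing {candidate}")
    return problems


if __name__ == "__main__":
    issues = spark_home_problems(os.environ.get("SPARK_HOME"))
    print("SPARK_HOME looks usable" if not issues else "\n".join(issues))
```

If the script reports problems, fix `SPARK_HOME` first and then rerun the `jupyter toree install` command.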
Alternatively, you can copy and paste the cells into a Spark shell, which you can start by running
make spark-shell
To start the course, run
make notebook
and open the Overview.ipynb notebook. The notebook is normally served on port 9000; if that port is already in use, it may be served on a higher port number instead.
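If you are unsure whether port 9000 is already taken, a quick check along these lines can tell you before you start the notebook. This is only a sketch: the helper name `port_in_use` is hypothetical, and it assumes the server would be reachable on the local loopback address.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. a server is listening.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(9000):
        print("Port 9000 is busy; expect the notebook on a higher port.")
    else:
        print("Port 9000 appears to be free.")
```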
If you want to play with Spark directly, you can also run
make spark-shell
Credits: The Spark project template is based on https://github.com/nfo/spark-project-template