These are code files and data made for, or provided by, the Spark And Python For Big Data With Pyspark course by Pierian Data / Jose Portilla on Udemy.
Please respect the Honor Code by not copying these solutions. These files serve as a backup and personal reference whenever and wherever I need them, and as a showcase of technical development.
For all other purposes, files created by myself are under the MIT License.
Setup instructions:
- Install VirtualBox.
- Get the .iso for ubuntu-18.04.1-desktop-amd64; it is the LTS version.
- Once the installation finishes, do:
$ sudo apt-get update
$ sudo apt-get upgrade
- Set up the git repository with SSH key access:
$ sudo apt-get install git
- Read https://docs.gitlab.com/ee/ssh/#adding-an-ssh-key-to-your-gitlab-account
$ git clone git@gitlab.com:zubie7a/Spark_And_Python_For_Big_Data
$ git config --global user.name 'Santiago Zubieta'
$ git config --global user.email 'santiago.zubieta@gmail.com'
- In the top bar, select "Insert Guest Additions CD Image".
- Install it, then set Shared Clipboard and Drag and Drop to Bidirectional.
- Create a Shared Folder for ease of file transfer.
- To access the Shared Folder without being asked for a password every time, do:
$ sudo adduser zubieta vboxsf
- Check the Python 3 version (ideally 3.6.5):
$ python3 -V
- Install pip for Python 3:
$ sudo apt-get install python3-pip
- Check the Pip version with:
$ pip3 -V
- Install Jupyter
$ pip3 install jupyter --user
- Install Pipenv (just in case):
$ pip3 install pipenv --user
- Install Py4j
$ pip3 install py4j --user
- Install Java
$ sudo apt-get install openjdk-8-jre
- Remember that it must be Java 8 for all of this to work.
- The path will be /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java
- Check the Java version:
$ java -version
openjdk version "1.8.0_181"
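- Optionally, the same check can be done from Python (a minimal sketch; it just shells out to java, whose version report goes to stderr):
import subprocess
# java -version writes its report to stderr, not stdout
result = subprocess.run(['java', '-version'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
report = result.stderr.decode()
print(report)
# Spark 2.3 expects Java 8, i.e. a version string starting with 1.8
assert '1.8' in report, 'Expected Java 8'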
- Install Scala
$ sudo apt-get install scala
$ scala -version
Scala code runner version 2.11.12
- Download Spark: spark-2.3.1-bin-hadoop2.7.tgz
- Decompress it:
$ sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
- Give permissions to all necessary folders:
$ sudo chmod 777 spark-2.3.1-bin-hadoop2.7/
$ cd spark-2.3.1-bin-hadoop2.7/
$ sudo chmod 777 python/
$ cd python/
$ sudo chmod 777 pyspark/
- Add the pip local path to PATH, and set the environment variables for Spark (e.g. in ~/.bashrc, then reload the shell):
export PATH="~/.local/bin:$PATH"
export SPARK_HOME='/home/zubieta/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHONPATH=python3
export PYSPARK_PYTHON=python3
export JAVA_HOME='/usr/lib/jvm/java-1.8.0-openjdk-amd64/'
export PATH=$SPARK_HOME:$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
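- To quickly confirm from Python that the new variables are visible (a minimal sketch; the expected values are the ones exported above):
import os
# These should match the exports above; adjust if your paths differ
print(os.environ.get('SPARK_HOME'))      # /home/zubieta/spark-2.3.1-bin-hadoop2.7
print(os.environ.get('JAVA_HOME'))       # /usr/lib/jvm/java-1.8.0-openjdk-amd64/
print(os.environ.get('PYSPARK_PYTHON'))  # python3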
- Open a notebook with
$ jupyter notebook
- Create a new Jupyter Notebook:
- Try importing pyspark inside the notebook:
import pyspark
- If that doesn't work, you need to launch the notebook from inside spark-2.3.1-bin-hadoop2.7/python/.
- Otherwise, you need FindSpark to be able to import Spark from any other location.
- Install FindSpark:
$ pip3 install findspark --user
- To find Spark, do this in the Python notebooks:
import findspark
findspark.init('/home/zubieta/spark-2.3.1-bin-hadoop2.7')
import pyspark
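- As a final sanity check, a minimal PySpark session can be created and a tiny DataFrame displayed (a sketch; the app name and sample data are arbitrary):
import findspark
findspark.init('/home/zubieta/spark-2.3.1-bin-hadoop2.7')
from pyspark.sql import SparkSession
# Build (or reuse) a local SparkSession
spark = SparkSession.builder.appName('setup_check').getOrCreate()
# Create a tiny DataFrame and show it to confirm the Python/JVM bridge works
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
df.show()
spark.stop()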