1. You should have Java 8 installed, otherwise:
- sudo add-apt-repository ppa:webupd8team/java
- sudo apt-get update
- sudo apt-get install oracle-java8-installer
2. You should have Anaconda installed, otherwise:
- download the installer from https://www.anaconda.com/download and run it, e.g. bash Anaconda*-Linux-x86_64.sh
3. You should have the spark-sklearn package installed (see the usage sketch after this list), otherwise:
- pip install spark-sklearn
4. You should have MySQL installed, otherwise:
- sudo apt-get install mysql-server
5. You should have the MySQL Connector/J JAR file available, otherwise:
- download it from https://dev.mysql.com/downloads/connector/j/ and unpack it (the paths below assume /home/ubuntu/mysql-connector-java-5.1.39/)
6. You should have Spark 2.0 installed, otherwise:
- wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
- tar -xzf spark-2.0.0-bin-hadoop2.7.tgz
- mv spark-2.0.0-bin-hadoop2.7 /home/ubuntu/spark (the path the steps below assume)
7. If you'll be running Jupyter Notebook from an EC2 instance, you should follow these steps:
7.0 Make sure the Security Group associated with your EC2 instance has the following rules:
- SSH with source Anywhere
- HTTPS with source Anywhere
- Custom TCP Rule with Port 8888 and source Anywhere
7.1. Generate your own SSL certificate (a quick validity check follows this list)
- mkdir certificates
- cd certificates
- openssl genrsa -out server.key 1024
- openssl req -new -key server.key -out server.csr
- openssl x509 -req -days 366 -in server.csr -signkey server.key -out server.crt
- cat server.crt server.key > server.pem
7.2. Create the Jupyter Notebook config file (a password tip follows this list)
- jupyter notebook --generate-config
- cd ~/.jupyter
- vi jupyter_notebook_config.py and add the following lines at the top of the file:
- c = get_config()
- c.IPKernelApp.pylab = 'inline'
- c.NotebookApp.certfile = '/home/ubuntu/certificates/server.pem'
- c.NotebookApp.ip = '*'
- c.NotebookApp.open_browser = False
- c.NotebookApp.port = 8888
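Before pointing Jupyter at the certificate from step 7.1, you can sanity-check that the concatenated server.pem is loadable. A minimal sketch using Python's standard ssl module; the path assumes the certificates directory created above:

```python
# Minimal check that server.pem from step 7.1 is loadable (assumes
# it sits in ~/certificates as created above).
import os
import ssl

pem = os.path.expanduser('~/certificates/server.pem')

# load_cert_chain raises ssl.SSLError if the concatenated
# certificate/key pair is malformed or the key doesn't match.
ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
ctx.load_cert_chain(pem)
print('certificate and key look OK')
```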
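Because the notebook from step 7.2 listens on all interfaces, you may also want to set c.NotebookApp.password in the same config file. The classic Jupyter Notebook ships a helper for generating the hash; a short sketch (setting a password is optional, not part of the original steps):

```python
# Generate a password hash for c.NotebookApp.password (recommended
# when the notebook listens on ip = '*').
from notebook.auth import passwd

# Prompts for a password twice and returns a hash like 'sha1:...';
# paste the result into jupyter_notebook_config.py as:
#   c.NotebookApp.password = u'sha1:...'
print(passwd())
```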
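As for the spark-sklearn package from step 3, its main entry point is a drop-in replacement for scikit-learn's GridSearchCV that fans the parameter search out over Spark. A minimal sketch, assuming a live SparkContext sc such as the one pyspark provides; the iris data and parameter grid are just illustrations:

```python
# Distribute a scikit-learn grid search over Spark with spark-sklearn.
# Assumes a live SparkContext `sc`, e.g. the one pyspark creates.
from sklearn import datasets, svm
from spark_sklearn import GridSearchCV

iris = datasets.load_iris()
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Same API as sklearn's GridSearchCV, plus a leading SparkContext:
# each parameter combination is fitted as a separate Spark task.
clf = GridSearchCV(sc, svm.SVC(), param_grid)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
```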
1. Apache Spark - you have to add packages/jars so Spark can handle XML and JDBC sources (a read sketch follows this list)
- cd /home/ubuntu/spark/conf
- cp spark-defaults.conf.template spark-defaults.conf
- vi spark-defaults.conf
- spark.jars.packages com.databricks:spark-xml_2.11:0.4.0
- spark.jars /home/ubuntu/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar
2. Environment Variables - you have to add these variables, so you can easily run PySpark as a Jupyter Notebook
- vi ~/.bashrc
- export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
- export SPARK_HOME="/home/ubuntu/spark"
- export PATH="$SPARK_HOME/bin:$SPARK_HOME:$PATH"
- export PYSPARK_DRIVER_PYTHON="jupyter" (add this and the next line if pyspark doesn't already open a notebook on your image)
- export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
- source ~/.bashrc
3. PySpark - you have to run PySpark from the class folder (see the sanity check after this list):
- cd DSR-Spark-Class
- pyspark OR nohup pyspark (nohup keeps the notebook server alive after you log out)
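With the spark-xml package and the MySQL Connector/J JAR registered in spark-defaults.conf (step 1), the SparkSession that pyspark creates can read both sources. A minimal sketch; the file name, rowTag, database, table, and credentials below are placeholders, not part of the class materials:

```python
# Read an XML file with the spark-xml package; rowTag selects which
# element becomes a row. 'books.xml' and 'book' are placeholders.
df_xml = (spark.read
          .format('com.databricks.spark.xml')
          .option('rowTag', 'book')
          .load('books.xml'))

# Read a MySQL table over JDBC via the Connector/J JAR from step 1.
# Host, database, table, user and password are placeholders.
df_jdbc = (spark.read
           .format('jdbc')
           .option('url', 'jdbc:mysql://localhost:3306/testdb')
           .option('driver', 'com.mysql.jdbc.Driver')
           .option('dbtable', 'mytable')
           .option('user', 'root')
           .option('password', 'secret')
           .load())

df_xml.printSchema()
df_jdbc.show(5)
```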
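Once the notebook is up, a first cell can confirm that the environment variables from step 2 took effect and that the session actually runs jobs. A minimal sanity check; the spark variable is the session pyspark creates for you:

```python
# First-cell sanity check inside the pyspark-launched notebook.
import os

print(os.environ['SPARK_HOME'])   # should print /home/ubuntu/spark
print(spark.version)              # should start with 2.0

# Trivial distributed job to confirm executors respond.
print(spark.range(1000).count())  # 1000
```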
1. Go to the EC2 menu
1.1 Click on Launch Instance
1.2 Look for the AMI ID ami-7aa74302 in Community AMIs (Oregon region)
1.3 When prompted, create a new key pair - download it and keep it safe!
1.4 When prompted, create a new security group with the following rules:
- SSH with source Anywhere
- HTTPS with source Anywhere
- Custom TCP Rule with Port 8888 and source Anywhere
1.5 After your instance is ready, you can SSH into it:
- ssh -i your-key.pem ubuntu@<your-instance-public-DNS>
1.6 Update and then install Git
- sudo apt-get update
- sudo apt-get install git
1.7 Clone the repository and run PySpark, as in the "Class Materials" section