- Set up two EC2 Ubuntu (16.04 or 18.04) instances with all default settings (note: 1CPU 1GB t2.micro is a free trial).
- Generate new or use already existing public key
- After succesfull initialazing one need to configure Security Groups (go to NETWORK&SECURITY) and add new or edit already existing rule.
- SSH, port: 22, source: anywhere
- HTTPS, port: default, source: anywhere
- TCP, port: 8888, source: anywhere
After succesfull initialazing EC2 instances one need to connect to them through SSH. Personally I do it everytime using Putty:
- Host name: ubuntu@public-DNS-of-you-ec2-machine
- Remember of generating a private key (you can use PuttyGen to generate a private key from a public one) and loading it in section (Connection -> SSH -> Auth -> Private key file ...)
- Set tunneling (Connection -> SSH -> Tunnels):
- source: 8000, destination: localhost:8888
- source: 8001, destination: localhost:8080
When in ubuntu terminal type following command:
$ wget http://ftp.man.poznan.pl/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
then unpack the file using tar xvzf file-name command. Then install Java and Scala:
$ sudo apt-get -y install openjdk-8-jdk-headless
$ sudo apt-get install scala
Now let's move to Apache Spark configurations and copy the spark-env.sh file from the template using below commands:
$ cd spark-2.4.4-bin-hadoop2.7/conf/
cp spark-env.sh.template spark-env.sh
Then use vim editor in order to paste below code at the beggining of the spark-env.sh file:
export JAVA_HOME=/usr export SPARK_PUBLIC_DNS=”current_node_public_dns”
When in current path, create a new 'slaves' file:
$ touch slaves
then use vim editor to paste public DNS of slave mode into the file.
Download latest version of Anaconda and use bash command to run installation script.
$ wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh
$ bash Anaconda3–4.1.1-Linux-x86_64.sh
Jupyter certifiation configuration using below commands and answering several questions:
$ jupyter notebook --generate-config
$ mkdir certs
$ cd certs
$ sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
Last step is to configure the jupyter notebook, go to the jupyter_notebook_config.py and use vim editor.
$ cd ~/.jupyter/
$ vim jupyter_notebook_config.py
At ther very beggining of the file paste the following script:
c = get_config()
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
Run jupyter notebook simply by using command:
$ jupyter notebook
Jupyter panel is now available by entering following address in the browser:
https://public-dns-of-your-ec2-machine:8888/
Start master and slave nodes by running following commands. Starting slave node needs pasting spark master url at the end of the line.
$ bash/spark-2.4.4-bin-hadoop2.7/sbin/start-master.sh
$ bash/spark-2.4.4-bin-hadoop2.7/sbin/start-slave.sh spark://ip.[...].ec2.internal:7077
The spark master panel is available using localhost:8001 in the browser.
In order to stop nodes execute recent commans with changing 'start' to 'stop' in the command line.