Automated Environment for Machine Learning with Python and Scala

The project creates and deploys a virtual environment that can be used for machine learning with Python and Scala. The coding is done locally in a Jupyter Notebook.

The setup is intended to be as automatic as possible.

Puppet was chosen to provision the virtual machine (VM) with the packages needed for the machine learning task.
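
As an illustration, a site.pp along the following lines could pull in the stack. This is a hypothetical sketch, not necessarily the exact manifest shipped in this repo (package names assume Ubuntu Trusty):

# site.pp -- hypothetical sketch of the provisioning logic
node default {
  # system packages for the Python/Scala stack (Ubuntu Trusty names assumed)
  package { ['python-pip', 'python-dev', 'openjdk-7-jdk', 'scala']:
    ensure => installed,
  }

  # Jupyter itself is installed through pip
  package { 'jupyter':
    ensure   => installed,
    provider => 'pip',
    require  => Package['python-pip'],
  }
}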

The task was solved using two approaches:

  • running the calculations on the local machine, with VirtualBox and Vagrant as additional packages
  • utilizing cloud resources

Given a larger time slot for the task, I could imagine writing the Puppet code more carefully and making it more versatile, covering different Linux distributions; currently it works on Ubuntu/Trusty64. Another option would be to create a mirror for the packages and set up a proxy server in case the public repositories become unavailable.
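
For the proxy idea, a small sketch of what that could look like in Puppet, assuming a hypothetical local caching proxy at apt-cache.internal:3142 (the host name and port are illustrative only):

# Hypothetical: point apt at a local caching proxy so provisioning survives
# outages of the public Ubuntu repositories.
file { '/etc/apt/apt.conf.d/01proxy':
  ensure  => file,
  mode    => '0644',
  content => "Acquire::http::Proxy \"http://apt-cache.internal:3142\";\n",
}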

Information for Data Engineers:

The Virtual Machine (VM) provisioning can be done in several ways. The Puppet scripts provided will work with the following two scenarios.

Scenario I. A local workstation as the host machine, the VM as a guest.

One could use a local workstation with VirtualBox and Vagrant installed (Vagrant must come from the Vagrant site, not from the Linux distribution repo!); follow the installation instructions on the respective sites. This setup was tested with Vagrant 1.8.6 and VirtualBox 5.1.8 on an Ubuntu Xenial host.
Then, to launch the VM, execute the following steps:

  1. git clone https://github.com/raalesir/automated_environment.git That will clone the repo into your working directory. The repo contains:
alexey@alexey-iMac:~/Projects/combient$ tree -L 3
.
├── manifests
│   └── site.pp
├── modules
│   └── dependencies
│       └── manifests
├── README.md
└── Vagrantfile
  2. The Vagrantfile from the repo should contain the following entries:
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.provider "virtualbox" do |vb|
    vb.memory = "4096" # the more the better...
  end

  # Puppet is not preinstalled in the box, so install it first
  config.vm.provision "shell",
    inline: "sudo apt-get install -y puppet-common"

  config.vm.provision "puppet" do |puppet|
    puppet.manifests_path = "manifests"
    puppet.module_path    = "modules"
    puppet.manifest_file  = "site.pp"
  end
end
  3. vagrant box add ubuntu/trusty64
    This will download the Vagrant box, so you can bring it up in the next step.

  4. vagrant up
    That will provision the VM and install all dependencies according to the instructions.

  5. After all the provisioning is finished, you can vagrant ssh into the VM and run $ jupyter notebook. That will launch Jupyter.

  6. Decide what to do about the firewall on the host machine: either turn it off, or open at least port 8887. Port 8888 inside the VM should be open by default.

  7. On the host machine open another terminal tab, go to the Vagrantfile directory and issue:
    ssh -i .vagrant/machines/default/virtualbox/private_key -N -f -L localhost:8887:localhost:8888 -p 2200 vagrant@localhost, where:

  • .vagrant/machines/default/virtualbox/private_key is the SSH key created by Vagrant,
  • -p 2200 is the port to ssh to the VM (it could be 2222, depending on your setup),
  • 8888 and 8887 are the Jupyter ports inside the VM and on the host machine, respectively.
  8. Launch a Web browser on the host machine and point it to http://localhost:8887

  9. You should find yourself in the Jupyter GUI, where you can start uploading the notebook and source files.

I used this approach until I reached line 4 or 5 in the notebook; after that the memory demands became too severe for my workstation. Use this approach only if you have 8-16 GB of RAM; otherwise switch to Scenario II.

Scenario II. Using cloud resources, like EC2.

Here you will use an already existing VM at EC2. EC2 uses Xen as its virtualizer, so VirtualBox will not work there; in other words, you cannot simply substitute the EC2 node for the local workstation from Scenario I.

Luckily, we can still go with the repo you just cloned.

  1. ssh -X -i ACE_Challenge.pem ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com
    After logging in to /home/ubuntu type:
  2. sudo apt-get install -y git && git clone https://github.com/combient/Challenge_Alexey_S.git
    (That asks for a username and password. I guess that is because the repo is private...)
  3. sudo apt-get install -y puppet-common
  4. cd Challenge_Alexey_S && sudo puppet apply --modulepath=/home/ubuntu/Challenge_Alexey_S/modules manifests/site.pp
    This will tell Puppet to apply the rules from its scripts onto the current machine, i.e. onto the EC2 node.
  5. Copy the source files and the test notebook to $HOME/Challenge_Alexey_S:
    scp -i ACE_Challenge.pem test.csv.gz train.csv.gz ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com:/home/ubuntu/Challenge_Alexey_S or use the Jupyter notebook GUI later.
  6. Decide what to do about the firewall. Either turn it off, or open at least port 8888.
  7. If $ env|grep SPARK_HOME shows that SPARK_HOME is not set, log out from the VM and repeat step 1: exit and ssh -X -i ACE_Challenge.pem ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com,
    or execute source /etc/profile inside the VM (see the sketch below for how the variable is expected to end up there).
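
As a hedged illustration of why the re-login helps: the Puppet code could export SPARK_HOME through a profile snippet roughly like the following (the /opt/spark path and the file name are assumptions, not necessarily what this repo's module does):

# Hypothetical sketch: export SPARK_HOME system-wide via /etc/profile.d,
# so the variable appears only in fresh login shells (or after sourcing the profile).
file { '/etc/profile.d/spark.sh':
  ensure  => file,
  mode    => '0644',
  content => "export SPARK_HOME=/opt/spark\nexport PATH=\$PATH:\$SPARK_HOME/bin\n",
}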

If the Jupyter installation went well, we can try to launch the notebook:

  1. ubuntu@ip-172-31-20-22:~$ jupyter notebook
    That should bring you to some ASCII GUI.
  2. Make sure that port 8887 is not already in use, i.e. $ lsof -i :8887 returns nothing. Otherwise change it to something else above 1024 and test again.
  3. If everything went well, the Jupyter server is running and we would like to connect to it from the local machine. To do that, one should use local port forwarding. Jupyter will by default run at http://localhost:8888/ (inside the EC2 node), so we can use 8887 on our local machine:

ssh -i /home/alexey/Downloads/ACE_Challenge.pem -N -f -L localhost:8887:localhost:8888 ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com

  4. Now we can open the browser on the local machine and enter localhost:8887. That should bring us to http://localhost:8887/tree#notebooks.

  5. Upload a notebook if necessary, or create one if needed. Upload/delete data files with the GUI, etc.

Information for Data Scientists:

Please estimate the memory consumption of your tasks and provide it to the data engineers, so they know what resources to allocate. It would also be great if you told them which packages, with versions, will be needed, especially the ones installed by pip. These can be a source of trouble, due to potential differences in syntax between versions and due to inter-package dependencies that have to be satisfied.
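
With that information, the data engineers can pin the versions directly in the Puppet code. A small hypothetical sketch (package names and versions are just examples, not what the notebook actually requires):

# Hypothetical: pin pip-installed packages to the versions the data scientists ask for
package { 'numpy':
  ensure   => '1.11.2',
  provider => 'pip',
}
package { 'pandas':
  ensure   => '0.19.1',
  provider => 'pip',
}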

As soon as the infrastructure is ready, you should do the following:

  1. Ask how to access the machine with the installed infrastructure, i.e. something like:
    ssh -X -i ACE_Challenge.pem ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com or vagrant ssh, depending on the setup.
  2. After login start the Jupyter:
    ubuntu@ip-172-31-20-22:~$ jupyter notebook
    That should bring you to some ASCII GUI. Just look at it.
  3. From the another terminal tab on the local machine execute: ssh -i ACE_Challenge.pem -N -f -L localhost:8887:localhost:8888 ubuntu@ec2-52-212-62-56.eu-west-1.compute.amazonaws.com
    or something like:
    ssh -i .vagrant/machines/default/virtualbox/private_key -N -f -L localhost:8887:localhost:8888 -p 2200 vagrant@localhost depending on the setup
  4. Start up your local Web Browser and point it to:
    http://localhost:8887/tree#notebooks
  5. The notebook has an intuitive GUI for creating/modifying files and notebooks.
  6. One can upload source files, like test.csv.gz and train.csv.gz, from the local machine to the EC2 node directly from the Jupyter Web interface.