# Data Science box (DSbox)

This is a Linux (Ubuntu) box deployed with Vagrant that includes the following data science apps:
- Spark 1.5.2: one master node and up to 9 slaves.
- Jupyter 4.0.6 (IPython 4.0.1): kernels for Python 2 & 3, R, and Scala 2.10. It also includes RISE, test_helper, and IPython-extensions.
- Python 2 and 3.
- R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree".
- RStudio Server v0.99.491.
- Java JDK 7 (1.7.0_91).
- Scala 2.10.
- Zeppelin 0.5.5.
- scikit-learn 0.17: for Python 2 and 3.
- TensorFlow 0.6.0: for Python 2 and 3, but ONLY for 64-bit systems.
It has been successfully tested on both `ubuntu/trusty32` and `ubuntu/trusty64` systems.
## Pre-deployment steps

To install the box, follow these steps:
- Install VirtualBox: if you use any other provider, you must change the `provider` parameter in the `Vagrantfile`.
- Install Vagrant.
- Install Git.
- Clone this repository to a specific folder:

```shell
$ git clone https://github.com/mcolebrook/dsbox.git <YOUR_BOX_FOLDER>
```
Remark: some Windows users reported line-ending issues (see GitHub Help) after cloning and starting up the box. To fix this problem, BEFORE cloning the box, just type:

```shell
$ git config --global core.autocrlf input
```
## Config parameters

Go to `<YOUR_BOX_FOLDER>` and edit the `Vagrantfile` to change the parameters:
| Parameter | Description | Default value |
|---|---|---|
| provider | VM provider | "virtualbox" |
| boxMaster | OS in master node | "ubuntu/trusty32" |
| boxSlave | OS in slave nodes | "ubuntu/trusty32" |
| masterRAM | Master's RAM in MB | 3072 |
| masterCPU | Master's CPU cores | 2 |
| masterName | Name of the master node used in `scripts/spark-env.sh` | "spark-master" |
| masterIP | Private IP of the master node | "10.20.30.100" |
| slaves | Number of slaves | 2 (max 9) |
| slaveRAM | Slave's RAM in MB | 2048 |
| slaveCPU | Slave's CPU cores | 2 |
| slaveName | Base name for slave nodes | "spark-slave" |
| slavesIP | Base private IP for slave nodes | "10.20.30.10" |
| IPythonPort | IPython/Jupyter port to forward (set in the Jupyter/IPython config file) | 8001 |
| SparkMasterPort | SPARK_MASTER_WEBUI_PORT | 8080 |
| SparkWorkerPort | SPARK_WORKER_WEBUI_PORT | 8081 |
| SparkAppPort | Spark app web UI port | 4040 |
| RStudioPort | RStudio Server port | 8787 |
| ZeppelinPort | Zeppelin port (its default, 8080, conflicts with Spark) | 8888 |
| SlidesPort | Port for `jupyter-nbconvert <file.ipynb> --to slides --post serve` | 8000 |
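As a concrete illustration of how these defaults fit together, the loop below enumerates the cluster layout they imply. Note that the IP scheme is my own reading of the defaults (each slave's private IP appears to be `slavesIP` with the slave index appended, giving 10.20.30.101, 10.20.30.102, ...); check the `Vagrantfile` for the authoritative logic.

```shell
# Enumerate the cluster layout implied by the default parameters.
# Assumption: slave i gets the base IP with i appended, e.g. 10.20.30.101.
MASTER_NAME="spark-master"; MASTER_IP="10.20.30.100"
SLAVE_NAME="spark-slave";   SLAVES_IP_BASE="10.20.30.10"; SLAVES=2

echo "$MASTER_NAME $MASTER_IP"
for i in $(seq 1 "$SLAVES"); do
  echo "$SLAVE_NAME-$i $SLAVES_IP_BASE$i"
done
```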
## Starting up and shutting down the cluster

You can start up the cluster in several ways.

### Deploy the master and all the slaves

To deploy the cluster with one master node and two slave nodes (the default):

```shell
$ vagrant up
```
Bear in mind that the whole process (bringing the master and slaves up and provisioning them) may take several minutes! On my Intel Core i7-4790 CPU (4 cores @ 3.60 GHz) with 32 GB RAM, I got the following times:
Master:

```shell
==> spark-master: END provisioning 2016/**/** **:**:**
==> spark-master: TOTAL TIME: 788 seconds
```

Slaves:

```shell
==> spark-slave-1: END provisioning 2016/**/** **:**:**
==> spark-slave-1: TOTAL TIME: 228 seconds
```
### Deploy only the master

In case you only want to deploy the master node:

```shell
$ vagrant up spark-master
```
### Halt the cluster

To shut down the whole cluster:

```shell
$ vagrant halt
```
### Halt only the master node

If you only want to halt the master node:

```shell
$ vagrant halt spark-master
```
### Delete the whole cluster (master + slaves)

In case you want to delete the whole cluster:

```shell
$ vagrant destroy
```
## Start/Stop Spark

To start up the Spark cluster (master + slaves):

```shell
$ vagrant ssh spark-master
...
$ $SPARK_HOME/sbin/start-all.sh
```

You can also start up the cluster from the host machine by typing:

```shell
$ vagrant ssh spark-master -c "bash /opt/spark/sbin/start-all.sh"
```

To halt the cluster, just run `stop-all.sh`.
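The two directions above can be wrapped in a small host-side helper. This is only a sketch of my own (the `spark_ctl` function and its `DRY_RUN` switch are not part of the repository); with `DRY_RUN=1` it just prints the command it would run, which is handy for checking before touching the VM:

```shell
# Hypothetical host-side helper around Spark's start-all.sh / stop-all.sh.
# Usage: spark_ctl start | stop
spark_ctl() {
  case "$1" in
    start|stop) cmd="bash /opt/spark/sbin/$1-all.sh" ;;
    *) echo "usage: spark_ctl start|stop" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "vagrant ssh spark-master -c \"$cmd\""   # show the command only
  else
    vagrant ssh spark-master -c "$cmd"            # actually run it in the VM
  fi
}

DRY_RUN=1 spark_ctl start
# prints: vagrant ssh spark-master -c "bash /opt/spark/sbin/start-all.sh"
```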
Remember that you can access the Spark web UIs through the ports listed in the config table above (8080 for the master, 8081 for the workers, and 4040 for the app, by default).
## Starting Jupyter

The best way to start the Jupyter notebook is the following:

```shell
$ vagrant ssh spark-master
...
$ cd /vagrant/jupyter-notebooks
$ jupyter-notebook
```

Inside the folder `jupyter-notebooks` you may find some sample notebooks.

Then, go to your favorite browser and type in `localhost:8001`.

Besides, you can also start the Jupyter notebook with `pyspark` as the default interpreter by using the script `scripts/start-pyspark-notebook.sh`.
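For reference, scripts of this kind usually work by pointing PySpark's driver at Jupyter. The sketch below shows that general approach; the actual contents of `scripts/start-pyspark-notebook.sh` may differ, and the port is simply the default `IPythonPort` from the table above.

```shell
# Sketch of launching Jupyter with pyspark as the default interpreter.
# Assumption: SPARK_HOME is set inside the VM (e.g. /opt/spark).
export PYSPARK_DRIVER_PYTHON=jupyter                                   # run the driver through Jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8001"  # IPythonPort default
# "$SPARK_HOME"/bin/pyspark    # commented out: requires the VM's Spark install
```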
Remember that inside the Jupyter notebook you can:
- Code your scripts in Python 2, Python 3, R, and Scala 2.10.
- Use RISE, test_helper, and IPython-extensions.

To stop the notebook, just press Ctrl+C.
## Starting RStudio

The RStudio Server daemon should already be running in the background, so you only have to type `localhost:8787` in your browser. In order to work with Spark, you have to run the commands inside the `config.R` script. You may find this RStudio cheat sheet helpful.
## Installing Zeppelin

I recommend building Zeppelin separately from the provisioning of the master node, since the compilation takes a long time to complete. Thus, you can run the following lines and wait until all modules are built:

```shell
$ vagrant ssh spark-master
$ cd /vagrant/scripts
$ sudo ./60-zeppelin.sh
```
Once all the modules are compiled inside the `spark-master` node, you can start Zeppelin by typing:

```shell
$ sudo env "PATH=$PATH" /opt/zeppelin/bin/zeppelin-daemon.sh start
```

Remember to use the same command with `stop` to halt the daemon. Alternatively, you can run the script directly from the host machine by means of:

```shell
$ vagrant ssh spark-master -c "bash /opt/zeppelin/bin/zeppelin-daemon.sh start"
```
Finally, to start working with Zeppelin you may use the notebooks inside the folder `/vagrant/zeppelin_notebooks`.
## Installing scikit-learn and TensorFlow

You may install these two libraries by running the following lines:

```shell
$ vagrant ssh spark-master
$ cd /vagrant/scripts
$ sudo ./61-scikit-learn-tensorflow.sh
```
Remember that TensorFlow is available for 64-bit systems only.
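Since only 64-bit builds exist, a quick way to check whether your guest can run TensorFlow is to inspect its architecture before running the install script (inside the VM):

```shell
# TensorFlow 0.6.0 provides 64-bit builds only; check the architecture first.
if [ "$(uname -m)" = "x86_64" ]; then
  echo "64-bit system: TensorFlow can be installed"
else
  echo "32-bit system: skipping TensorFlow"
fi
```

On an `ubuntu/trusty32` guest this reports the 32-bit case, so the install script should only attempt TensorFlow on `ubuntu/trusty64`.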
## License

GNU. Please refer to the LICENSE file in this repository.
## Acknowledgements (in alphabetical order)

Thanks to the following people for sharing their projects: Adobe Research, Damián Avila, Dan Koch, Felix Cheung, Francisco Javier Pulido, Gustavo Arjones, IBM Cloud Emerging Technologies, Jee Vang, Jeffrey Thompson, José A. Dianes, Maloy Manna, NGUYEN Trong Khoa, and Peng Cheng.

Thanks also to the following people for pointing out some bugs: Carlos Pérez-González, Christos Iraklis Tsatsoulis.