Data-science

Notes on configuring AWS for use with various data science tools on an Ubuntu trusty installation.

Ubuntu configuration

The configuration is Ubuntu trusty (14.04) on an EC2 instance.

Ubuntu startup scripts

The .bashrc file is used to make a number of changes on startup. These changes are:

  1. Switch to the correct Python environment
  2. Start the Jupyter notebook server
  3. Start the rodeo server

.bashrc is located in the ubuntu user's home directory: /home/ubuntu/.bashrc

# Switch to the correct anaconda environment
source activate [environmentName]

# Start the Jupyter notebook server in the background so the shell stays usable
jupyter notebook &

# Start rodeo in the background, listening on all interfaces
rodeo . --host=0.0.0.0 --port=8000 &
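Because .bashrc runs for every interactive shell, starting servers there launches another copy on each login. The sketch below shows one possible guard using a PID file. The helper name `start_once` and the PID file paths are hypothetical, and a stale PID file whose number has been reused by another process would be mistaken for a running server.

```shell
# Hypothetical helper: start a command in the background only if the PID
# recorded in the given PID file does not belong to a live process.
start_once() {
    pidfile="$1"; shift
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 0    # already running, nothing to do
    fi
    nohup "$@" > /dev/null 2>&1 &
    echo $! > "$pidfile"
}

# Possible use in .bashrc (paths are assumptions, left commented out here):
# start_once "$HOME/.jupyter-notebook.pid" jupyter notebook
```

The first login starts the server and records its PID; later logins find the process still alive and skip the launch.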

RStudio Server

Follow the installation steps on the RStudio Server website. In particular, note the section on the public keys needed to access the repositories. This must be done before the install or upgrade commands will work.

Expose the RStudio Server port (8787 by default) by modifying the AWS security group that the instance belongs to.
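The security group can be changed in the AWS console or with the AWS CLI. As a rough sketch, the helper below only prints the CLI call that would open one TCP port to one address range, so it can be checked before running it. The helper name `open_port_cmd`, the group ID and the address range are placeholders, and 8787 assumes RStudio Server's default port has not been changed.

```shell
# Hypothetical helper: print the AWS CLI call that would open one TCP port
# to one CIDR range. Nothing is sent to AWS; the command is only echoed.
open_port_cmd() {
    group_id="$1"; port="$2"; cidr="$3"
    echo "aws ec2 authorize-security-group-ingress" \
         "--group-id $group_id --protocol tcp --port $port --cidr $cidr"
}

# Placeholder group ID and a documentation-only address range.
open_port_cmd sg-0123456789abcdef0 8787 203.0.113.0/24
```

Restricting the CIDR range to your own addresses is safer than opening the port to 0.0.0.0/0.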

Create at least one valid user for RStudio Server using the normal approach for Linux. The user must be given a home directory, or logging in to RStudio Server will fail.

# Create the user with a home directory
$ sudo useradd -d /home/[username] -m [username]

# Create a password for the user
$ sudo passwd [username]
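Since a missing home directory makes the RStudio Server login fail, it can be worth checking the account before trying to log in. This is a minimal sketch that assumes `getent` is available (it is on a standard Ubuntu install); the helper name `has_home_dir` is hypothetical.

```shell
# Hypothetical helper: succeed only if the account exists and the home
# directory recorded for it is present on disk.
has_home_dir() {
    home="$(getent passwd "$1" 2>/dev/null | cut -d: -f6)"
    [ -n "$home" ] && [ -d "$home" ]
}

# Example use, with [username] replaced by the real account name:
# has_home_dir [username] && echo "home directory OK"
```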

Increase the RAM for RStudio server or increase swapfile

The best solution is to use a larger AWS instance with more RAM rather than a swap file, which will be far slower.

These instructions were necessary to increase memory and get around issues with RStudio being unable to allocate sufficient memory to install knitr or load certain packages. They are based on the tutorial found here: https://www.digitalocean.com/community/tutorials/how-to-add-swap-on-ubuntu-12-04

 # check to see if there is an existing swap
 $ sudo swapon -s

 # make sure we have enough disk space free, need at least 256 MB
 $ df -h

 # make the swap.  it stays active only until the machine is rebooted
 $ sudo /bin/dd if=/dev/zero of=/swapfile bs=1024 count=256k
 $ sudo /sbin/mkswap /swapfile

 # results in:
 # Setting up swapspace version 1, size = 262140 KiB
 # no label, UUID=2193e5eb-0482-420c-9a7d-53558084fd06

 $ sudo /sbin/swapon /swapfile

 # confirm that you can see it:
 $ swapon -s

 # To turn off the swap, run:
 $ sudo /sbin/swapoff /swapfile

 # To make it use this swap every time the machine is started,
 # add this line to /etc/fstab:
  /swapfile swap swap defaults 0 0

 # Edit the file as root instead of making /etc/fstab world-writable
 $ sudo vim /etc/fstab
 
 

This wasn't enough to install knitr, so I increased the swap to 1 GB, after which the install worked.

 # To resize the swap to 1 GB instead of 256 MB
 $ sudo /sbin/swapoff /swapfile
 $ sudo rm /swapfile
 $ sudo /bin/dd if=/dev/zero of=/swapfile bs=1M count=1024
 1024+0 records in
 1024+0 records out
 1073741824 bytes (1.1 GB) copied, 29.6197 s, 36.3 MB/s
 $ sudo /sbin/mkswap /swapfile
 Setting up swapspace version 1, size = 1048572 KiB
 no label, UUID=17f490d4-4188-4eaa-84aa-3ea0fd62bfe4
 $ sudo swapon /swapfile
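The size of the swap file is simply dd's block size multiplied by its count. The arithmetic for the two dd invocations above can be sketched as follows, assuming dd reads the `k` suffix as 1024 and `M` as 1024 * 1024, which is consistent with the mkswap output shown. The variable names are illustrative only.

```shell
# dd writes bs * count bytes into the swap file.
first_bytes=$(( 1024 * 256 * 1024 ))      # bs=1024 count=256k
second_bytes=$(( 1024 * 1024 * 1024 ))    # bs=1M   count=1024

echo "first swap file:  $(( first_bytes / 1024 / 1024 )) MiB"
echo "second swap file: $(( second_bytes / 1024 / 1024 )) MiB"
```

The second value, 1073741824 bytes, matches the byte count dd reported. The sizes mkswap prints are 4 KiB smaller than the files, which would fit one page being reserved for the swap header.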

Swappiness should be set to 10. At the Ubuntu default of 60 the kernel swaps more readily, which can cause poor performance, whereas a value of 10 makes swap act as an emergency buffer that helps prevent out-of-memory crashes.

You can do this with the following commands:

$ echo 10 | sudo tee /proc/sys/vm/swappiness
$ echo vm.swappiness = 10 | sudo tee -a /etc/sysctl.conf
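Running the `tee -a` command more than once appends a duplicate line to /etc/sysctl.conf each time. One way to avoid that is sketched below, assuming GNU sed (for `-i`) as shipped with Ubuntu. The helper name `set_sysctl_line` is hypothetical, and the key is used as a regular expression, so the dot in `vm.swappiness` matches any character.

```shell
# Hypothetical helper: make sure "key = value" is set in a sysctl-style
# file, replacing an existing setting instead of appending a second one.
set_sysctl_line() {
    file="$1"; key="$2"; value="$3"
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$file" 2>/dev/null; then
        sed -i "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file"
    else
        echo "$key = $value" >> "$file"
    fi
}

# Possible use from a root shell (left commented out here):
# set_sysctl_line /etc/sysctl.conf vm.swappiness 10
```

After editing the file, `sudo sysctl -p` reloads the settings from /etc/sysctl.conf.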

To prevent the file from being world-readable, you should set up the correct permissions on the swap file:

$ sudo chown root:root /swapfile 
$ sudo chmod 0600 /swapfile
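To confirm that the permissions took effect, the mode can be read back with `stat`. This is a small sketch assuming GNU stat (the `-c` format option), which Ubuntu provides; the helper name `is_private` is hypothetical.

```shell
# Hypothetical helper: succeed only if the file's permission bits are
# exactly 600 (read and write for the owner, nothing for anyone else).
is_private() {
    [ "$(stat -c '%a' "$1")" = "600" ]
}

# Possible use (left commented out here):
# is_private /swapfile && echo "swap file permissions OK"
```

Running `stat -c '%a %U:%G' /swapfile` by hand also shows the owner and group, which should read root:root after the chown above.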

Additional configuration for Geospatial analysis

Geospatial analysis and mapping using rgdal requires a number of extra system libraries to be installed.

sudo apt-get install libgdal-dev
sudo apt-get install libproj-dev

Useful tools for geospatial analysis and mapping:

  • rgdal
  • ggmap
  • rgeos
  • maptools
  • tmap - thematic maps
  • spatstat

Configuring Python

  1. Python
  2. IPython notebooks and jupyter
  3. rodeo for a similar kind of interface to Rstudio

Configuring python webserver