A self-contained environment for doing data science with {Python | Shell | R | Hadoop}. This is a Vagrant box built on Ubuntu 12.04 LTS.

The Zipfian Distribution

An open source development environment for getting up and running with data science quickly on any platform (scroll to Quickstart if you are impatient)! This repository contains shell scripts that install the necessary packages and programs on Ubuntu 12.04 LTS, as well as libraries for doing data analysis in Python, R, and Hadoop. All libraries have been tested to work together, and the Python packages are installed in a virtualenv.

The Zipfian Distribution itself is a Vagrant box based on Ubuntu 12.04 LTS. It is meant to be a self-contained environment that runs on any platform supported by VirtualBox (which is almost all of them). This is the quickest and most stable way to get started, and it is the recommended path of least resistance.

The shell scripts are all written for Ubuntu and use apt-get, which makes them well suited to spinning up machines in the cloud with Rackspace or AWS (or any other cloud provider). The dependency on apt-get, however, severely limits their portability to other operating systems. The Python and R libraries are installed using pip and CRAN, and should work on any platform.

Read Vagrant docs for a more advanced treatment of customizing the box.

This is a product of Zipfian Academy. If this is the kind of stuff that excites you, we are always looking for passionate people to be instructors, give guest lectures, or work with us on projects. If you would like to learn about data science in a fast-paced environment with other awesome students, I encourage you to apply to our 12-week immersive program.

If you would just like to stay up to date, you can find us on Twitter (@zipfianacademy) and on Facebook.

NOTE: Python 2.7 is the default version for this distribution

Quickstart

Mac OSX

IMPORTANT: Run this command from the directory in which you want to download and initialize the Vagrantfile and associated VM. This should be the root directory of your project/git repository. Only run this command once, as it downloads large files. If you would like to use the Vagrant box for other projects, simply copy the Vagrantfile to the root of those projects.

bash <(curl -s http://zipfianacademy.com/downloads/zipfian-distribution/install/mac-osx.sh)

Once the script is done running you should have VirtualBox, Vagrant, and XQuartz installed and ready to start analyzing some data!

  1. Initialize the VM: vagrant up
  2. Log in to the guest machine: vagrant ssh
  3. Change into the synced directory: cd /vagrant
  4. Make sure all your files are there: ls -larth
  5. Play!

That's it! You should be logged in to the guest machine and have all the Python/R/Hadoop goodness at your fingertips... try ipython notebook
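The numbered steps above can be wrapped in a small helper. This is only a sketch: the function name start_box is hypothetical, and it assumes vagrant is already on your PATH. It refuses to run when the target directory has no Vagrantfile, then boots the VM and lists the synced directory from inside the guest.

```shell
# Hypothetical helper (not part of this repository): boot the box in a
# project directory and list the synced folder from inside the guest.
start_box() {
  local dir="${1:-.}"

  # Guard: "vagrant up" only makes sense where a Vagrantfile exists.
  if [ ! -f "$dir/Vagrantfile" ]; then
    echo "No Vagrantfile found in $dir" >&2
    return 1
  fi

  # Run in a subshell so the caller's working directory is unchanged.
  (
    cd "$dir" &&
    vagrant up &&
    vagrant ssh -c 'cd /vagrant && ls -larth'
  )
}
```

Calling start_box with no argument uses the current directory. For an interactive session, run vagrant ssh yourself afterwards.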

Windows

Coming Soon!

Linux

Coming Soon!

Dependencies

The only dependency for Vagrant is VirtualBox, so to get up and running with the Zipfian Distribution you only need to install these two things: VirtualBox and Vagrant.

For graphical support (e.g. web browser, RStudio, IPython notebooks) your host machine will also need to support X11 forwarding (on Mac OSX the install script sets up XQuartz for this).

To log in to the VM you will also need an ssh client.
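Before running vagrant up, it can help to confirm that the tools listed above are on your PATH. The function below is a minimal sketch with a hypothetical name (missing_tools). The command names in the usage note are also assumptions: VBoxManage is the command-line tool that ships with VirtualBox, but your installation may expose it differently.

```shell
# Hypothetical helper: print the names of any commands that cannot be
# found on PATH, separated by spaces. Prints an empty line if none are
# missing, and returns non-zero if at least one is.
missing_tools() {
  local missing=""
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  # Strip the leading space added by the loop.
  printf '%s\n' "${missing# }"
  [ -z "$missing" ]
}
```

For example, missing_tools vagrant VBoxManage ssh prints nothing but an empty line when all three are installed.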

Windows

Mac OSX

What's in Here?!?

lib/

One-line install scripts. Each script downloads and installs all dependencies (VirtualBox, Vagrant, X11) and then bootstraps the Vagrant VM. Currently only Mac OSX is supported.

platforms/

Scripts for specific platforms/OSes. Currently only shell scripts for Ubuntu 12.04 LTS are supported.

  • bootstrap.sh: run all the other script files to install the entire environment.
  • ubuntu.sh: install Ubuntu development libraries and packages.
  • python.sh: install Python packages with pip.
  • hadoop-ecosystem.sh: install Hadoop and associated ecosystem libraries.
  • r.sh: install R packages.
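To illustrate the role bootstrap.sh plays in the list above, here is a sketch of a runner that executes install scripts in order and stops at the first failure. It is not the repository's actual bootstrap.sh: the function name run_scripts is hypothetical, and the order in the usage note is only a guess.

```shell
# Hypothetical runner: execute each script in the order given and stop
# as soon as one exits with a non-zero status.
run_scripts() {
  local script
  for script in "$@"; do
    echo "==> $script"
    if ! bash "$script"; then
      echo "Failed: $script" >&2
      return 1
    fi
  done
}
```

A call such as run_scripts ubuntu.sh python.sh hadoop-ecosystem.sh r.sh would install the system packages first, so that later scripts can rely on them.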

test/

This is where automated tests will go to validate the packages/install. There are currently none.

vagrant/

This is where Vagrant-specific files go. Currently there is the base Vagrantfile for the Zipfian Distribution as well as the box image.

Slow start

Download the Dependencies

Here!

Get the Vagrantfile

wget http://zipfianacademy.com/downloads/zipfian-distribution/vagrant/Vagrantfile

Bootstrap the VM

vagrant up

This Vagrant VM contains approximately XXX of additional files/libraries and takes around YYY minutes... now is a good time to grab a snack

SSH into your Machine!

vagrant ssh

NOTE: The Vagrant box has X11 forwarding enabled, allowing you to run graphical applications (e.g. a browser, IPython notebooks, RStudio) in the VM and have their windows displayed on the host machine.
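When X11 forwarding is active, the ssh session inside the guest has the DISPLAY environment variable set. The sketch below checks for that before you launch a graphical application. The function name check_x11 is hypothetical, and a set DISPLAY does not guarantee that the X server on the host is actually running.

```shell
# Hypothetical check, meant to be run inside the guest after "vagrant ssh".
# ssh sets DISPLAY in the session when X11 forwarding is in effect.
check_x11() {
  if [ -n "${DISPLAY:-}" ]; then
    echo "X11 forwarding looks active (DISPLAY=$DISPLAY)"
  else
    echo "DISPLAY is not set; graphical applications will not open" >&2
    return 1
  fi
}
```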

Synced Files

By default, Vagrant syncs the host directory in which the Vagrantfile resides with the /vagrant folder on the guest VM. This is where you will be doing most of your work. If you would like to change this, modify the Vagrantfile accordingly.
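As a rough illustration of such a change, the helper below prints a Vagrantfile fragment that maps an extra host directory into the guest. It assumes a Vagrant version that uses the version "2" configuration syntax (config.vm.synced_folder). The Vagrantfile shipped here may be written differently, and both the function name and the paths are hypothetical.

```shell
# Hypothetical helper: print a Vagrantfile fragment that syncs a host
# directory ($1) with a path inside the guest ($2).
synced_folder_snippet() {
  local host_dir="$1"
  local guest_dir="$2"
  cat <<EOF
Vagrant.configure("2") do |config|
  # Sync $host_dir on the host with $guest_dir in the guest
  config.vm.synced_folder "$host_dir", "$guest_dir"
end
EOF
}
```

For instance, synced_folder_snippet ./data /home/vagrant/data prints a fragment whose synced_folder line can be copied into the existing configure block of your Vagrantfile. Run vagrant reload afterwards for the change to take effect.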

Vagrant

Customizing your Box!

Any configuration changes that you put in the Vagrantfile will override the default box configuration.

Please see the official Vagrant documentation if you would like to customize the VM.

Package List

Ubuntu

Python

Scientific Packages

Utility packages

R

Coming Soon!

Hadoop Ecosystem

For the associated Hadoop components, we leverage Apache Bigtop to ease the installation process.

Road map (In no particular order)

  • Convert shell scripts to use Chef or Puppet for increased portability of the individual scripts.
  • Configure Hadoop ecosystem libraries to start on boot (HDFS, Oozie, Hue, etc.)
  • Write documentation/tutorials on how to run specific non-standard library installs (e.g. Spark).
  • Add R packages and RStudio
  • Add Cascading
  • Add Storm
  • Write automated tests for cross library compatibility
  • Create Homebrew package for single line install
  • Create Cygwin package
  • Create Linux packages for the common package managers

Contributing

Contributions are much appreciated and this repository is meant to be a living document. Open issues or submit pull requests if you have a favorite library I missed or a new platform to run this on.

  1. Fork it.
  2. Create a branch (git checkout -b my_zipfian)
  3. Commit your changes (git commit -am "Added The Biggest Data")
  4. Push to the branch (git push origin my_zipfian)
  5. Open a Pull Request
  6. Enjoy some GIFs while waiting

Community

Keep track of development and community news.

Resources/References

Here is a list of great sites and tutorials that inspired this project:

Author

Jonathan Dinu

Copyright and license

Copyright (c) 2013 Zipfian, Inc. under the Apache 2.0 license.