The Zipfian Distribution
An open source development environment for getting up and running with data science quickly on any platform (scroll to Quickstart if you are impatient)! This repository contains shell scripts to install necessary packages and programs on Ubuntu 12.04 LTS, as well as libraries for doing data analysis in Python, R, and Hadoop. All libraries have been tested to play nicely together, and Python packages are installed in a virtualenv.
The Zipfian Distribution itself is a Vagrant box based on Ubuntu 12.04 LTS, meant to be a self-contained environment runnable on any platform supported by VirtualBox (which is nearly all of them). This is the quickest and most stable way to get started, and the recommended path of least resistance.
The shell scripts are all written for Ubuntu and use apt-get, which makes them great for spinning up machines in the cloud with Rackspace or AWS (or any other cloud provider). This dependency, however, severely limits their portability to any other OS. The Python and R libraries are installed using pip and CRAN, and should work on any platform.
Read the Vagrant docs for a more advanced treatment of customizing the box.
This is a product of Zipfian Academy. If this is the kind of stuff that excites you, we are always looking for passionate people to be instructors, to give guest lectures, or to work with us on projects. If you would like to learn about data science in a fast-paced environment with other awesome students, I encourage you to apply to our 12-week immersive program.
If you would just like to stay up to date with things, you can find us on Twitter @zipfianacademy and on Facebook.
NOTE: Python 2.7 is the default version for this distribution
Quickstart
Mac OSX
IMPORTANT: Run this command from the directory in which you want to download and initialize the Vagrantfile and associated VM. This should be the root directory of your project/git repository. Only run this command once, as it downloads large files. If you would like to use the Vagrant box for other projects, simply copy the Vagrantfile to the root of those other projects.
bash <(curl -s http://zipfianacademy.com/downloads/zipfian-distribution/install/mac-osx.sh)
Once the script is done running you should have VirtualBox, Vagrant, and XQuartz installed and ready to start analyzing some data!
- Initialize the VM:
vagrant up
- Log in to the guest machine:
vagrant ssh
- Change into the synced directory:
cd /vagrant
- Make sure all your files are there:
ls -larth
- Play!
That's it! You should be logged into the guest machine with all the Python/R/Hadoop goodness at your fingertips... try ipython notebook
Windows
Coming Soon!
Linux
Coming Soon!
Dependencies
The only dependency for Vagrant is VirtualBox, so to get up and running with the Zipfian Distribution you only need to install these two things. That is all.
For graphical support (e.g. web browser, RStudio, IPython notebooks) you will also need to support X11 forwarding.
To log in to the VM you will also need ssh.
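As a quick sanity check before running vagrant up, something like the sketch below can report which prerequisites are missing on the host. It assumes VirtualBox exposes its VBoxManage command and Vagrant its vagrant command on your PATH; the missing_tools helper is a hypothetical name used only for this illustration and is not part of this repository.

```shell
# Sketch only: missing_tools is a hypothetical helper, not part of this repo.
# Assumption: VirtualBox puts VBoxManage on PATH and Vagrant puts vagrant on PATH.

missing_tools() {
  # Print each named command that cannot be found on PATH, one per line.
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

missing=$(missing_tools VBoxManage vagrant ssh)
if [ -n "$missing" ]; then
  echo "Missing prerequisites:"
  echo "$missing"
else
  echo "All prerequisites found."
fi
```

X11 support is not checked here because how it is provided differs by platform (XQuartz on Mac OSX, for example).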
Windows
Mac OSX
What's in Here?!?
lib/
One-line install scripts that download and install all dependencies (VirtualBox, Vagrant, X11) and then bootstrap the Vagrant VM. Currently only Mac OSX is supported.
platforms/
Scripts for specific platforms/OSes. Currently only shell scripts for Ubuntu 12.04 LTS are supported.
- bootstrap.sh: runs all other scripts and installs the entire environment.
- ubuntu.sh: installs Ubuntu development libraries and packages.
- python.sh: installs Python packages with pip.
- hadoop-ecosystem.sh: installs Hadoop and associated ecosystem libraries.
- r.sh: installs R packages.
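To give a feel for what a script like python.sh does, here is a minimal sketch of installing packages into a virtualenv with pip. This is not the actual contents of python.sh: the VENV_DIR location, the three-package subset, and the DRY_RUN switch are all assumptions made for illustration. By default it only prints the commands it would run.

```shell
# Sketch only: NOT the real python.sh. VENV_DIR, PACKAGES and DRY_RUN are
# assumptions made for this illustration.
VENV_DIR="${VENV_DIR:-$HOME/.virtualenvs/zipfian}"
PACKAGES="numpy scipy pandas"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, anything else = run them

run() {
  # Echo the command in dry-run mode; execute it otherwise.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_python_packages() {
  run virtualenv "$VENV_DIR"
  # Use the virtualenv's own pip so nothing leaks into the system Python.
  # $PACKAGES is left unquoted on purpose so each name is a separate argument.
  run "$VENV_DIR/bin/pip" install $PACKAGES
}

install_python_packages
```

Set DRY_RUN=0 to actually create the virtualenv and install the packages.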
test/
This is where automated tests will go to validate the packages/install. There are currently none.
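Until real tests exist, a smoke test along these lines could at least confirm that the scientific Python packages import cleanly inside the VM. It is only a sketch: check_imports is a hypothetical helper, and the module names are the usual import names for the packages in the Package List (scikit-learn imports as sklearn).

```shell
# Sketch only: check_imports is a hypothetical helper, not part of this repo.

check_imports() {
  # Usage: check_imports PYTHON module...
  # Prints one status line per module; returns 1 if any import fails.
  local py="$1"
  shift
  local failed=0
  local mod
  for mod in "$@"; do
    if "$py" -c "import $mod" >/dev/null 2>&1; then
      echo "ok   $mod"
    else
      echo "FAIL $mod"
      failed=1
    fi
  done
  return "$failed"
}

if check_imports python numpy matplotlib scipy sklearn pandas \
     statsmodels networkx nltk pymc patsy; then
  echo "All packages import cleanly."
else
  echo "Some packages failed to import."
fi
```

Run it from inside the guest (after vagrant ssh) so that python refers to the interpreter in the virtualenv.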
vagrant/
This is where Vagrant-specific files go. Currently there is the base Vagrantfile for the Zipfian Distribution as well as the Box image.
Slow start
Download the Dependencies
Get the Vagrantfile
wget http://zipfianacademy.com/downloads/zipfian-distribution/vagrant/Vagrantfile
Bootstrap the VM
vagrant up
This Vagrant VM contains approximately XXX of additional files/libraries and takes around YYY minutes... now is a good time to grab a snack
SSH into your Machine!
vagrant ssh
NOTE: The Vagrant Box has X11 forwarding enabled, allowing you to run graphical applications (e.g. browser, IPython notebooks, RStudio) in the VM and have the windows displayed on the host machine.
Synced Files
By default Vagrant syncs the host directory in which the Vagrantfile resides with the /vagrant folder on the guest VM. This is where you will be doing most of your work. If you would like to change this, modify the Vagrantfile accordingly.
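To make the default mapping concrete, the sketch below translates a path on the host into the path it will have inside the guest. The guest_path helper and the example paths are hypothetical, and it assumes the default sync (project root to /vagrant) has not been changed in the Vagrantfile.

```shell
# Sketch only: guest_path is a hypothetical helper, and it assumes the
# default mapping of the project root to /vagrant.

guest_path() {
  # Usage: guest_path PROJECT_ROOT HOST_PATH
  local root="${1%/}"   # drop a trailing slash from the project root
  local path="$2"
  case "$path" in
    "$root")
      echo "/vagrant"
      ;;
    "$root"/*)
      echo "/vagrant/${path#"$root"/}"
      ;;
    *)
      echo "not synced: $path" >&2
      return 1
      ;;
  esac
}

guest_path /Users/me/project /Users/me/project/data/sample.csv
# prints /vagrant/data/sample.csv
```

Anything outside the project root is not synced, which is why the helper reports an error for such paths.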
Vagrant
Customizing your Box!
Any configuration changes you put in the Vagrantfile will override the default box configuration.
Please see the official Vagrant documentation if you would like to customize the VM.
Package List
Ubuntu
- R (R Studio optional):
R
- Python 2.7:
python
- Scala 2.9.2:
scala
- pip and easy_install
- Firefox:
firefox
- Chromium:
chromium-browser
- git and gitk:
git --help
- curl:
curl --help
- imagemagick:
convert --help
- SQLite3:
sqlite3
- Postgres:
psql-root
(for unadulterated root login)
- MongoDB:
mongo
- Vim
- Emacs:
emacs
- screen
Python
Scientific Packages
- IPython 1.0.0
- numpy
- matplotlib
- scipy
- scikit-learn
- pandas
- statsmodels
- networkx
- nltk
- pymc
- patsy
- virtualenv
- virtualenvwrapper
Utility packages
R
Coming Soon!
Hadoop Ecosystem
For the associated Hadoop components, we leverage Apache Bigtop to ease the installation process.
- Apache Maven
- Apache Ant
- Apache Zookeeper 3.4.5
- Apache Flume 1.3.1
- Apache HBase 0.94.5
- Apache Pig 0.11.1:
pig -x local
- Apache Hive 0.10.0:
hive
- Apache Sqoop 2 (AKA 1.99.2)
- Apache Oozie 3.3.2
- Apache Whirr 0.8.2
- Apache Mahout 0.7:
mahout
- Apache Solr (SolrCloud) 4.2.1
- Apache Crunch (incubating) 0.5.0
- Apache HCatalog 0.5.0
- Apache Giraph 1.0.0
- LinkedIn DataFu 0.0.6
- Cloudera Hue 2.3.0:
hue
- Apache Spark 0.7.3:
spark-repl
- Cascading (Coming Soon)
- Pycascading (Coming Soon)
- Storm (Coming Soon)
Road map (In no particular order)
- Convert shell scripts to use Chef or Puppet for increased portability of the individual scripts.
- Configure Hadoop ecosystem libraries to start on boot (HDFS, Oozie, Hue, etc.)
- Write documentation/tutorials on how to run specific non-standard library installs (e.g. Spark).
- Add R packages and RStudio
- Add Cascading
- Add Storm
- Write automated tests for cross-library compatibility
- Create Homebrew package for single-line install
- Create Cygwin package
- Create Linux packages for the common package managers
Contributing
Contributions are much appreciated and this repository is meant to be a living document. Open issues or submit pull requests if you have a favorite library I missed or a new platform to run this on.
- Fork it.
- Create a branch (git checkout -b my_zipfian)
- Commit your changes (git commit -am "Added The Biggest Data")
- Push to the branch (git push origin my_zipfian)
- Open a Pull Request
- Enjoy some GIFs while waiting
Community
Keep track of development and community news.
- Follow @zipfianacademy on Twitter.
- Join our email list on our website
- Read and subscribe to the Zipfian Academy Blog.
- Have a question that's not a feature request or bug report? Email jonathan [AT] zipfianacademy [DOT] com
Resources/References
Here is a list of great sites and tutorials that inspired this project:
- Scientific Python on Mac OSX
- Installing the Scipy stack
- conda: think pip + virtualenv
- Kaggle: Getting started with Python for Data Science
- Data Community DC: Getting started with Python for Data Scientists
- Anaconda Packages: great list of relevant Python packages.
- Scipy Superpack: Script to build scientific Python libraries on OSX.
- Testing that it all works
- Apache Bigtop
- Apache Spark Guide
- R and RStudio in Ubuntu
- Putty Tutorial
- pydata-science.sh
Author
Jonathan Dinu
Copyright and license
Copyright (c) 2013 Zipfian, Inc. under the Apache 2.0 license.