SciPy 2013 Data Processing Tutorial

## Recap

Thank you to everyone for coming to our tutorial. We hope you all learned something new and useful, and we encourage you to continue the lively discussions from the sessions throughout this week and beyond. To help with that, here is a quick rundown of the topics we covered, along with some additional resources for those interested in learning more.

  • The tutorial GitHub repo contains the slides and exercises, and should stay up for a while.
  • Your demo accounts on Wakari.io are not permanent, but it's super easy to sign up for a free account. Wakari is in active development, so if there's a feature you want or an annoyance you don't, feel free to give us a shout!

Pandas

Data Exploration

(Unsupervised machine learning)

  • Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) find the axes with highest variance.
    • These high variance axes represent the "important" variables.
    • Implementations in NumPy/SciPy and scikit-learn
  • K-means clustering tries to group "similar" data points together.
    • The number of clusters K is an input parameter. This is good or bad depending on the problem.
    • Methods like the Bayesian Information Criterion (BIC) can help determine K from the data if it is unknown.
  • Exercises: PCA and K-means.
  • Recommended reading/examples:
    • Paper on PCA
    • Jake Vanderplas's Git repo
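
To make the two ideas above concrete, here is a minimal NumPy sketch, not the tutorial's exercise code. It computes PCA through the SVD of the centered data and runs a plain Lloyd's-algorithm K-means. The function names `pca_svd` and `kmeans`, the synthetic two-blob data, and the seeds are all assumptions made for illustration.

```python
import numpy as np


def pca_svd(X, n_components):
    """Project X (n_samples x n_features) onto its highest-variance axes."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    axes = Vt[:n_components]                      # rows are the principal axes
    explained_var = s[:n_components] ** 2 / (len(X) - 1)
    return Xc @ axes.T, axes, explained_var


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; the number of clusters k is an input."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its points (keep it if it has none)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical data: two well-separated blobs that differ mainly along x.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(100, 3))
blob_b = rng.normal(loc=[50.0, 0.0, 0.0], scale=1.0, size=(100, 3))
X = np.vstack([blob_a, blob_b])

projected, axes, explained_var = pca_svd(X, n_components=2)
labels, centers = kmeans(X, k=2)
```

On data like this, the first principal axis should line up closely with the x direction, and K-means with K=2 should recover the two blobs. For real work, the scikit-learn implementations are likely the better choice.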

IPCluster

  • IPCluster clients talk to a central controller, which in turn wrangles remote nodes, each running one or more engines.

  • An engine is a separate Python process that executes the code sent to it, loosely comparable to a worker thread. You can run an engine on the same node as a controller, and nodes can run more than one engine.

  • Configuration is flexible, but somewhat poorly documented. For development, run ipcluster start -n 3 to start three engines, and connect to them from IPython with

      from IPython.parallel import Client
      client = Client()
    
  • Execute commands with view methods, e.g. direct.execute('foo()') where direct = client[:] is a DirectView over all engines, not client.execute('foo()').

  • IPCluster is ideal for embarrassingly parallel workloads that are CPU/GPU/RAM-heavy and light on data transfer.

  • Exercise: IPCluster Basics

  • Exercise: Bayesian Estimation w/ MCMC and IPCluster (view/clone this notebook with Wakari).

  • Recommended notebook: Introduction to Parallel Python with IPCluster and Wakari, Ian Stokes-Rees.

  • Recommended text: Doing Bayesian Data Analysis, John K Kruschke.
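
As a rough illustration of the client, view, and engine pattern described above, the sketch below spreads a small Monte Carlo estimate of pi across the engines with a DirectView's `map_sync`. It is not the tutorial's exercise code. The names `count_hits` and `get_map` are hypothetical, and the fallback to a serial map when no cluster is reachable is an assumption added so the example runs anywhere. The import tries `ipyparallel` first, the package that later replaced `IPython.parallel`.

```python
def count_hits(args):
    """One independent task: count random points inside the unit circle."""
    seed, n_samples = args
    import random  # imported here so the name also exists on a remote engine
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def get_map():
    """Return a cluster-backed map if engines are reachable, else a serial one."""
    try:
        try:
            from ipyparallel import Client       # current package name
        except ImportError:
            from IPython.parallel import Client  # import path used in the tutorial
        client = Client()
        direct = client[:]        # DirectView over all engines
        return direct.map_sync    # blocks until every engine has returned
    except Exception:
        # Assumption for this sketch: without the package or a running
        # cluster, fall back to running the tasks one after another.
        return lambda func, seq: list(map(func, seq))


n_samples = 20000
tasks = [(seed, n_samples) for seed in range(4)]  # one task per seed

parallel_map = get_map()
hits = parallel_map(count_hits, tasks)
pi_estimate = 4.0 * sum(hits) / (len(tasks) * n_samples)
print(pi_estimate)
```

Each task only ships a seed and a sample count to an engine and returns a single integer, so it is heavy on CPU and light on data transfer, which is the kind of workload IPCluster suits.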

MapReduce