Coursework for MIDS Scaling Up! Really Big Data

This is an index of coursework for the MIDS class "Scaling Up! Really Big Data". Please submit corrections if you find problems in the assignments. Submissions should be well-formed git pull requests.

Week 1: Introduction

Labs

  1. Provisioning in SoftLayer

Week 2: Cloud Computing 101

Homework

  1. Working with Cloud Resources

Labs

  1. Salt States and Docker deployment of the ELK stack

Week 3: OpenStack Introduction

Labs

  1. Hadoop over OpenStack DevStack using Sahara

Week 4: Distributed Filesystems

Homework

This homework is graded.

  1. Part 1: GPFS setup
  2. Part 2: The Mumbler

Labs

There will be no in-class lab for this assignment.

Week 5: MapReduce and Hadoop

Homework

  1. Hadoop Distributed Sort with YARN and HDFS

Labs

(Complete the following in order)

  1. Load Google 2-gram dataset into HDFS
  2. Preprocess 2-gram data for Mumbler

Week 6: Apache Spark

Homework

  1. Apache Spark Introduction

Labs

  1. Machine Learning with Spark and MLlib

Week 7: Object Storage

Homework

  1. Object Storage

Labs

(Complete the following in order)

  1. Data Transfer Performance
  2. Rsync Investigation

Week 8: NoSQL

Homework

  1. NoSQL

Week 9: Spark Streaming

Homework

  1. Streaming Tweet Processing

Labs

  1. Spark Streaming and Cassandra

Week 10: Scaling Up

Homework

  1. Orchestrate with Brooklyn

Labs

  1. Brooklyn labs

Week 11: Spark ML Round 2

(Homework-free week!)

Labs

  1. Streaming Analytics with AlchemyAPI

Week 12: Search

Homework

  1. Elasticsearch

Labs

  1. Crawling the Web with Nutch, Indexing with Solr

Week 13: Genomics

Homework

  1. Genomics with ADAM