Coursework for MIDS Scaling Up! Really Big Data
This is an index of coursework for the MIDS class "Scaling Up! Really Big Data". If you find problems in the assignments, please submit corrections as well-formed Git pull requests.
Provisioning in SoftLayer
Week 2: Cloud Computing 101
Working with Cloud Resources
Salt States and Docker deployment of the ELK stack
Week 3: OpenStack Introduction
Hadoop over OpenStack DevStack using Sahara
Week 4: Distributed Filesystems
This is a graded homework assignment
Part 1: GPFS setup
Part 2: The Mumbler
There will be no in-class lab for this assignment
Week 5: MapReduce and Hadoop
Hadoop Distributed Sort with YARN and HDFS
(Complete the following in order)
Load Google 2-gram dataset into HDFS
Preprocess 2-gram data for Mumbler
Apache Spark Introduction
Machine Learning with Spark and MLlib
Object Storage
(Complete the following in order)
Data Transfer Performance
Rsync Investigation
NoSQL
Streaming Tweet Processing
Spark Streaming and Cassandra
Orchestrate with Brooklyn
Brooklyn labs
Week 11: Spark ML Round 2
(Homework-free week!)
Streaming Analytics with AlchemyAPI
Crawling the Web with Nutch, Indexing with Solr
Elasticsearch
Genomics with ADAM