MIT License

DS200 (Architecture for Management of Large Datasets)

This is the course web page for Architecture for Management of Large Datasets, taught at IIT Bhilai, India, in the Monsoon Semester of 2021.


Course Instructor: Dr. Gagan Raj Gupta

Other Instructors: Dr. Soumajit Pramanik, Dr. Subhajit Sidhanta

Teaching Assistants: Muttareddygari Sreechakra, Anirban Haldar

Canvas Link: https://canvas.instructure.com/courses/3570804

Motivation

Over the past few years, we have seen the emergence of "big data": disruptive technologies that have transformed commerce, science, and many aspects of society. These developments are enabled by infrastructure that allows us to distribute computations across hundreds or even thousands of commodity servers.

  • Getting data is becoming easier day by day, but we have too much to analyze (e.g. web, transactional data, text)
  • Data has errors of various types (missing, incorrect, etc.), is incomplete, and is hard to clean (e.g. user reviews/ratings, distorted images)
  • Data is usually high-dimensional, involving a lot of columns or features (e.g. text, images, videos, graphs)
  • Data usually has complex correlations, so i.i.d. assumptions often do not hold (e.g. graph data, time-series data)
  • Data is being generated at a great speed and it is too expensive to store all of it (e.g. user or machine transactions, queries)

In this course, we want to learn how large datasets are maintained and analyzed. If a single computer is not enough, how do we use multiple computers (even datacenters) to analyze large datasets? How do we make programming easy for data analysis and ML?

One key breakthrough that makes this all possible is the development of abstractions for data-intensive computing that allow programmers to reason about computations at a massive scale, hiding low-level details such as synchronization, data movement, and fault tolerance.

This course provides an introduction to big data infrastructure, starting with MapReduce, the first of these datacenter-scale programming abstractions. The Hadoop implementation of MapReduce lies at the core of an application stack that has gained widespread adoption in both industry and academia. A major focus of this course is algorithm design and "thinking at scale", applied to a variety of domains: text, graphs, relational data, etc. We will also cover a few next-generation systems that are vying to replace MapReduce as the de facto big data processing platform of tomorrow.
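To make the MapReduce abstraction concrete, here is a minimal single-machine sketch of word count in plain Python. It is an illustration only, not Hadoop code and not taken from the course materials: the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the `shuffle` function stands in for the grouping step that a real framework performs across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data at scale"]))
```

The programmer writes only the mapper and the reducer. In a real deployment, the framework would handle partitioning the input, moving the intermediate pairs between machines, and rerunning failed tasks.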

Course Objectives

  • Motivate the need for managing large datasets
  • Develop the architectural requirements for a data store (lake)
  • Introduce various distributed programming models and abstractions
  • Explain the new paradigm of algorithm design with MapReduce for handling large datasets
  • Introduce streaming algorithms for processing streaming data
  • Provide hands-on experience to students in analyzing datasets in diverse fields (Industry 4.0, NLP, Graphs, Networks, Bio-informatics, Time-series)
  • Understand the software architecture

Pre-requisites

  • Basic knowledge of Python (most assignments will be based on Python)
  • Knowledge of basic computer science principles and skills

Tentative Course Outline

| Lec # | Date | Topics covered in class | Textbook reference, readings |
| 1 | Sep 28 | Large Datasets Examples; Data-center Architecture; Importance of Analysis; Requirements on Architecture for Managing Large Datasets; How would you analyze a large dataset?: Sequential vs. Parallel Programming; Higher levels of Abstraction for Parallel Programming: Datacenter is the new computer; Data Intensive and Data Parallel Computing; MapReduce introduction | DTP |
| 2 | Sep 30 | Von Neumann Model and current computers; Memory Hierarchy; Storage Technologies; Parallel Reads and Writes; Reliability and Cost Tradeoff with distributed file systems, latency and throughput; External Memory Algorithms: External Merge Sort | References |
| 3 | Oct 1 | Reliability Cost Tradeoff review; Streaming Model Introduction; Parallel Computing Models: synchronization; PRAM model: Computing Minimum with N/2 Processors via Tournament Method, Correctness, Solving with P<N/2 processors; BSP model; MapReduce Explained; Demo of scalability via MapReduce on word-count problems | References |
| 4 | Oct 5 | Hash Functions; Aggregation of Data using MapReduce; Optimizing MapReduce by using Combine and Partition; Conditions on using Combine; Example of Spotify and MailTrust; Word Count Optimized | MRDP, MMDS 2.1 |
| 5 | Oct 7 | Examples of Aggregation: Count, Min, Max, Average, Sum, Median, Percentiles; Relational Algebra, SQL and Pandas Examples; Optimizing Percentile Calculations | MRDP, MMDS 2.2 |
| 6 | Oct 8 | Tutorials on Input and Output of MapReduce; Regular Expressions, Shell Scripting; Map and Reduce in Python; Notion of efficiency of a parallel program; Compute Min (Comparison) efficiency; Amdahl’s law | References |
| 7 | Oct 12 | Filtering patterns: Data cleaning, Bloom Filters for Set Membership | MRDP |
| 8 | Oct 14 | Filtering patterns: Top 10, Distinct Items; Operations on Multiple Relations (Tables, Datasets): Union, Intersection, Difference, Joins; Matrix Multiplication; Reservoir Sampling | MRDP |
| 9 | Oct 21 | Pipelining, Chaining, Bag operations, More patterns, Workflow Systems, Exam 1 Review/Prep | HDG, MMDS Ch2 |
| 10-12 | Oct 26, 28, 29 | Indexing with MapReduce, TF-IDF scoring with MapReduce, and language models with MapReduce | DTP |
| 13, 14 | Nov 2, 5 | BFS/DFS, PageRank, random walk | DTP |
| 15, 16, 17 | Nov 8-12 | Spark Architecture and Programming | MMDS, SDG |
| 18 | Nov 16 | Intro to Spark ML | SDG |
| | | HDFS, YARN, Hadoop I/O | HDG |
| | | Anatomy of MapReduce Job Run | HDG |
| | | Pig, Hive, ZooKeeper | HDG |
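
Lecture 6 in the outline lists Amdahl's law under the efficiency of a parallel program. As a rough illustration of that topic (not taken from the course materials), the sketch below computes the bound: if a fraction f of a program can be parallelized, the speedup on p processors is at most 1 / ((1 - f) + f / p). The function names `amdahl_speedup` and `efficiency` and the sample value f = 0.9 are assumptions chosen for this example.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only part of the work can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def efficiency(parallel_fraction, processors):
    """Speedup divided by the number of processors used."""
    return amdahl_speedup(parallel_fraction, processors) / processors


if __name__ == "__main__":
    # Hypothetical workload: 90% of the running time is parallelizable.
    for p in (1, 10, 100, 1000):
        print(p, round(amdahl_speedup(0.9, p), 2), round(efficiency(0.9, p), 3))
```

With f = 0.9 the speedup can never reach 1 / (1 - 0.9) = 10, no matter how many processors are added, which is why the serial part of a job matters so much at datacenter scale.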

Meeting Times

Books/References/Practice materials

Similar Courses