Data Engineering class in the Supaero Data and Decision Sciences program

Data Engineering

The amount of data in the world, the forms these data take, and the ways of interacting with them have all grown dramatically in recent years. The extraction of useful knowledge from data has long been one of the grand challenges of computer science, and the dawn of "big data" has transformed the landscape of data storage, manipulation, and analysis. In this module, we will look at the tools used to store and interact with data.

The objective of this class is that students gain:

  • First-hand experience with, and detailed knowledge of, computing models, notably cloud computing
  • An understanding of distributed programming models and data distribution
  • Broad knowledge of many databases and their respective strengths

As part of the Data and Decision Sciences Master's program, this module aims specifically to provide the tool set students will use for data analysis and knowledge extraction, alongside the skills acquired in the Algorithms of Machine Learning and Digital Economy and Data Uses classes.

Class structure

The class is structured in four parts:

Data engineering fundamentals

In this primer class, students will cover the basics of Linux command line usage, Git, SSH, and data manipulation in Python. The format of this class is an interactive capture-the-flag event.

Data storage

This module covers Database Management Systems with a focus on SQL systems. For evaluation, students will install PostgreSQL and MongoDB, manipulate data in each, and compare the two systems.

Data computation

A technical overview of the computing platforms used in the data ecosystem. We will briefly cover cluster computing and then go in depth on cloud computing, using Google Cloud Platform as an example. Finally, a class on GPU computing will be given in coordination with the deep learning section of the AML class.

Data distribution

In the final module, we cover the distribution of data, with a focus on distributed programming models. We will introduce functional programming and MapReduce, then use these concepts in a practical session on Spark. Finally, students will do a graded exercise with Dask.
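To give a flavour of the MapReduce model mentioned above, here is a minimal single-machine sketch in plain Python. It is not taken from the class material: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, chosen only for illustration. A real framework such as Spark or Dask would run the map and reduce steps in parallel across many workers.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run the three phases in sequence on a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, for illustration only.
documents = ["big data big ideas", "data engineering"]
counts = word_count(documents)
print(counts)
```

Because each call to `map_phase` depends only on its own document, and each call to `reduce_phase` only on its own key, both steps could be distributed without changing the result. That independence is what the functional programming style makes explicit.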