/data-centric-deep-learning

Public repository for the "Data-Centric Deep Learning" course taught by Mike Wu and Andrew Maas. Available at https://corise.com/course/data-centric-deep-learning.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Welcome to Data-Centric Deep Learning

Data-Centric Deep Learning (DCDL) is a four-week class taught by Andrew Maas and Mike Wu on the Uplimit platform. This repository contains the open-sourced project material.

Course Description

Build, improve, and repair deep learning applications with a data-centric approach. Data is the key to success in modern machine learning, and this course provides hands-on experience with the impact of data quality, improving models via data, realistic performance evaluation, and human-in-the-loop data improvement methods. Learn best practices for achieving production-quality deep learning results, and how new technologies like pre-trained foundation models can make development faster and simpler. Understand how data-centric principles apply when developing LLM-based applications, agents, and retrieval-augmented generation (RAG) systems.

Instructor's Notes

The course is focused on a practical introduction to deep learning engineering and operations, with an emphasis on algorithmic challenges that practitioners face in the real world. To be "data-centric" means leveraging methods and tools that use data to improve, repair, and test deep learning models.

Students will walk through each step of a deep model's lifecycle: annotation, training, testing, deployment, monitoring, and back to annotation. In each step, students will be introduced to new tools as well as the underlying methodology.

This class is an extremely hands-on, project-driven course. Students will work with real data across images, speech audio, and natural language. Students will leverage state-of-the-art methods to achieve high performance, and will also break these models to analyze their shortcomings in practice.

In July '24, we updated this course in light of recent advances in large language models and the new ecosystem of data-centric problems that this class of models presents.

Class layout

This course will have four weekly projects. Each project will build on concepts from the prior week but have its own standalone components.

  • Week 1 will be completely in a Colab notebook, so no code in this repository will be used.
  • Weeks 2 through 4 will each have their own folder in course/.
  • In each week's folder, you will find at least one subfolder. Each subfolder is a project component. The weekly course page on Uplimit will guide you through the different subfolders.
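To get oriented before starting a week, it can help to print the folder layout described above. The following is a minimal sketch, assuming you run it from the root of your fork and that the weekly folders sit directly under course/ as stated. The function name `list_components` is a hypothetical helper, not part of the course code, and the actual folder names may differ from whatever you expect.

```python
from pathlib import Path


def list_components(course_dir="course"):
    """Map each week folder under course_dir to its sorted subfolder names.

    Hypothetical helper: assumes week folders are direct children of
    course_dir and project components are direct children of each week.
    Hidden folders (names starting with ".") are skipped.
    """
    root = Path(course_dir)
    if not root.is_dir():
        # Not in the repo root, or the folder is missing: nothing to list.
        return {}
    layout = {}
    weeks = sorted(
        p for p in root.iterdir() if p.is_dir() and not p.name.startswith(".")
    )
    for week in weeks:
        layout[week.name] = sorted(
            c.name
            for c in week.iterdir()
            if c.is_dir() and not c.name.startswith(".")
        )
    return layout


if __name__ == "__main__":
    for week, components in list_components().items():
        print(week, "->", ", ".join(components) or "(no subfolders)")
```

Each printed line corresponds to one week's folder, followed by the project components that the weekly course page on Uplimit walks through.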

Prerequisites

We expect students to be proficient in Python programming and familiar with deep learning frameworks like PyTorch or TensorFlow. Students should have a basic understanding of machine learning and deep learning concepts. Knowledge of web applications is optional but may be beneficial.

Setup

These projects are best done through GitHub Codespaces.

  1. Fork this repo (leave the box checked for "Copy the main branch only").
  2. We recommend using GitHub Codespaces. If you click the green "Code" button at the top right of this page, you should be able to enter a dev environment for completing this assignment.