This repository contains the code and resources for the course "Learning with Limited Labels using Weak Supervision and Uncertainty-Aware Training". The course goes deep into advanced techniques for training machine learning models effectively when labeled data is scarce or noisy, focusing on data-centric AI approaches, weak supervision methodologies, semi-supervised learning strategies, and annotation error detection mechanisms.
The course is structured into several modules, each focusing on a specific aspect of learning with limited labels. Below is an overview of the topics covered, along with key points and takeaways from each module.
- Data-Centric AI paradigm
- Principles of Data-Centric AI
- Weak supervision techniques
- Types of weak supervision
- Aggregation of multiple labeling sources
- Semi-supervised learning approaches
- Self-training, co-training, and multi-view learning
- Label propagation
- Positive Unlabeled (PU) learning
- Elkan and Noto approach to PU learning
- Labeling Functions (LFs)
- Label Model
- Integration with Semi-Supervised Learning
- Snorkel Framework
- Evaluation metrics and comparison with fully supervised learning
- Named Entity Recognition (NER) using weak supervision
- Skweak Framework
- Document-Level Labeling
- Transfer Learning in NER tasks
- Iterative refinement of labeling functions
- Types of label noise
- Label noise transition matrix
- Retagging techniques
- Confident Learning for identifying mislabeled instances
- Confident Learning methodology
- Cleanlab library
- Application to various data modalities
- Extension to multi-label classification
- Handling model miscalibration
- Snorkel MeTaL
- Generative Model
- Flying Squid
- Dawid-Skene
- Hyper Label Model
- CrowdLab
- Influence functions for model interpretation
- Source-Aware Influence Functions
- Active Learning strategies
- Uncertainty sampling
- Query by Committee
- Diversity sampling
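To give a flavor of the weak supervision topics in the outline above (labeling functions and aggregation of multiple labeling sources), here is a minimal sketch in plain Python. It is not the Snorkel API and not the course's own code: the three labeling functions, the `ABSTAIN`/`SPAM`/`HAM` constants, and the spam-detection task are all hypothetical, and the aggregation is a simple majority vote rather than a learned label model.

```python
from collections import Counter

# Hypothetical label convention for this sketch (not taken from the course code).
ABSTAIN = -1  # a labeling function returns this when it has no opinion
HAM = 0
SPAM = 1


def lf_contains_link(text: str) -> int:
    """Vote SPAM if the message contains a URL, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Vote SPAM if the message mentions a prize, otherwise abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    """Vote HAM for very short messages (4 words or fewer), otherwise abstain."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_vote(text: str) -> int:
    """Aggregate the non-abstaining votes; abstain on ties or when nobody votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting sources with equal support
    return ranked[0][0]
```

Majority vote treats every source as equally reliable. Label models such as those listed in the outline aim to improve on this by estimating how accurate each source is.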
To get started with the course, ensure you have the following:
- Access to a Machine with a GPU: Recommended for computationally intensive tasks; alternatively, use Google Colab.
- Installation of Poetry: For managing Python dependencies. Install it here (pip install poetry).
- Weights & Biases Account: For experiment tracking and visualization. Sign up here.
Follow these steps to set up the environment and dependencies:
- Clone the Repository:
  git clone https://github.com/eliasjacob/datacentric_ai_course.git
  cd datacentric_ai_course
- Install Dependencies:
  - For GPU support:
    poetry install --sync -E cuda --with cuda
    poetry shell
  - For CPU-only support:
    poetry install --sync -E cpu
    poetry shell
- Authenticate Weights & Biases:
  wandb login
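After the steps above, a quick sanity check can confirm that the environment resolved as expected. The script below is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the default package list (`torch`, `wandb`) is an assumption about what the course installs. It only checks whether each package is importable, without importing it.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "wandb")):
    """Report the Python version and whether each top-level package is importable.

    The default package names are assumptions for illustration; adjust them
    to match the dependencies actually declared by the project.
    """
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the Poetry shell so that it inspects the project's virtual environment rather than the system Python.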
This repository is configured to work with Visual Studio Code Dev Containers, providing a consistent and isolated development environment. To use this feature:
- Install Visual Studio Code and the Dev Containers extension (formerly named Remote - Containers).
- Clone this repository to your local machine (if you haven't already).
- Open the cloned repository in VS Code.
- When prompted, click "Reopen in Container", or open the command palette (F1) and select "Dev Containers: Reopen in Container" ("Remote-Containers: Reopen in Container" in older versions of the extension).
- VS Code will build the Docker container and set up the development environment. This may take a few minutes the first time.
- Once the container is built, you'll have a fully configured environment with all the necessary dependencies installed.
Using Dev Containers ensures that all course participants have the same development environment, regardless of their local setup. It also makes it easier to manage dependencies and avoid conflicts with other projects.
Once the environment is set up, you can start exploring the course materials, running code examples, and working on the practical exercises.
- Some parts of the code may require a GPU for efficient execution. If you don't have access to a GPU, consider using Google Colab.
The course employs a top-down teaching method, starting with high-level overviews and practical applications before delving into underlying details. This approach helps maintain motivation and provides a clearer picture of how different components fit together.
- Hands-On Coding: Engage actively in coding exercises and projects.
- Explaining Concepts: Articulate your understanding by writing about what you've learned or helping peers.
You'll be encouraged to follow along with coding exercises and explain your learning to others. Summarizing key points as the course progresses will also be part of the learning process.
Your final project will be evaluated based on several criteria:
- Technical Quality: How well you implement the project.
- Creativity: The originality of your approach.
- Usefulness: The practical value of your project.
- Presentation: How effectively you present your project.
- Report: The clarity and thoroughness of your report.
- Individual Work: The project must be done individually.
- Submission: Submit a link to a GitHub repository or shared folder with your code, data, and report. Use virtual environments and requirements.txt to facilitate running your code.
- Deadline: The project will be due 15 days after the end of the course.
- Submission Platform: Submit your project using the designated platform (e.g., SIGAA).
Contributions to the course repository are welcome! Follow these steps to contribute:
- Fork the Repository: Click on the "Fork" button at the top right of the repository page.
- Create a New Branch:
  git checkout -b feature/YourFeature
- Make Your Changes: Implement your feature or fix.
- Commit Your Changes:
  git commit -m 'Add some feature'
- Push to the Branch:
  git push origin feature/YourFeature
- Create a Pull Request: Go to your fork on GitHub and click the "New pull request" button.
For any questions or feedback regarding the course materials or repository, you can contact me.