MDALTH is a modular library of active learning query algorithms and stopping methods.
Our primary goal is to provide modular, framework-agnostic query algorithms and stopping methods for active learning. Our secondary goal is to provide high-level wrappers around these methods for Hugging Face users, making active learning highly accessible.
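To make the intended workflow concrete, here is a minimal, self-contained sketch of a pool-based active learning loop with entropy-based uncertainty sampling and a naive stopping heuristic. This is generic illustration code, not the MDALTH API; every name in it (e.g., `train_and_predict_proba`, `entropy_query`) is a placeholder.

```python
# Generic pool-based active learning loop: entropy-based uncertainty sampling
# plus a naive stopping heuristic. Pure numpy; the "model" is a random stub.
# None of these names come from MDALTH -- this is only an illustration.
import numpy as np

rng = np.random.default_rng(0)

def train_and_predict_proba(X_labeled, y_labeled, X_pool):
    """Stand-in for training a real model and predicting on the unlabeled pool."""
    n_classes = int(y_labeled.max()) + 1
    logits = rng.normal(size=(len(X_pool), n_classes))  # placeholder predictions
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def entropy_query(probs, k):
    """Select the k pool examples whose predictive distribution has the highest entropy."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:]

# Toy data: 1000 examples, 20 features, 4 classes.
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 4, size=1000)

labeled = list(rng.choice(len(X), size=100, replace=False))  # ~10% seed set
pool = [i for i in range(len(X)) if i not in set(labeled)]
batch_size = 50                                              # ~5% per round

prev_mean_conf = None
for round_ in range(10):
    probs = train_and_predict_proba(X[labeled], y[labeled], X[pool])
    mean_conf = probs.max(axis=1).mean()
    # Naive stopping rule: halt once mean confidence on the pool stops changing.
    if prev_mean_conf is not None and abs(mean_conf - prev_mean_conf) < 1e-3:
        print(f"stopping at round {round_}")
        break
    prev_mean_conf = mean_conf
    chosen = entropy_query(probs, batch_size)
    newly_labeled = [pool[i] for i in chosen]
    labeled.extend(newly_labeled)
    pool = [i for i in pool if i not in set(newly_labeled)]
```

MDALTH's query algorithms and stopping methods are meant to fill the roles played here by `entropy_query` and the confidence-plateau check, while the Hugging Face wrappers handle the training and prediction steps.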
Current status: prerelease
- AL learning loop
- AL evaluation loop
- basic query algorithms and accessible wrappers for them
- basic stopping algorithms
- checkpointing system
- optional run with full dataset
- accessible stopping wrappers
- break AL loop
- dump data for large-scale evaluation
- validation set control
- query size control
- testing framework
- documentation
MDALTH allows users to prototype active learning experiments quickly. In the ./example directory, we developed a simple project that runs active learning experiments for text, image, and audio classification tasks. By adjusting the command-line arguments, this example can run experiments with many different pretrained models and datasets. We run our experiments with the following hyperparameters:
- AL initial size: 10%
- AL batch size: 5%
- Max Training Epochs: 32
- Early Stopping Patience: 3
The rest of the hyperparameters are detailed in the ./example/main.sh
script, which configures and runs the experiments.
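For a concrete sense of what the percentage-based sizes mean in absolute terms, the snippet below works through the arithmetic. The pool size is only an example figure (roughly the size of the ag_news training split), not a value taken from any particular run.

```python
# Translate the percentage-based AL hyperparameters into absolute counts.
# The pool size is an illustrative figure, not taken from any specific run.
pool_size = 120_000                    # roughly the ag_news training set size
initial_size = int(0.10 * pool_size)   # 10% seed set -> 12,000 examples
batch_size = int(0.05 * pool_size)     # 5% per query -> 6,000 examples per round
n_rounds = (pool_size - initial_size) // batch_size  # rounds until the pool is exhausted
print(initial_size, batch_size, n_rounds)            # 12000 6000 18
```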
- task: text classification
- model: distilbert-base-uncased
- dataset: ag_news
Learning Curve Coming Soon!
- task: image classification
- model: google/vit-base-patch16-224-in21k
- dataset: food101
Learning Curve Coming Soon!
- task: audio classification
- model: facebook/wav2vec2-base
- dataset: speech_commands
Learning Curve Coming Soon!
Install git and a package manager, e.g., conda, on a Linux machine with a GPU (or a cluster with many GPUs, if you have one lying around). I have personally had the most success using conda for the core PyTorch install and pip for the remaining libraries, since pip is much faster.
conda create -n MDALTH python=3.11 pytorch-cuda=11.8 pytorch torchvision torchaudio torchtext -c pytorch -c nvidia
conda activate MDALTH
pip install transformers datasets tokenizers accelerate evaluate scipy scikit-learn matplotlib pandas librosa
You need to conda activate the environment each time before use. If you aren't interested in text, image, or audio classification, you don't need to install the corresponding dependencies (e.g., torchaudio and soundfile for audio classification).
We are working on improving the flexibility of our dependencies, but for now, see environment.yml for comprehensive details about dependency requirements.
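After activating the environment, a quick sanity check (plain PyTorch, nothing MDALTH-specific) confirms that a CUDA-enabled build was installed:

```python
import torch

print(torch.__version__)          # PyTorch version installed in the environment
print(torch.version.cuda)         # CUDA version the build targets (e.g., 11.8)
print(torch.cuda.is_available())  # should be True on a machine with a working GPU setup
```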
Next, clone the repository
git clone git@github.com:lkurlandski/MDALTH.git
To use mdalth components, you have two choices. You can either work within the MDALTH directory as is done in our examples, or you can pip install MDALTH directly into your environment. To do the latter,
pip install -e .
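To confirm the editable install worked, try importing the package (assuming it is importable as `mdalth`; adjust the name if the repository layout differs):

```python
# The module name "mdalth" is assumed here; check the repository layout if the import fails.
import mdalth
print(mdalth.__file__)  # should point back into the cloned MDALTH directory
```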
There may be some issues with CUDA 11.8, discussed in Issue 97041. If you get a warning about convolutional layers, try the solution below:
cd ~/anaconda/envs/MDALTH/lib
sudo ln -sfn libnvrtc.so.11.8.89 libnvrtc.so
Several Python libraries for active learning have already been proposed; however, they have significant disadvantages when compared to MDALTH. Notably, the community still lacks an open-source library of AL stopping methods, which are a crucial aspect of the AL pipeline.
ModAL wraps scikit-learn classifiers and, as such, is ill-suited for deep learning.
ALiPy is similar to ModAL and has the exact same shortcomings.
deep-active-learning, while designed to support deep learning through PyTorch, is poorly engineered, documented, and maintained.
BADGE is forked from deep-active-learning. While it implements some newer query algorithms, the repository is nowhere near capable of providing a modular toolkit for AL practitioners.
We are interested in collaborating! Message me on GitHub if you would like to get involved.
Please consider suggestions from pylint and the associated .pylintrc file. Autoformat with black --line-length=100.