multilabel-error-detection-benchmarks

Benchmarking label error detection algorithms for multi-label classification



Code to reproduce results from the paper:

Identifying Incorrect Annotations in Multi-Label Classification Data

This package is a DVC project that uses various datasets to evaluate different label quality scores for detecting annotation errors in multi-label classification. This repository is intended for scientific/benchmarking purposes only. To find label issues in your own multi-label data, use the implementation from the official cleanlab library instead, as sketched below.
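For reference, here is a minimal sketch (not part of this repository) of applying cleanlab to your own multi-label data; it assumes a recent cleanlab version, so check the cleanlab documentation for the exact import path and signature:

import numpy as np
from cleanlab.multilabel_classification import get_label_quality_scores

# Toy data: labels[i] lists the class indices annotated for example i;
# pred_probs holds out-of-sample predicted probabilities for each class.
labels = [[0, 2], [1], [0, 1, 2]]
pred_probs = np.array([[0.9, 0.1, 0.8],
                       [0.2, 0.9, 0.1],
                       [0.9, 0.1, 0.9]])  # class 1 looks doubtful for the last example

scores = get_label_quality_scores(labels, pred_probs)  # lower = more likely mislabeled
print(scores.argsort())  # examples ordered from most to least suspect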

Instructions to get started

  1. Clone the repo.
  2. [Optional] Open the repo in a devcontainer.
  3. Install the requirements with:
pip install -r requirements.txt
  4. Run the pipeline with:
dvc repro
  • The pipeline has several stages:
$ dvc dag
              +--------------+
              | make_dataset |
              +--------------+
               ***          ***
              *                *
            **                  **
+------------------+          +-------+
| get_avg_accuracy |          | train |
+------------------+          +-------+
          *                        *
          *                        *
          *                        *
  +-------------+         +---------------+
  | group_stats |         | score_classes |
  +-------------+         +---------------+
                                   *
                                   *
                                   *
                            +-----------+
                            | aggregate |
                            +-----------+
                                   *
                                   *
                                   *
                           +--------------+
                           | rank_metrics |
                           +--------------+
                                   *
                                   *
                                   *
                           +--------------+
                           | plot_metrics |
                           +--------------+
+----------------+
| plot_avg_trace |
+----------------+

A description of each stage is given below; an illustration of how the train stage's out-of-sample predicted probabilities can be produced follows the list.

$ dvc stage list
make_dataset      Create groups of datasets of different sizes & number of classes.
train             Train models and get out-of-sample predicted probabilities on the training sets.
get_avg_accuracy  Get model performance metrics on test sets, with and without label errors.
group_stats       Summarize model performance metrics for each group of datasets.
score_classes     Compute class label quality scores for each example in a dataset.
aggregate         Aggregate class label quality scores for all classes into a single score.
rank_metrics      Compute label error detection metrics for aggregated scores.
plot_metrics      Plot the label error detection and ranking metrics for the aggregated scores.
plot_avg_trace    Plot average traces of noise matrices used for noisy label generation.
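To illustrate what the train stage's "out-of-sample predicted probabilities" means, here is a hypothetical scikit-learn sketch (not the training code used in this repository; the benchmark's actual models and datasets differ):

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier

def out_of_sample_pred_probs(X, Y, n_splits=5, seed=0):
    # Probability that each class applies to each example, predicted only by
    # models that never saw that example during training (cross-validation).
    pred_probs = np.zeros(Y.shape, dtype=float)
    for train_idx, holdout_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
        clf.fit(X[train_idx], Y[train_idx])
        pred_probs[holdout_idx] = clf.predict_proba(X[holdout_idx])
    return pred_probs

X, Y = make_multilabel_classification(n_samples=500, n_classes=5, random_state=0)
pred_probs = out_of_sample_pred_probs(X, Y)  # shape (500, 5)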
  • The group_stats stage outputs two files in data/accuracies/:

    • results_group.csv: All experimental results
    • results_agg.json: Overall stats for the different aggregator methods.
  • The stages have various output files and directories; this is best viewed with dvc dag -o. Ignoring most of the intermediate files, the most relevant files are listed below (a quick loading sketch follows the list):

    • data/accuracy/results_group.csv: Statistics of model performance metrics for each group of datasets.
    • data/scores/results.csv: Class label quality scores for each example in each dataset.
    • data/scores/metrics.csv: Statistics of label error detection and ranking metrics for each group of datasets.
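For a quick programmatic look at these outputs (paths as listed above), something like the following should work once dvc repro has finished:

import pandas as pd

results = pd.read_csv("data/scores/results.csv")   # class label quality scores per example
metrics = pd.read_csv("data/scores/metrics.csv")   # detection / ranking metrics per group
print(results.head())
print(metrics.head())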
  5. Inspect the synthetic datasets in the notebooks/inspect_generated_data.ipynb notebook.
  6. Inspect the results in the notebooks/inspect_score_results.ipynb notebook.

Aggregation methods to pool per-class annotation scores into an overall label quality score for each example

Along with the typical np.mean, np.median, np.min, np.max aggregators, we also implement several methods, found in src/evaluation/aggregate.py (a sketch of one of them follows this list):

  • softmin_pooling
  • log_transform_pooling
  • cumulative_average
  • simple_moving_average
  • exponential_moving_average
  • weighted_cumulative_average
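The exact implementations live in src/evaluation/aggregate.py; as a rough illustration of the idea behind softmin_pooling, here is a hand-rolled sketch (the repository's formulation and temperature handling may differ), assuming per-class scores in [0, 1] where lower means more likely mislabeled:

import numpy as np

def softmin_pooling(class_scores, temperature=0.1):
    # Pool per-class scores into one score per example, letting the worst
    # (lowest) class scores dominate via a softmax over the negated scores.
    s = np.asarray(class_scores, dtype=float)
    logits = -s / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights * s).sum(axis=1)

# The second example has one badly-scored class, so its pooled score is
# pulled toward that minimum instead of being averaged away.
scores = np.array([[0.90, 0.80, 0.95],
                   [0.90, 0.10, 0.95]])
print(softmin_pooling(scores))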

CelebA analysis

See the Examples Notebooks in our examples repository for:

  • the PyTorch code we used to train a multi-label classifier model on CelebA
  • the code to find mislabeled images in this dataset

data/celeba/celeba_label_errors.csv in this repository contains label quality scores for each image in the CelebA dataset, along with a boolean is_issue column indicating which images cleanlab identified as having a label issue.
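A quick way to inspect this file (column names other than is_issue are whatever the CSV header defines):

import pandas as pd

df = pd.read_csv("data/celeba/celeba_label_errors.csv")
print(df.head())  # per-image label quality scores
print(df["is_issue"].sum(), "images flagged as label issues by cleanlab")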