/slice-discovery-human-eval

"Where Does My Model Underperform?" 🔎: Code & Data

A figure from the paper showing different users' hypotheses for two example slices.

Public repository for the paper, "Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms", presented at AAAI HCOMP 2023.

Getting started

To run the code in this repository, first install the dependencies:

conda env create -f environment.yml
conda activate slicies
python -m ipykernel install --user --name slicies --display-name slicies
pip install git+https://github.com/openai/CLIP.git
pip install meerkat-ml==0.2.5

Data

We release (1) the images that belong to the slices we showed to users, and (2) information about users' hypotheses corresponding to these slices.

Slices shown to users

Each CSV file in the slices folder details which images in the MS-COCO dataset belonged to each slice that we showed users in our study.

  • Each row corresponds to an image belonging to our custom "test" split of MS-COCO (on which we ran the slice discovery algorithms).
    • A row has the value -1 in a slice column if its image is not one of the top-20 images in that slice.
    • If an image does belong to a slice, then its slice column value will be set to a non-negative integer denoting its representativeness ranking in the slice (with value 0 being the "most representative").

The explore_slices notebook shows an example of how to read these CSV files, and visualize the images belonging to each slice.
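
As a minimal sketch of the encoding described above (not the code in explore_slices), the snippet below recovers a slice's images in ranking order with pandas. The column names `file_id`, `slice_0`, and `slice_1`, the helper `images_in_slice`, and the toy values are assumptions for illustration; check the header of the actual CSV files in slices before adapting it.

```python
import pandas as pd

# Toy stand-in for one slices CSV. Column names and values are hypothetical;
# a real file would be loaded with pd.read_csv on a CSV from the slices folder.
toy = pd.DataFrame(
    {
        "file_id": [101, 102, 103, 104, 105],
        "slice_0": [2, -1, 0, -1, 1],   # -1 = not among the slice's top images
        "slice_1": [-1, 0, -1, 1, -1],
    }
)


def images_in_slice(df: pd.DataFrame, slice_col: str, id_col: str = "file_id") -> list:
    """Return the image IDs in a slice, most representative (rank 0) first."""
    members = df[df[slice_col] >= 0]          # drop rows marked -1
    ranked = members.sort_values(slice_col)   # rank 0 comes first
    return ranked[id_col].tolist()


slice_0_images = images_in_slice(toy, "slice_0")
slice_1_images = images_in_slice(toy, "slice_1")
print(slice_0_images)
print(slice_1_images)
```

Sorting by the slice column works because the ranking is stored directly as the column value, with 0 as the most representative image.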

Users' hypotheses

The user_hypotheses.csv spreadsheet contains all 180 user hypotheses, written by 15 different users. Each of the 60 unique slices was shown to 3 users, and each user wrote their own hypothesis for the slice (one hypothesis = one row).

Columns:

  • user_id is the unique ID of the user.
  • algorithm_condition is the algorithm condition (domino, ps (PlaneSpot), or baseline) for the slice shown to the user.
  • slice_no denotes which slice was shown to the user (where the slice IDs are consistent with those in the slices spreadsheets).
  • class_idx denotes the MS-COCO 2017 class associated with the slice. See our class-to-readable object name map here.
  • user_hypothesis contains the user's hypothesis. As described in the paper, the research team made slight modifications to the text that users wrote down (e.g., adding the phrase "a photo of...") so that a consistent set of prompts could be used for text-to-image retrieval (as part of hypothesis validation).
  • user_selected_fids contains the COCO 2017 numeric file IDs for the images that the user selected as matching their hypothesis, from the set of 20 images in the slice that they were shown.
  • contrastively_labeled_fids contains the COCO 2017 file IDs for the images that the research team retrieved for hypothesis validation using a contrastive retrieval strategy (detailed in paper Appendix B). The list only includes file IDs for images that the researchers manually verified as "matching" the hypothesis (see the labeling guide for further clarification).
  • individually_labeled_fids contains the COCO 2017 file IDs for the images that the research team retrieved for hypothesis validation using an unmodified retrieval strategy, i.e., ranking images by their CLIP similarity score to the hypothesis text description only (detailed in paper Appendix B). As above, the list only includes file IDs for images that the researchers manually verified as "matching" the hypothesis.
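
The structure described above (each slice shown to 3 users, one row per hypothesis) can be sanity-checked once the spreadsheet is loaded. The sketch below uses a toy table with the same column names. It assumes that a slice is identified by the pair (algorithm_condition, slice_no) and that the file-ID columns are stored as list-like strings such as "[11, 12]"; neither assumption is confirmed by the column descriptions, so inspect the real file first.

```python
import ast

import pandas as pd

# Toy stand-in for user_hypotheses.csv (all values are made up for illustration).
toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 1, 2, 3],
        "algorithm_condition": ["domino"] * 3 + ["ps"] * 3,
        "slice_no": [0, 0, 0, 4, 4, 4],
        "user_hypothesis": ["a photo of ..."] * 6,
        "user_selected_fids": ["[11, 12]", "[11]", "[]", "[21, 22, 23]", "[21]", "[22, 23]"],
    }
)

# Assumption: file-ID lists are serialized as Python-style list strings.
toy["user_selected_fids"] = toy["user_selected_fids"].apply(ast.literal_eval)

# Each slice should have been shown to exactly 3 users (one row per hypothesis).
# Assumption: (algorithm_condition, slice_no) identifies a slice.
users_per_slice = toy.groupby(["algorithm_condition", "slice_no"])["user_id"].nunique()
all_slices_have_three_users = bool((users_per_slice == 3).all())

# Number of images (out of the 20 shown) that each user marked as matching.
toy["n_selected"] = toy["user_selected_fids"].apply(len)

print(all_slices_have_three_users)
print(toy["n_selected"].tolist())
```

If the assumptions hold, running the same check on the real spreadsheet should report 60 slice groups and 180 rows in total.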

Hypothesis labeling guide

See this spreadsheet for the criteria our research team used to determine whether an image matched each hypothesis text description.

Code

  • demo.ipynb shows how to use code that (1) creates a custom train-validation-test split of MS-COCO, (2) trains a model using this custom split, (3) runs each of the three slice discovery algorithms to compute slices, and (4) uses CLIP to retrieve the most similar images to a text description prompt (for hypothesis validation).
  • explore_slices.ipynb demonstrates how to work with the study data we provide, e.g., how to load and process the spreadsheets in slices. It can be used to explore the slices that we showed users and their corresponding hypotheses.
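
Step (4) of demo.ipynb, text-to-image retrieval, amounts to ranking images by how similar their embeddings are to the embedding of a text prompt. The following is a minimal sketch of that ranking step only, using random NumPy vectors as stand-ins for CLIP embeddings. The function name `rank_by_similarity`, the choice of cosine similarity, and the 512-dimensional vectors are assumptions for illustration; demo.ipynb remains the reference for how the repository actually performs retrieval.

```python
import numpy as np


def rank_by_similarity(image_embs: np.ndarray, text_emb: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k images most similar to the text (cosine similarity)."""
    # Normalize so that the dot product equals cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    scores = image_embs @ text_emb
    # Sort in descending order of similarity and keep the top k.
    return np.argsort(-scores)[:k]


# Hypothetical stand-ins for embeddings produced by CLIP's image and text encoders.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 512))
# Build a "text" embedding that is very close to image 42, so it should rank first.
text_emb = image_embs[42] + 0.01 * rng.normal(size=512)

top5 = rank_by_similarity(image_embs, text_emb, k=5)
print(top5)
```

With real data, `image_embs` and `text_emb` would come from the CLIP model installed in the setup steps, and the returned indices would be mapped back to COCO file IDs.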