A Decade's Battle on Dataset Bias: Are We There Yet?

A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu and Kaiming He
Meta AI Research, FAIR
[arXiv] [code]


These images are sampled from three modern datasets: YFCC, CC, and DataComp. Can you specify which dataset each image is from? While these datasets appear to be less biased, we discover that neural networks can easily accomplish this “dataset classification” task with surprisingly high accuracy on the held-out validation set.

Answer (click) YFCC: 1, 4, 7, 10, 13; CC: 2, 5, 8, 11, 14; DataComp: 3, 6, 9, 12, 15.

Code

We use the code from ConvNeXt. Please follow the instructions there for setup.

Dataset Preparation

Download images from each dataset and organize them as follows:

/path/to/datasets_root/
  train/
    dataset1/
      ...
    dataset2/
      ...
    dataset3/
      ...
  val/
    dataset1/
      ...
    dataset2/
      ...
    dataset3/
      ...

Training

We give example commands for single-machine and multi-node training below.

Multi-node

python run_with_submitit.py --nodes 4 --ngpus 8 \
--model convnext_tiny --opt_betas 0.9 0.95 \
--batch_size 128 --lr 1e-3 --update_freq 1 \
--weight_decay 0.3 --reprob 0 \
--data_set image_folder --nb_classes 3 \ 
--data_path /path/to/datasets_root/train \
--eval_data_path /path/to/datasets_root/val \
--job_dir /path/to/save_results

Single-machine

python -m torch.distributed.launch --nproc_per_node=8 main.py \
--model convnext_tiny --opt_betas 0.9 0.95 \
--batch_size 128 --lr 1e-3 --update_freq 1 \
--weight_decay 0.3 --reprob 0 \
--data_set image_folder --nb_classes 3 \ 
--data_path /path/to/datasets_root/train \
--eval_data_path /path/to/datasets_root/val \
--output_dir /path/to/save_results

LICENSE

This project is released under the MIT license. Please see the LICENSE file for more information.

Citation

@article{liu2024decade,
  title   = {A Decade's Battle on Dataset Bias: Are We There Yet?},
  author  = {Zhuang Liu and Kaiming He},
  year    = {2024},
  journal = {arXiv preprint arXiv:2403.08632},
}