You Only Condense Once: Two Rules for Pruning Condensed Datasets (NeurIPS 2023)

You Only Condense Once (YOCO)

[Paper] [BibTeX]

On top of one condensed dataset, YOCO produces smaller condensed datasets with two embarrassingly simple dataset pruning rules, Low LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can flexibly resize the dataset to fit varying computational constraints, and 2) it eliminates the need for extra condensation processes, which can be computationally prohibitive.

Getting Started

First, download our repo:

git clone https://github.com/he-y/you-only-condense-once.git
cd you-only-condense-once

Second, create a conda environment. The code has been tested with PyTorch 1.11.0 and Python 3.9.15.

# create conda environment
conda create -n yoco python=3.9
conda activate yoco

Third, install the required dependencies:

pip install -r requirements.txt

Our code is mainly based on two repositories:

Main Files of the Repo

  • get_training_dynamics.py trains a model on a condensed dataset and tracks the training dynamics.
  • generate_importance_score.py generates importance scores from the stored training dynamics files.
  • utils/img_loader.py loads a condensed dataset at the target IPC according to the pre-computed importance scores.

Module 1: Condensed Dataset Preparation (Google Drive File)

The condensed datasets used in our experiments can be downloaded from Google Drive. The downloaded datasets should follow the file structure below:

YOCO
- raid
  - condensed_img
    - dream
    - idc
    - ...
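
Before running anything, it can help to confirm that the download landed in the expected place. The snippet below is a minimal sketch, not part of the repo: the function name missing_condensed_dirs is hypothetical, and it assumes the YOCO folder in the tree above is the directory you pass as root.

```python
from pathlib import Path


def missing_condensed_dirs(root, keys=("idc", "dream")):
    """Return the keys that have no folder under <root>/raid/condensed_img."""
    base = Path(root) / "raid" / "condensed_img"
    return [key for key in keys if not (base / key).is_dir()]


if __name__ == "__main__":
    # "." assumes the script is started from the YOCO folder shown above.
    missing = missing_condensed_dirs(".")
    print("missing condensed datasets:", missing or "none")
```

Pass a longer keys tuple if you downloaded condensed datasets from other methods as well.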

condense_key in the table below selects the condensation method whose condensed dataset is evaluated. Our experiments are mainly run on IDC, so the default setting is condense_key = idc.

condense_key   Description
idc            Dataset Condensation via Efficient Synthetic-Data Parameterization (IDC)
dream          Efficient Dataset Distillation by Representative Matching (DREAM)
mtt            Dataset Distillation by Matching Training Trajectories (MTT)
dsa            Dataset Condensation with Differentiable Siamese Augmentation (DSA)
kip            Dataset Distillation with Infinitely Wide Convolutional Networks (KIP)

To condense a dataset yourself, run:

python condense.py --reproduce_condense -d [dataset] -f [factor] --ipc [images per class]

Module 2: Pruning the Condensed Datasets via Three Steps (Google Drive File)

Step 1: Generate the training dynamics from the condensed dataset (or directly download our generated training dynamics here):

python get_training_dynamics.py --dataset [dataset] --ipc [IPCF] --condense_key [condensation method]

Step 2: Generate the score file for each image from the training dynamics:

python generate_importance_score.py --dataset [dataset] --ipc [IPCF] --condense_key [condensation method]

Step 3: Evaluate the performance using different dataset pruning metrics:

python test.py -d [dataset] --ipc [IPCF] --slct_ipc [IPCT] --pruning_key [pruning method] --condense_key [condensation method]
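
The three steps can also be chained from Python. This is a minimal sketch rather than a script shipped with the repo: build_pipeline_commands is a hypothetical helper, it only uses the flags shown in the commands above, and it assumes ipc is the IPC of the condensed dataset and slct_ipc the smaller target IPC (our reading of IPCF and IPCT). The dataset name and IPC values in the demo are placeholders.

```python
import shlex
import sys


def build_pipeline_commands(dataset, ipc, slct_ipc, pruning_key="yoco", condense_key="idc"):
    """Return the three pipeline commands (as argument lists) in execution order."""
    common = ["--ipc", str(ipc), "--condense_key", condense_key]
    return [
        [sys.executable, "get_training_dynamics.py", "--dataset", dataset, *common],
        [sys.executable, "generate_importance_score.py", "--dataset", dataset, *common],
        [sys.executable, "test.py", "-d", dataset, "--ipc", str(ipc),
         "--slct_ipc", str(slct_ipc), "--pruning_key", pruning_key,
         "--condense_key", condense_key],
    ]


if __name__ == "__main__":
    # Placeholder values: "cifar100" and IPC 10 -> 5 are illustrative only.
    for cmd in build_pipeline_commands("cifar100", ipc=10, slct_ipc=5):
        print(shlex.join(cmd))
        # To execute a step instead of printing it:
        #   import subprocess; subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to check the arguments before committing to a long training run.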

pruning_key denotes different dataset pruning methods including:

pruning_key          Description                          Prefer hard/easy?   Balanced?
random               Random Selection                     N/A                 no
ssp                  Self-Supervised Prototype            hard                no
entropy              Entropy                              hard                no
accumulated_margin   Area Under the Margin                hard                no
forgetting           Forgetting score                     hard                no
el2n                 EL2N score                           hard                no
ccs                  Coverage-centric Coreset Selection   easy                no
yoco                 Our method                           easy                yes

  • Prefer hard/easy? indicates whether the method prefers hard samples or easy samples.
  • Balanced? indicates whether the method keeps the class distribution balanced.
  • In our implementation, ccs prunes the hard images identified by the el2n score.
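
To make the "easy + balanced" combination concrete, here is a minimal sketch of a balanced selection over per-image scores. It is not the repo's implementation: select_balanced_easy is a hypothetical name, and the sketch assumes that a lower score means an easier image and that every class should keep the same number of images.

```python
from collections import defaultdict


def select_balanced_easy(scores, labels, target_ipc):
    """Keep the indices of the target_ipc lowest-scoring images of every class."""
    by_class = defaultdict(list)
    for index, (score, label) in enumerate(zip(scores, labels)):
        by_class[label].append((score, index))

    kept = []
    for label in sorted(by_class):
        ranked = sorted(by_class[label])  # ascending: lowest score first
        kept.extend(index for _, index in ranked[:target_ipc])
    return sorted(kept)
```

An imbalanced variant would instead rank all images together and keep the lowest-scoring ones overall, which can leave some classes with fewer images than others.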

To alter the components of a metric, append the following suffixes to pruning_key:

Suffix                  Explanation
_easy / _hard           Prefer easy / hard samples
_balance / _imbalance   Use a balanced / imbalanced class distribution

For example, the default forgetting metric is equivalent to forgetting_hard_imbalance: it prefers hard samples and does not balance classes.

  • Change to forgetting_easy to prefer easy samples.
  • Change to forgetting_balance to construct class-balanced samples.
  • Change to forgetting_easy_balance or forgetting_balance_easy to prefer easy samples and balance classes.
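
The naming convention can be summarized in a few lines of Python. This is a sketch of the convention only, not the parser used in the repo: parse_pruning_key and the DEFAULTS table are hypothetical, with the defaults copied from the pruning_key table above.

```python
# Defaults copied from the pruning_key table: (preferred difficulty, balanced?).
DEFAULTS = {
    "random": (None, False),
    "ssp": ("hard", False),
    "entropy": ("hard", False),
    "accumulated_margin": ("hard", False),
    "forgetting": ("hard", False),
    "el2n": ("hard", False),
    "ccs": ("easy", False),
    "yoco": ("easy", True),
}
SUFFIXES = {"easy", "hard", "balance", "imbalance"}


def parse_pruning_key(key):
    """Split a key such as 'forgetting_easy_balance' into (metric, prefer, balanced)."""
    parts = key.split("_")
    found = []
    # Strip suffixes from the right so metric names containing "_" stay intact.
    while len(parts) > 1 and parts[-1] in SUFFIXES:
        found.append(parts.pop())
    metric = "_".join(parts)
    if metric not in DEFAULTS:
        raise ValueError(f"unknown pruning metric: {metric!r}")
    prefer, balanced = DEFAULTS[metric]
    for suffix in found:
        if suffix in ("easy", "hard"):
            prefer = suffix
        else:
            balanced = suffix == "balance"
    return metric, prefer, balanced
```

Because suffixes are stripped from the right, the order of the two suffixes does not matter, which matches forgetting_easy_balance and forgetting_balance_easy being equivalent.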

Reproducing the Tables

To make the experiment results easy to reproduce, we provide a bash script for each table. The scripts can be found in scripts/table[x].sh. The training dynamics and scores used in our experiments can be downloaded from Google Drive. Note: the training dynamics contain large files (e.g., idc/cifar100 is ~6GB).

The downloaded files should follow the file structure below:

YOCO
- raid
  - reproduce_*
      - dynamics_and_scores
        - idc
        - dream
        - ...
  - condensed_img (download from Module 1)
    - idc
    - dream
    - ...
  • Our experiment results are averaged over three independent training dynamics, which correspond to the folders reproduce_1, reproduce_2, and reproduce_3.
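
As a small illustration of the three-run layout, the sketch below lists the per-run folders and averages a set of per-run accuracies. Both helpers are hypothetical and not part of the repo; reporting the standard deviation next to the mean is our addition, and root is assumed to be the YOCO folder from the tree above.

```python
from pathlib import Path
from statistics import mean, stdev


def dynamics_dirs(root, condense_key="idc", runs=(1, 2, 3)):
    """List <root>/raid/reproduce_<n>/dynamics_and_scores/<condense_key> per run."""
    raid = Path(root) / "raid"
    return [raid / f"reproduce_{n}" / "dynamics_and_scores" / condense_key for n in runs]


def summarize(accuracies):
    """Return (mean, sample standard deviation) over per-run accuracies."""
    return mean(accuracies), stdev(accuracies)
```

Collect one accuracy per reproduce_* folder from your own test.py runs and pass the three values to summarize.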

Citation

@inproceedings{
    heyoco2023,
    title={You Only Condense Once: Two Rules for Pruning Condensed Datasets},
    author={Yang He and Lingao Xiao and Joey Tianyi Zhou},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=AlTyimRsLf}
}