You Only Condense Once (YOCO)

[Paper] [BibTeX]

On top of one condensed dataset, YOCO produces smaller condensed datasets with two embarrassingly simple dataset pruning rules, Low LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can flexibly resize the dataset to fit varying computational constraints, and 2) it eliminates the need for extra condensation processes, which can be computationally prohibitive.

Getting Started

First, download our repo:

git clone https://github.com/he-y/you-only-condense-once.git
cd you-only-condense-once

Second, create a conda environment. The code has been tested with PyTorch 1.11.0 and Python 3.9.15.

# create conda environment
conda create -n yoco python=3.9
conda activate yoco

Third, install the required dependencies:

pip install -r requirements.txt

Our code is mainly based on two repositories:

Main Files of the Repo

get_training_dynamics.py trains a model on a condensed dataset and tracks the training dynamics.
generate_importance_score.py generates importance scores from the stored training dynamics files.
utils/img_loader.py loads condensed datasets with target IPC according to the pre-computed importance scores.

Module 1: Condensed Dataset Preparation (Google Drive File)

The condensed datasets used in our experiments can be downloaded from Google Drive. The downloaded datasets should follow the file structure below:

YOCO
- raid
  - condensed_img
    - dream
    - idc
    - ...

In the table below, condense_key denotes the condensation method that produced the dataset being evaluated. Our experiments are mainly run on IDC, so the default setting is condense_key = idc.

`condense_key`	Description
`idc`	Dataset Condensation via Efficient Synthetic-Data Parameterization (IDC)
`dream`	Efficient Dataset Distillation by Representative Matching (DREAM)
`mtt`	Dataset Distillation by Matching Training Trajectories (MTT)
`dsa`	Dataset Condensation with Differentiable Siamese Augmentation (DSA)
`kip`	Dataset Distillation with Infinitely Wide Convolutional Networks (KIP)

If you want to condense the datasets yourself, run:

python condense.py --reproduce_condense  -d [dataset] -f [factor] --ipc [images per class]

Module 2: Pruning the Condensed Datasets via Three Steps (Google Drive File)

Step 1: Generate the training dynamics from the condensed dataset (or directly download our generated training dynamics here):

python get_training_dynamics.py --dataset [dataset] --ipc [IPCF] --condense_key [condensation method]

Step 2: Generate the score file for each image according to the training dynamics:

python generate_importance_score.py --dataset [dataset] --ipc [IPCF] --condense_key [condensation method]

Step 3: Evaluate the performance using different dataset pruning metrics:

python test.py -d [dataset] --ipc [IPCF] --slct_ipc [IPCT] --pruning_key [pruning method] --condense_key [condensation method]
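
The three steps can also be chained from Python. The following is a minimal sketch, not part of the repository: build_pipeline_commands is a hypothetical helper that only assembles the command lines shown above, using the flags exactly as written in this README. The dataset name "cifar100" and the IPC values 10 and 1 are placeholder assumptions, not settings taken from the paper.

```python
import shlex


def build_pipeline_commands(dataset, ipc_full, ipc_target,
                            condense_key="idc", pruning_key="yoco"):
    """Return the three README commands as argument lists (nothing is executed)."""
    shared = ["--ipc", str(ipc_full), "--condense_key", condense_key]
    return [
        # Step 1: record training dynamics on the condensed dataset.
        ["python", "get_training_dynamics.py", "--dataset", dataset, *shared],
        # Step 2: turn the stored dynamics into per-image scores.
        ["python", "generate_importance_score.py", "--dataset", dataset, *shared],
        # Step 3: prune from ipc_full (IPCF) to ipc_target (IPCT) and evaluate.
        ["python", "test.py", "-d", dataset, "--ipc", str(ipc_full),
         "--slct_ipc", str(ipc_target), "--pruning_key", pruning_key,
         "--condense_key", condense_key],
    ]


if __name__ == "__main__":
    # Placeholder values (assumptions): dataset name and IPC settings.
    for cmd in build_pipeline_commands("cifar100", ipc_full=10, ipc_target=1):
        print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to subprocess.run(cmd, check=True) from the repository root, so that a failure in one step stops the following ones.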

pruning_key denotes the dataset pruning method, one of:

`pruning_key`	Description	Prefer hard/easy?	Balanced?
`random`	Random Selection	N/A	no
`ssp`	Self-Supervised Prototype	hard	no
`entropy`	Entropy	hard	no
`accumulated_margin`	Area Under the Margin	hard	no
`forgetting`	Forgetting score	hard	no
`el2n`	EL2N score	hard	no
`ccs`	Coverage-centric Coreset Selection	easy	no
`yoco`	Our method	easy	yes

Prefer hard/easy? indicates whether the method prefers hard samples or easy samples.
Balanced? indicates whether the method keeps the class distribution balanced.
ccs prunes the hard images identified by the el2n score (in our implementation).
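
The two table columns can be illustrated with a small selection routine. This is a minimal sketch and not the repository's loader: select_indices is a hypothetical helper, and it assumes that a lower score marks an easier sample. With balanced=True it keeps the same number of samples from every class; otherwise it ranks all samples together, which can leave some classes empty.

```python
from collections import defaultdict


def select_indices(scores, labels, n_keep, prefer="easy", balanced=True):
    """Pick sample indices by score.

    Assumption: a LOWER score marks an EASIER sample. With balanced=True,
    n_keep is the number kept per class; otherwise it is the total kept.
    """
    def ranked(indices):
        # Easy-first sorts ascending, hard-first sorts descending.
        return sorted(indices, key=lambda i: scores[i], reverse=(prefer == "hard"))

    if not balanced:
        return ranked(range(len(scores)))[:n_keep]

    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    selected = []
    for label in sorted(by_class):
        selected.extend(ranked(by_class[label])[:n_keep])
    return selected


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]  # made-up scores for illustration
    labels = [0, 0, 0, 1, 1, 1]
    print(select_indices(scores, labels, 1))                  # [0, 3]
    print(select_indices(scores, labels, 2, balanced=False))  # [0, 1]
    print(select_indices(scores, labels, 2, prefer="hard", balanced=False))  # [5, 4]
```

In the toy data, the imbalanced easy selection keeps only class 0, while the balanced selection keeps one sample from each class.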

To alter the components of each metric, append the following suffixes to pruning_key:

suffix	explanation
`_easy` / `_hard`	Whether to use easy / hard samples
`_balance` / `_imbalance`	Whether to have balance / imbalance class distribution

For example, the default forgetting metric is equivalent to forgetting_hard_imbalance: it prefers hard samples and does not balance classes.

Change to forgetting_easy to prefer easy samples.
Change to forgetting_balance to construct balanced samples.
Change to forgetting_easy_balance or forgetting_balance_easy to prefer easy samples and construct balanced samples.
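
The suffix convention can be made concrete with a short parser. This is a sketch under stated assumptions, not the repository's parsing code: parse_pruning_key is a hypothetical helper whose defaults follow the forgetting example above (hard, imbalanced). Metrics with other defaults, such as ccs or yoco, would need different default arguments.

```python
def parse_pruning_key(key, default_prefer="hard", default_balanced=False):
    """Split a key such as 'forgetting_easy_balance' into its parts.

    Returns (metric, prefer, balanced). The defaults are assumptions that
    match the forgetting example only.
    """
    suffixes = {
        "easy": ("prefer", "easy"),
        "hard": ("prefer", "hard"),
        "balance": ("balanced", True),
        "imbalance": ("balanced", False),
    }
    options = {"prefer": default_prefer, "balanced": default_balanced}
    parts = key.split("_")
    # Strip recognised suffixes from the right, so suffix order does not
    # matter and metric names containing underscores stay intact.
    while len(parts) > 1 and parts[-1] in suffixes:
        name, value = suffixes[parts.pop()]
        options[name] = value
    return "_".join(parts), options["prefer"], options["balanced"]


if __name__ == "__main__":
    print(parse_pruning_key("forgetting"))               # ('forgetting', 'hard', False)
    print(parse_pruning_key("forgetting_balance_easy"))  # ('forgetting', 'easy', True)
    print(parse_pruning_key("accumulated_margin_easy"))  # ('accumulated_margin', 'easy', False)
```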

Reproducing the Tables

To make the experiment results easy to reproduce, we provide a bash script for each table. The scripts can be found in scripts/table[x].sh. The training dynamics and scores used in our experiments can be downloaded from Google Drive. Note: the training dynamics contain large files (e.g., idc/cifar100 is ~6GB).

The downloaded files should follow the file structure below:

YOCO
- raid
  - reproduce_*
      - dynamics_and_scores
        - idc
        - dream
        - ...
  - condensed_img (download from Module 1)
    - idc
    - dream
    - ...

Our experiment results are averaged over three independent sets of training dynamics, which correspond to the folders reproduce_1, reproduce_2, and reproduce_3.
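
Before launching the table scripts, it can help to confirm that the downloads landed in the expected places. The following is a minimal sketch: missing_dirs is a hypothetical helper, not part of the repository, and it checks only the directories named in the tree above for one condense_key.

```python
from pathlib import Path


def missing_dirs(root, condense_key="idc", n_runs=3):
    """List expected directories that are absent under root (the YOCO folder)."""
    raid = Path(root) / "raid"
    # Condensed images from Module 1.
    expected = [raid / "condensed_img" / condense_key]
    # Dynamics and scores for reproduce_1 ... reproduce_<n_runs>.
    expected += [
        raid / f"reproduce_{run}" / "dynamics_and_scores" / condense_key
        for run in range(1, n_runs + 1)
    ]
    return [str(path) for path in expected if not path.is_dir()]


if __name__ == "__main__":
    # "." assumes the script is run from the repository root.
    for path in missing_dirs("."):
        print("missing:", path)
```

An empty result means the layout for that condense_key matches the tree; it does not verify the contents of the folders.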

Citation

@inproceedings{
    heyoco2023,
    title={You Only Condense Once: Two Rules for Pruning Condensed Datasets},
    author={Yang He and Lingao Xiao and Joey Tianyi Zhou},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=AlTyimRsLf}
}