On top of one condensed dataset, YOCO produces smaller condensed datasets with two embarrassingly simple dataset pruning rules, Low LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can flexibly resize the dataset to fit varying computational constraints, and 2) it eliminates the need for extra condensation processes, which can be computationally prohibitive.
First, download our repo:
git clone https://github.com/he-y/you-only-condense-once.git
cd you-only-condense-once
Second, create conda environment:
The code has been tested with PyTorch 1.11.0 and Python 3.9.15.
# create conda environment
conda create -n yoco python=3.9
conda activate yoco
Third, install the required dependencies:
pip install -r requirements.txt
Our code is mainly based on two repositories:
Main Files of the Repo
- `get_training_dynamics.py` trains a model on the condensed datasets and tracks the training dynamics.
- `generate_importance_score.py` generates importance scores from the stored training dynamics files.
- `utils/img_loader.py` loads condensed datasets with the target IPC according to the pre-computed importance scores.
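As a rough illustration of what loading a target IPC (images per class) from pre-computed scores involves, the sketch below keeps the `target_ipc` lowest-scoring images of every class, in the spirit of the Low LBPE Score and Balanced Construction rules named above. The function name `select_balanced_low_score`, the assumption that a lower score marks an image to keep, and the toy scores are all hypothetical; this is not the code in `utils/img_loader.py`.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def select_balanced_low_score(
    scores: Sequence[float], labels: Sequence[int], target_ipc: int
) -> List[int]:
    """Return indices of the `target_ipc` lowest-scoring images per class.

    Hypothetical helper: assumes one importance score per condensed image
    and that a lower score marks an image to keep.
    """
    # Group image indices by class label.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    selected: List[int] = []
    for label in sorted(by_class):
        # Rank this class by score (ties broken by index), keep the first target_ipc.
        ranked = sorted(by_class[label], key=lambda i: (scores[i], i))
        selected.extend(ranked[:target_ipc])
    return selected


# Toy example: 2 classes with 3 condensed images each, pruned to 2 per class.
toy_scores = [0.9, 0.1, 0.4, 0.3, 0.8, 0.2]
toy_labels = [0, 0, 0, 1, 1, 1]
kept = select_balanced_low_score(toy_scores, toy_labels, target_ipc=2)
print(kept)  # [1, 2, 5, 3]
```

Because every class contributes the same number of images, the pruned set stays class-balanced whatever the score distribution looks like.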
Module 1: Condensed Dataset Preparation (Google Drive File)
The condensed datasets used in our experiments can be downloaded from Google Drive. The downloaded datasets should follow the file structure below:
- YOCO
  - raid
    - condensed_img
      - dream
      - idc
      - ...
`condense_key` in the table below denotes which condensation method produced the condensed datasets being evaluated. Our experiments are mainly conducted on IDC, so the default setting is `condense_key = idc`.
If you want to condense the datasets yourself, run:
python condense.py --reproduce_condense -d [dataset] -f [factor] --ipc [images per class]
Module 2: Pruning the Condensed Datasets via Three Steps (Google Drive File)
Step 1: Generate the training dynamics from the condensed dataset (or directly download our generated training dynamics here):
python get_training_dynamics.py --dataset [dataset] --ipc [IPCF] --condense_key [condensation method]
Step 2: Generate the score file for each image according to the training dynamics:
python generate_importance_score.py --dataset [dataset] --ipc [IPCF] --condense_key [condensation method]
Step 3: Evaluate the performance using different dataset pruning metrics:
python test.py -d [dataset] --ipc [IPCF] --slct_ipc [IPCT] --pruning_key [pruning method] --condense_key [condensation method]
`pruning_key` selects the dataset pruning method. The available methods are:
| `pruning_key` | Description | Prefer hard/easy? | Balanced? |
|---|---|---|---|
| `random` | Random Selection | N/A | no |
| `ssp` | Self-Supervised Prototype | hard | no |
| `entropy` | Entropy | hard | no |
| `accumulated_margin` | Area Under the Margin | hard | no |
| `forgetting` | Forgetting score | hard | no |
| `el2n` | EL2N score | hard | no |
| `ccs` | Coverage-centric Coreset Selection | easy | no |
| `yoco` | Our method | easy | yes |
- "Prefer hard/easy?" indicates whether the method prefers hard samples or easy samples.
- "Balanced?" indicates whether the method takes class balance into account.
- `ccs` prunes hard images identified by the `el2n` score (in our implementation).
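Steps 1 to 3 above share most of their arguments, so they can be driven from one script. The sketch below only assembles the three command lines shown above and prints them; nothing is executed unless `dry_run=False`. The helper names `build_pipeline` and `run_pipeline`, the default `pruning_key="yoco"`, and the example values (`cifar100`, IPC 10 pruned to 1) are assumptions for illustration, not part of the repo.

```python
import shlex
import subprocess
from typing import List


def build_pipeline(dataset: str, ipc: int, slct_ipc: int,
                   pruning_key: str = "yoco",
                   condense_key: str = "idc") -> List[List[str]]:
    """Return the three Module 2 commands, in order, as argument lists."""
    shared = ["--ipc", str(ipc), "--condense_key", condense_key]
    return [
        # Step 1: train on the condensed dataset and record training dynamics.
        ["python", "get_training_dynamics.py", "--dataset", dataset, *shared],
        # Step 2: turn the stored dynamics into per-image scores.
        ["python", "generate_importance_score.py", "--dataset", dataset, *shared],
        # Step 3: prune from `ipc` down to `slct_ipc` and evaluate.
        ["python", "test.py", "-d", dataset, "--ipc", str(ipc),
         "--slct_ipc", str(slct_ipc), "--pruning_key", pruning_key,
         "--condense_key", condense_key],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(cmd, check=True)


# Placeholder values; substitute the dataset and IPC settings you actually use.
commands = build_pipeline("cifar100", ipc=10, slct_ipc=1)
run_pipeline(commands)  # dry run: prints the three commands
```

Passing argument lists to `subprocess.run` instead of one shell string avoids quoting problems if a dataset name or key ever contains special characters.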
To alter the components of a metric, append the following suffixes to `pruning_key`:
| suffix | explanation |
|---|---|
| `_easy` / `_hard` | Whether to use easy / hard samples |
| `_balance` / `_imbalance` | Whether to use a balanced / imbalanced class distribution |
For example, the default `forgetting` metric is equivalent to `forgetting_hard_imbalance`, which prefers hard samples and does not balance classes.
- Change to `forgetting_easy` to prefer easy samples.
- Change to `forgetting_balance` to construct balanced samples.
- Change to `forgetting_easy_balance` or `forgetting_balance_easy` to prefer easy samples and construct balanced samples.
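One way to read the suffix rules above is as a small parser: strip trailing suffix tokens from the key, look up the method's defaults from the `pruning_key` table, then let the suffixes override them. The sketch below is a hypothetical re-implementation for illustration (`parse_pruning_key`, `PruningConfig`, and `DEFAULTS` are names invented here); the repo's own parsing may differ.

```python
from typing import NamedTuple, Optional

SUFFIXES = {"easy", "hard", "balance", "imbalance"}

# Defaults copied from the pruning_key table: (prefer, balanced).
DEFAULTS = {
    "random": (None, False),
    "ssp": ("hard", False),
    "entropy": ("hard", False),
    "accumulated_margin": ("hard", False),
    "forgetting": ("hard", False),
    "el2n": ("hard", False),
    "ccs": ("easy", False),
    "yoco": ("easy", True),
}


class PruningConfig(NamedTuple):
    method: str
    prefer: Optional[str]  # "easy", "hard", or None when not applicable
    balanced: bool


def parse_pruning_key(key: str) -> PruningConfig:
    """Split a key such as 'forgetting_easy_balance' into method and options."""
    tokens = key.split("_")
    suffixes = []
    # Peel suffixes off the end so method names that contain underscores
    # (e.g. accumulated_margin) stay intact.
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        suffixes.append(tokens.pop())
    method = "_".join(tokens)
    if method not in DEFAULTS:
        raise ValueError(f"unknown pruning method: {method!r}")

    prefer, balanced = DEFAULTS[method]
    for suffix in suffixes:
        if suffix in ("easy", "hard"):
            prefer = suffix
        else:
            balanced = suffix == "balance"
    return PruningConfig(method, prefer, balanced)


print(parse_pruning_key("forgetting"))
print(parse_pruning_key("forgetting_balance_easy"))
```

Under this reading, suffix order does not matter, which matches the equivalence of `forgetting_easy_balance` and `forgetting_balance_easy` stated above.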
To make the experiment results easy to reproduce, we provide a bash script for each table. The scripts can be found in `scripts/table[x].sh`. The training dynamics and scores used in our experiments can be downloaded from Google Drive. Note: the training dynamics contain large files (e.g., idc/cifar100 is ~6GB).
The downloaded files should follow the file structure below:
- YOCO
  - raid
    - reproduce_*
      - dynamics_and_scores
        - idc
        - dream
        - ...
    - condensed_img (download from Module 1)
      - idc
      - dream
      - ...
- Our experiment results are averaged over three independent training dynamics, which correspond to the folders `reproduce_1`, `reproduce_2`, and `reproduce_3`.
@inproceedings{
heyoco2023,
title={You Only Condense Once: Two Rules for Pruning Condensed Datasets},
author={Yang He and Lingao Xiao and Joey Tianyi Zhou},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=AlTyimRsLf}
}