This repository is the official implementation of "Expansile Validation Dataset (EVD): Towards Resolving the Train/Validation Split Tradeoff".
To install requirements:
pip install -r requirements.txt
Each folder contains four core files (i.e., augmentation.py, sample.py, data_extender.py, feature_distribution.py). Files with the same name under different folders play the same logical role, but they are implemented differently because of the different types of data they act on. The role of each file is explained below:
augmentation.py: generation of the auxiliary dataset by data augmentation
sample.py: coreset operation to generate the validation set from the auxiliary dataset
data_extender.py: iterative expansion of the validation set
feature_distribution.py: calculation of feature distributions
Note that, for the sake of brevity, comments on the code are mainly placed in the files under the CV folder.
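To illustrate the coreset operation that sample.py performs, here is a minimal greedy k-center sketch in plain Python. The function name, the toy 2-D features, and the choice of the k-center objective are illustrative assumptions, not the repository's exact procedure (which operates on extracted feature vectors):

```python
import math

def greedy_k_center(points, k, seed_idx=0):
    """Greedy k-center coreset selection: repeatedly pick the point
    farthest from the points selected so far, so the selection covers
    the feature space. Returns the indices of the selected points."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = [seed_idx]
    # distance from every point to its nearest selected point
    min_dist = [dist(p, points[seed_idx]) for p in points]
    while len(selected) < k:
        far_idx = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(far_idx)
        # refresh nearest-selected distances with the new pick
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], dist(p, points[far_idx]))
    return selected

# toy 2-D features: two clusters near (0, 0) and (10, 10), one point between
pts = [(0, 0), (0.5, 0.2), (10, 10), (9.8, 10.1), (5, 5)]
print(greedy_k_center(pts, 3))  # -> [0, 2, 4]
```

Picking the farthest-remaining point keeps the selected subset spread across all modes of the data, which is why coreset-style selection yields a more representative validation set than uniform random sampling.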
For Tabular Data
All datasets are provided in 'TabularData/datasets/'. You can also download them from Kaggle or the UCI repository.
For Text
Except for the feature extractor, all files are already provided in the corresponding folders. If you want to generate these files from scratch, execute the following commands in order.
# Configure path_config.yaml; to create the configured paths, run:
python utils.py
# To load dataset from nltk and save, run:
python reuters.py
# Change the model paths in nlpaug to your local paths, then run:
CUDA_VISIBLE_DEVICES=0 python augmentation.py
# To obtain feature extractor, run (download pretrained models from https://huggingface.co/models if some caching errors occur)
CUDA_VISIBLE_DEVICES=0 python feature_extractor.py
# To obtain initial val-set by coreset operation, run
CUDA_VISIBLE_DEVICES=0 python sample.py
For CV
Due to the large size of the image dataset, we only provide the commands to generate these files rather than the files themselves.
# Configure path_config.yaml; to create the configured paths, run:
python utils.py
# Generate cifar10-longtail
python imbalanced_dataset.py
# To generate auxiliary dataset
python augmentation.py
# To obtain feature extractor, run
CUDA_VISIBLE_DEVICES=0 python feature_extractor.py
# To obtain initial val-set by coreset operation, run
CUDA_VISIBLE_DEVICES=0 python sample.py
To get the results in the paper, run the following commands:
Results of tabular data
# Configure the save path in run_all_xgb.sh and run
./run_all_xgb.sh
# Configure the save path in run_all_xgb_coreset.sh and then run
./run_all_xgb_coreset.sh
# Configure the same save path as in the two scripts above at the beginning of the file,
# then run the command to get the statistics in all tables
python record.py
Results of Reuters (Text, NLP)
# For results in Table 2
CUDA_VISIBLE_DEVICES=0 python reuter_eval_main.py -vm holdout --k 1 --save_name xxx
CUDA_VISIBLE_DEVICES=0 python reuter_eval_main.py -vm kfold --k 5 --save_name xxx
CUDA_VISIBLE_DEVICES=0 python reuter_eval_main.py -vm jkfold --k 5 --J 4 --save_name xxx
CUDA_VISIBLE_DEVICES=0 python reuter_eval_main.py -vm aug_coreset_whole --k 1 --fe_type fine-tune --feature_dis_type NDB --save_name xxx
# For results in Table 5
CUDA_VISIBLE_DEVICES=0 python reuter_eval_main.py -vm coreset_part_holdout --k 1 --save_name xxx
# For results in Table 6
CUDA_VISIBLE_DEVICES=0 python reuter_eval_main.py -vm aug_holdout --k 1 --fe_type fine-tune --feature_dis_type NDB --save_name xxx
CUDA_VISIBLE_DEVICES=0 python reuter_eval_main.py -vm aug_kfold --k 5 --fe_type fine-tune --feature_dis_type NDB --save_name xxx
# For results in Figure 2
python reuters_hyper_params_search.py
Results of CIFAR-10-LT (Image, CV)
# For results in Table 2
CUDA_VISIBLE_DEVICES=0 python cifar10_eval_main.py -vm holdout --k 1 --save-dir xxx
CUDA_VISIBLE_DEVICES=0 python cifar10_eval_main.py -vm kfold --k 5 --save-dir xxx
CUDA_VISIBLE_DEVICES=0 python cifar10_eval_main.py -vm jkfold --J 4 --k 5 --save-dir xxx
CUDA_VISIBLE_DEVICES=0 python cifar10_eval_main.py -vm coreset_whole --k 1 --save-dir xxx
# For results in Table 4
CUDA_VISIBLE_DEVICES=0 python cifar10_eval_main.py -vm random_coreset --k 1 --save-dir xxx
# For results in Table 5
CUDA_VISIBLE_DEVICES=0 python cifar10_eval_main.py -vm coreset_part_holdout --k 1 --save-dir xxx
# For results in Table 6
CUDA_VISIBLE_DEVICES=0 python cifar10_eval_main.py -vm aug_holdout --k 1 --feature_dis_type NDB --config_path ./config/cifar10_default.yaml --save-dir xxx
CUDA_VISIBLE_DEVICES=0 python cifar10_eval_main.py -vm aug_kfold --k 5 --feature_dis_type NDB --config_path ./config/cifar10_default.yaml --save-dir xxx
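The `--feature_dis_type NDB` flag above selects the Number of statistically Different Bins measure for comparing feature distributions. As a rough one-dimensional illustration (not the repository's implementation, whose function names and binning scheme may differ), NDB bins two samples the same way, runs a two-proportion z-test per bin, and counts the bins whose proportions differ significantly:

```python
import math

def ndb(feats_a, feats_b, bin_edges, z_threshold=2.0):
    """Simplified NDB: count bins where the two samples' bin
    proportions differ by more than z_threshold standard errors."""
    def proportions(feats):
        counts = [0] * (len(bin_edges) + 1)
        for x in feats:
            # bin index = number of edges the value is at or above
            counts[sum(1 for e in bin_edges if x >= e)] += 1
        n = len(feats)
        return [c / n for c in counts], n

    pa, na = proportions(feats_a)
    pb, nb = proportions(feats_b)
    different = 0
    for a, b in zip(pa, pb):
        pooled = (a * na + b * nb) / (na + nb)
        se = math.sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
        if se > 0 and abs(a - b) / se > z_threshold:
            different += 1
    return different

# toy example: one sample covers two modes, the other collapses to one
print(ndb([0.1] * 50 + [0.9] * 50, [0.1] * 100, bin_edges=[0.5]))  # -> 2
```

A low NDB indicates that the expanded validation set's feature distribution stays close to the reference distribution; identical samples give an NDB of 0.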