ChaosMining

Source code of "ChaosMining: A Benchmark to Evaluate Post-Hoc Local Attribution Methods in Low SNR Environments"

Repo Structure

.
├── chaosmining             # Source files; the core modules for building the machine learning pipeline with PyTorch
├── data                    # Data folder with functional, vision, and audio data (a placeholder for preview; users need to download the data from Hugging Face Datasets)
├── data_engineer           # Source files; scripts to create the synthetic datasets and perform preprocessing
├── examples                # Source files; model training, evaluation, and localization
├── notebooks               # Notebooks for preliminary results, lightweight experiment results, and plots
├── exps                    # Bash scripts; Linux commands to run the experiments (an empty placeholder: the bash scripts are not disclosed due to heterogeneous system settings)
├── LICENSE
└── README.md

Dataset

The dataset is accessible on Hugging Face under DOI doi:10.57967/hf/2482. A detailed data card, dataset viewer, and Croissant metadata are available on the Hugging Face page. We provide the dataset structure here as a preview; a short download sketch follows the tree below.

./data/
├── audio/
│   ├── RBFP/
│   │   ├── train/
│   │   │   ├── meta_data.csv
│   │   │   └── ...
│   │   ├── eval/
│   │   │   ├── meta_data.csv
│   │   │   └── ...
│   ├── RBRP/
│   │   └── ...
│   ├── SBFP/
│   │   └── ...
│   ├── SBRP/
│   │   └── ...
├── vision/
│   ├── RBFP/
│   │   ├── train/
│   │   │   ├── meta_data.csv
│   │   │   └── ...
│   │   ├── eval/
│   │   │   ├── meta_data.csv
│   │   │   └── ...
│   ├── RBRP/
│   │   └── ...
│   ├── SBFP/
│   │   └── ...
│   ├── SBRP/
│   │   └── ...
└── symbolic_simulation/
    └── formula.csv
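
If you prefer to fetch the data programmatically, here is a minimal sketch using huggingface_hub, assuming the dataset ID geshijoker/chaosmining (taken from the citation below) and that a local ./data/ target should mirror the preview tree; adjust paths as needed.

# Sketch: download the ChaosMining dataset files from the Hugging Face Hub.
# The dataset ID comes from the citation URL; local_dir is an assumption meant
# to mirror the ./data/ preview tree above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="geshijoker/chaosmining",
    repo_type="dataset",
    local_dir="./data",
)
print("Dataset files downloaded to", local_path)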

Install Environment

Set up a virtual environment and install the package together with its required dependencies:

foo@bar:~$ cd /path/to/package
foo@bar:~$ python -m venv env
foo@bar:~$ source env/bin/activate
foo@bar:~$ pip install -e .

Run Data Creation

We release the code used to generate the synthetic datasets; it lives in the data_engineer directory. The commands below create the audio data and the vision data, respectively.

foo@bar:~$ python data_engineer/create_audio_data.py --input_path_fg FOREGROUND_INPUT_PATH --input_path_bg BACKGROUND_INPUT_PATH --output_path OUTPUT_PATH 
foo@bar:~$ python data_engineer/create_vision_data.py --input_path INPUT_PATH --output_path OUTPUT_PATH
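
Each generated split contains a meta_data.csv (see the preview tree above). As a quick, hypothetical sanity check of the output (the exact column set is not documented here), you could inspect it with pandas:

# Sketch: inspect the metadata of a generated split.
# OUTPUT_PATH is the same placeholder used in the commands above; no columns of
# meta_data.csv are assumed beyond what pandas reports.
import pandas as pd

meta = pd.read_csv("OUTPUT_PATH/train/meta_data.csv")
print(meta.shape)
print(meta.head())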

Run Experiments

We provide example code for running the experiments. Users may write bash job scripts that call this code to reproduce the experimental results.

For experiments on symbolic functional data, you may run

foo@bar:~$ python examples/train_eval_simulation.py -d ./data/symbolic_simulation/formula.csv -e ./runs/simulation/ -n 14 -s SEED --num_noises 100 --ny_var 0.01 --optimizer Adam --learning_rate 0.001 --deterministic 

For the feature-selection experiments that combine neural networks with attribution methods on the simulation data, you may run

foo@bar:~$ python examples/RFEwNA_simulation.py -d ./data/symbolic_simulation/formula.csv -e ./runs/RFEwNA -n rfe_ig -s SEED -g 0 --num_noises 100 --ny_var 0.01 --optimizer Adam --learning_rate 0.001 --dropout 0.0 --xai ig --deterministic
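
The --xai ig flag selects Integrated Gradients as the attribution method. As a rough, self-contained illustration of the kind of per-feature attribution being benchmarked (the toy model, the shapes, and the use of Captum's IntegratedGradients are assumptions for this sketch, not the repo's exact code):

# Sketch: Integrated Gradients on a toy regressor whose input concatenates a
# few relevant features with many appended noise features, mirroring the
# --num_noises setting above. Everything below is illustrative only.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

class TinyRegressor(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one scalar prediction per example

num_relevant, num_noises = 5, 100
model = TinyRegressor(num_relevant + num_noises).eval()

x = torch.randn(8, num_relevant + num_noises)
ig = IntegratedGradients(model)
attributions = ig.attribute(x)  # scalar output per example, so no target index is needed

# A good attribution method should place most of its mass on the relevant
# (leading) features rather than on the appended noise features.
relevant_share = attributions[:, :num_relevant].abs().sum() / attributions.abs().sum()
print(f"Share of attribution mass on relevant features: {relevant_share.item():.3f}")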

For experiments on vision data, first run the model training and evaluation code, then run the benchmarking of post-hoc attribution methods:

# Train and evaluate the vision model.
foo@bar:~$ python examples/train_eval_vision.py -d ./data/vision/RBFP/ -e ./runs/vision/RBFP/ -n arc_vit_b_16 -s SEED --model_name vit_b_16 --gpu 0 --num_classes 10 --num_epochs 30 --batch_size 128 --learning_rate 0.001 --pretrained --deterministic
# Benchmark the post-hoc attribution methods.
foo@bar:~$ python examples/eval_vision_localization.py -d ./data/vision/RBFP/ -e ./runs/vision/RBFP/ -n arc_vit_b_16 -s SEED --model_name vit_b_16 --gpu 0 --num_classes 10 --batch_size 2 --deterministic
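
The localization script benchmarks how well attribution maps recover the informative region of each image. As a rough sketch of that kind of check (the ViT backbone matches the command above, but the attribution method, the dummy inputs, and the box coordinates are hypothetical; the real bookkeeping lives in examples/eval_vision_localization.py):

# Sketch: fraction of attribution mass falling inside a known signal region.
# The box coordinates below are placeholders; in the benchmark the informative
# region comes from the synthetic data's metadata.
import torch
from torchvision.models import vit_b_16
from captum.attr import Saliency

model = vit_b_16(num_classes=10).eval()           # randomly initialized stand-in
images = torch.randn(2, 3, 224, 224)              # dummy batch
preds = model(images).argmax(dim=1)

saliency = Saliency(model)
attr = saliency.attribute(images, target=preds).abs().sum(dim=1)  # (N, H, W) maps

y0, y1, x0, x1 = 64, 160, 64, 160                 # hypothetical ground-truth box
inside = attr[:, y0:y1, x0:x1].sum(dim=(1, 2))
print((inside / attr.sum(dim=(1, 2))).tolist())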

For experiments on audio data, similarly, first run the model training and evaluation code, then run the benchmarking of post-hoc attribution methods:

# Train and evaluate the audio model.
foo@bar:~$ python examples/train_eval_audio.py -d ./data/audio/RBFP/ -e ./runs/audio/RBFP/ -n arc_TRAN -s SEED --model_name TRAN --n_channels 10 --length 16000 --gpu 0 --num_epochs 30 --batch_size 128 --learning_rate 0.0001 --deterministic
# Benchmark the post-hoc attribution methods.
foo@bar:~$ python examples/eval_audio_localization.py -d ./data/audio/RBFP/ -e ./runs/audio/RBFP/ -n arc_RNN -s SEED --model_name RNN --n_channels 10 --length 16000 --gpu 0 --batch_size 32 --deterministic

Citation

If you use this code or data in your research, please cite the following:

@misc{ge_shi_2024,
  author        = {Ge Shi},
  title         = {chaosmining (Revision 6c23193)},
  year          = 2024,
  url           = {https://huggingface.co/datasets/geshijoker/chaosmining},
  doi           = {10.57967/hf/2482},
  publisher     = {Hugging Face}
}

@misc{shi2024chaosminingbenchmarkevaluateposthoc,
  title         = {ChaosMining: A Benchmark to Evaluate Post-Hoc Local Attribution Methods in Low SNR Environments},
  author        = {Ge Shi and Ziwen Kan and Jason Smucny and Ian Davidson},
  year          = {2024},
  eprint        = {2406.12150},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2406.12150}
}