Listen to Interpret (L2I)

This repository contains code for the paper "Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF" by Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Florence d'Alché-Buc, and Gaël Richard (accepted at NeurIPS 2022).

Link to the project webpage, which contains audio samples and interpretations for all the experiments.

Setup

Virtual Environment

Set up a new conda environment with the env_audio.yml file.

   conda env create -f env_audio.yml
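
Then activate the environment before running any of the commands below. The environment name is defined inside env_audio.yml; 'audio_env' here is only a placeholder:

   conda activate audio_env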

NOTE: The default PyTorch installed is CPU-only. If you wish to train with a GPU (recommended, especially for SONYC-UST), please also run the following command after creating the environment (or an equivalent one according to your CUDA compatibility).

   conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=9.2 -c pytorch
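
To check which PyTorch build is active (a quick sanity check, not part of the original instructions), you can run:

   python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

The second value should print True once the CUDA-enabled build is installed.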

Datasets

We perform experiments on two datasets: ESC-50 and SONYC-UST. You will need to download and extract both. Instructions for downloading them are given below.

(1) ESC-50: Download link on their GitHub page. Extract the downloaded zip to the path 'L2I-code/datasets'. Please ensure the name of the extracted folder is 'ESC50'. The final path of the dataset should look like 'L2I-code/datasets/ESC50'.

(2) SONYC-UST: Download link and instructions on their Zenodo page. The dataset is offered under the CC BY 4.0 license by its authors. Ensure the name of the dataset folder is 'SONYC_UST'. The final path of the dataset should look like 'L2I-code/datasets/SONYC_UST'.

If the datasets are downloaded and extracted to a different directory, please ensure the paths defined at L61 -- L75 of audint_posthoc.py match their locations and names.
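
After extraction, the expected directory layout is therefore:

   L2I-code/
   └── datasets/
       ├── ESC50/
       └── SONYC_UST/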

Network architecture and trained models

The models trained on ESC-50 fold-1 and SONYC-UST are available in the 'output/esc50_output/' and 'output/sust_output/' directories respectively, along with the dictionaries trained on them.

The backbone network that we fine-tune and perform post-hoc interpretation on is based on the work of Kumar et al. We have not uploaded the pre-trained weights (on AudioSet) of this network, only our fine-tuned weights on our datasets. To train on a new dataset, please refer to the original repository.

Usage

audint_posthoc.py is our main script.

1. Command: python audint_posthoc.py [mode: test or train] [dataset name] [use-gpu: True or False] [fine-tune-classifier: True or False] [if mode is test, give the model name as the fifth argument]

2. Dataset options: esc50, sust.

3. Example commands:
   python audint_posthoc.py test esc50 False False try10_AttExpv2.pt
   python audint_posthoc.py test sust False False try14_AttExp_v2.pt

These commands should generate the fidelity metrics by default and, in the case of ESC-50, also generate interpretations from the overlap experiment. You can run the other experiments (faithfulness or noise experiments) by uncommenting the relevant parts at L1283 -- L1336 of audint_posthoc.py.

NOTE: The functionality for fine-tuning the classifier is not set up properly. Refer to the Setup section above regarding this issue.

4. Command for training:
   python audint_posthoc.py train esc50 True False

GPU usage is recommended for training. Refer to the Setup section above for adding GPU support.
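
For SONYC-UST, the analogous training command would follow the same argument pattern (this exact invocation is an assumption, not listed in the original instructions):

   python audint_posthoc.py train sust True False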