# PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition (forked by ML TUNI)
This repository is a fork of https://github.com/qiuqiangkong/audioset_tagging_cnn, the code release for the paper *PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition* [1].
## Dataset

We do not provide scripts to download the dataset here and assume that the files have already been prepared by the user.
The code assumes that each file is already an audio segment to be studied, not a full recording from which segments are to be extracted; that the segments are 10 seconds long (although the code should support any length of audio common to all clips used); and that they are named according to the pattern `Y*.wav`, i.e. they begin with 'Y', followed by arbitrary symbols (the video's YouTube ID), and have the extension '.wav'.
The data should have a sampling rate of 32000 Hz, although, again, any sample rate common to all files is supported.
The files should be stored in two separate directories, `train/` and `eval/`, for the training and evaluation splits, respectively.
Metadata files storing class labels for the audio clips should be provided as tab-separated files, as in Google's AudioSet: Reformatted dataset.
NOTE: the filename fields should not contain the prefix "Y" or the extension ".wav"; these are added by the scripts.
The metadata files should only mention audio files that were actually downloaded, and only the classes that were selected for the model.
You can check the dataset folder to verify the format of all files.
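For illustration only (this helper is not part of the repository), the mapping between a metadata filename field and the corresponding file on disk described above amounts to:

```python
import os

def metadata_name_to_path(name, audios_dir):
    """Map a metadata 'filename' field to the on-disk file it refers to.

    Illustrative helper: the metadata stores the bare YouTube ID, while
    the files on disk carry the 'Y' prefix and the '.wav' extension.
    """
    return os.path.join(audios_dir, "Y" + name + ".wav")

# e.g. metadata_name_to_path("someYouTubeID", "train/") -> "train/YsomeYouTubeID.wav"
```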
Create the following environment variables to keep track of all files and locations:
```
AUDIOS_DIR_TRAIN    # Location of the training split of files
AUDIOS_DIR_EVAL     # Location of the evaluation split of files
DATASET_PATH_TRAIN  # Path to the tsv file containing strong labels for the train split
DATASET_PATH_EVAL   # Path to the tsv file containing strong labels for the eval split
LOGS_DIR            # Location to store logs
```
## Target arrays

Use the `panns.data.target` module from the command line to save the target arrays for the train and eval splits.
The weak target array has the shape `(files, classes)`; the strong target array has the shape `(files, frames, classes)`.
The weak target only needs to be computed once for a dataset, while the strong target depends on the sample rate, the hop length (in samples) and the clip length (in ms).
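For example, with the values used in the commands below (a sample rate of 32000 Hz, a hop length of 320 samples and a clip length of 10000 ms), each clip spans 320000 samples, i.e. on the order of 1000 frames. This is only a back-of-the-envelope check; the module itself determines the exact frame count:

```python
# Rough check of the frame dimension entering the strong target shape
# (illustrative; the module's exact count may differ by a frame).
sample_rate = 32000    # Hz
hop_length = 320       # samples
clip_length = 10000    # ms

clip_samples = sample_rate * clip_length // 1000   # 320000 samples per clip
frames = clip_samples // hop_length
print(frames)                                       # 1000
```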
Use the following environment variables:
```
TARGET_WEAK_PATH_TRAIN    # Path to save the weak target array for train files
TARGET_WEAK_PATH_EVAL     # Path to save the weak target array for eval files
TARGET_STRONG_PATH_TRAIN  # Path to save the strong target array for train files
TARGET_STRONG_PATH_EVAL   # Path to save the strong target array for eval files
```
And then call the module as follows:
```bash
# Weak target
# Train split
python -m panns.data.target --dataset_path=$DATASET_PATH_TRAIN \
                            --target_type=weak \
                            --target_path=$TARGET_WEAK_PATH_TRAIN
# Eval split
python -m panns.data.target --dataset_path=$DATASET_PATH_EVAL \
                            --target_type=weak \
                            --target_path=$TARGET_WEAK_PATH_EVAL

# Strong target
# Train split
python -m panns.data.target --dataset_path=$DATASET_PATH_TRAIN \
                            --target_type=strong \
                            --target_path=$TARGET_STRONG_PATH_TRAIN \
                            --sample_rate=32000 \
                            --hop_length=320 \
                            --clip_length=10000
# Eval split
python -m panns.data.target --dataset_path=$DATASET_PATH_EVAL \
                            --target_type=strong \
                            --target_path=$TARGET_STRONG_PATH_EVAL \
                            --sample_rate=32000 \
                            --hop_length=320 \
                            --clip_length=10000
```
## Pack waveforms into hdf5 files

For training and evaluation, the actual audio files need to be packed into an hdf5 object using the `panns.data.hdf5` module, which relies on the h5py package.
The audio arrays are made to match the length `clip_length * sample_rate / 1000` samples, either by truncating or by zero-padding.
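As an illustration of this padding/truncation rule (a simplified sketch, not the module's actual code):

```python
import numpy as np

def pad_or_truncate(audio, sample_rate=32000, clip_length=10000):
    """Force an audio array to clip_length * sample_rate / 1000 samples.

    Simplified illustration of the behaviour described above; the actual
    panns.data.hdf5 implementation may differ in details.
    """
    clip_samples = sample_rate * clip_length // 1000
    if len(audio) >= clip_samples:
        return audio[:clip_samples]             # truncate long clips
    padded = np.zeros(clip_samples, dtype=audio.dtype)
    padded[:len(audio)] = audio                 # zero-pad short clips
    return padded
```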
Create the following environment variables:
```
HDF5_FILES_PATH_TRAIN  # Location for the hdf5 compression of the train split
HDF5_FILES_PATH_EVAL   # Location for the hdf5 compression of the eval split
```
And make the calls:
```bash
# Train split
python -m panns.data.hdf5 --audios_dir=$AUDIOS_DIR_TRAIN \
                          --dataset_path=$DATASET_PATH_TRAIN \
                          --hdf5_path=$HDF5_FILES_PATH_TRAIN \
                          --logs_dir=$LOGS_DIR \
                          --sample_rate=32000 \
                          --clip_length=10000
# Eval split
python -m panns.data.hdf5 --audios_dir=$AUDIOS_DIR_EVAL \
                          --dataset_path=$DATASET_PATH_EVAL \
                          --hdf5_path=$HDF5_FILES_PATH_EVAL \
                          --logs_dir=$LOGS_DIR \
                          --sample_rate=32000 \
                          --clip_length=10000
```
NOTE: optionally, the `--mini_data` parameter can be specified to pack only the given number of files.
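Once a split has been packed, the resulting file can be sanity-checked with h5py. The snippet below is only illustrative; the dataset names inside the file are whatever `panns.data.hdf5` writes, so it simply lists them:

```python
import os
import h5py

# List the contents of the packed train split.
with h5py.File(os.environ["HDF5_FILES_PATH_TRAIN"], "r") as f:
    for name, item in f.items():
        print(name, getattr(item, "shape", ""), getattr(item, "dtype", ""))
```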
## Models

The models are defined in `panns/models/models.py`; some auxiliary classes are defined in `panns/models/blocks.py`.
The models have been significantly reworked compared to the original implementation.
In particular, the custom-written torchlibrosa has been replaced with native torchaudio. This applies to spectrogram extraction, for the models that require it, as well as to spectrogram augmentation.
Furthermore, many versions of the Cnn14 model in the original implementation differed only by a handful of hardcoded parameters. They have been refactored into the main Cnn14 model, whose parameters can now be customized to reproduce the original variants. Cnn6 and Cnn10 received the same treatment.
In general, the following parameters are used to customize the models (some models only support some of them; check the source):

- `classes_num`: Number of classes used
- `wavegram`: Whether to use the Wavegram features (see [1])
- `spectrogram`: Whether to use the log-mel spectrogram features
- `sample_rate`: Sample rate of the original audio
- `win_length`: Window length to use for MelSpectrogram extraction
- `hop_length`: Hop length of the window for MelSpectrogram extraction
- `n_mels`: Number of mel filterbanks to use for MelSpectrogram
- `f_min`: Minimum frequency
- `f_max`: Maximum frequency
- `spec_aug`: Whether to use spectrogram augmentation during training
- `mixup_time`: Whether to perform mixup in the time domain (before feature extraction)
- `mixup_freq`: Whether to perform mixup in the frequency domain (after feature extraction)
- `dropout`: Whether to perform dropout during training
- `decision_level`: Whether to output strong labels (`framewise_output`) and which function to use to generate them
- Additional:
  - `window_fn`, `center`, `pad_mode`: Passed to MelSpectrogram
  - `top_db`: Passed to AmplitudeToDB
  - `num_features`: Passed to BatchNorm2d (must be correct with respect to the input)
  - `embedding_size`: Number of nodes connecting the last two layers of the model
Below is a 'conversion table' between models in the original and the current implementation (note that in all cases the spectrogram-related parameters have been renamed):
| Original | Current |
|---|---|
| Cnn6 | `Cnn6(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None)` |
| Cnn10 | `Cnn10(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None)` |
| Cnn14 | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None)` |
| Cnn14_8k | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, sample_rate=8000, win_length=256, hop_length=80, n_mels=64, f_min=50, f_max=4000)` |
| Cnn14_16k | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, sample_rate=16000, win_length=512, hop_length=160, n_mels=64, f_min=50, f_max=8000)` |
| Cnn14_no_specaug | `Cnn14(spec_aug=False, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None)` |
| Cnn14_no_dropout | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=False, wavegram=False, spectrogram=True, decision_level=None)` |
| Cnn14_mixup_time_domain | `Cnn14(spec_aug=True, mixup_time=True, mixup_freq=False, dropout=True, wavegram=False, spectrogram=True, decision_level=None)` |
| Cnn14_emb32 | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, embedding_size=32)` |
| Cnn14_emb128 | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, embedding_size=128)` |
| Cnn14_emb512 | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, embedding_size=512)` |
| Cnn14_mel32 | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, num_features=32)` |
| Cnn14_mel128 | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, num_features=128)` |
| Cnn14_DecisionLevelMax | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level='max')` |
| Cnn14_DecisionLevelAvg | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level='avg')` |
| Cnn14_DecisionLevelAtt | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level='att')` |
| Wavegram_Cnn14 | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=True, spectrogram=False, decision_level=None)` |
| Wavegram_Logmel_Cnn14 | `Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=True, spectrogram=True, decision_level=None)` |
| Wavegram_Logmel128_Cnn14 | Not implemented |
| ResNet22 | `ResNet22` (same) |
| ResNet38 | `ResNet38` (same) |
| ResNet54 | `ResNet54` (same) |
| Res1dNet31 | `Res1dNet31(classes_num)` (other parameters are not used) |
| Res1dNet51 | `Res1dNet51(classes_num)` (other parameters are not used) |
| MobileNetV1 | `MobileNetV1` (same) |
| MobileNetV2 | `MobileNetV2` (same) |
| LeeNet11 | `LeeNet11(classes_num)` (other parameters are not used) |
| LeeNet24 | `LeeNet24(classes_num, dropout=True)` (other parameters are not used; `dropout` can be set to `False`) |
| DaiNet19 | `DaiNet19(classes_num)` (other parameters are not used) |
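For instance, the base Cnn14 configuration from the table could be created in Python roughly as follows. This is only a sketch: it assumes the classes are importable from `panns.models.models`, and the spectrogram-related values simply mirror the training example below.

```python
from panns.models.models import Cnn14  # classes defined in panns/models/models.py

# Parameter values mirror the training example below; adjust to your data.
model = Cnn14(classes_num=110,
              sample_rate=32000, win_length=1024, hop_length=320,
              n_mels=64, f_min=50, f_max=14000,
              spec_aug=True, mixup_time=False, mixup_freq=True,
              dropout=True, wavegram=False, spectrogram=True,
              decision_level=None)
```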
## Training

Training is performed using `panns/train.py` and is controlled by the following parameters:
- Model configuration:
  - `model_type`: One of the classes in `panns/models/models.py` (the model used)
  - Parameters for the model (see Models):
    - `classes_num`, `sample_rate`, `win_length`, `hop_length`, `f_min`, `f_max`, `n_mels`, `decision_level`, `pad_mode`, `top_db`, `num_features`, `embedding_size`: Passed directly to the model constructor, see Models
    - `spec_aug`/`no_spec_aug`, `mixup_time`/`no_mixup_time`, `mixup_freq`/`no_mixup_freq`, `dropout`/`no_dropout`, `wavegram`/`no_wavegram`, `spectrogram`/`no_spectrogram`, `center`/`no_center`: Set the corresponding model parameter to `True`/`False`, respectively
- File locations:
  - `hdf5_files_path_train`, `hdf5_files_path_eval`: Location of the hdf5 files, see [HDF5](#pack-waveforms-into-hdf5-files)
  - `target_path_train`, `target_path_eval`: Location of the target arrays for the train/eval split, either weak or strong
  - `logs_dir`: Folder to store logs (default `logs` in the CWD)
  - `checkpoints_dir`: Folder to store checkpoints every 100000 iterations (default `checkpoints` in the CWD)
  - `statistics_dir`: Folder to save evaluation results every 2000 iterations (default `statistics` in the CWD)
  - `resume_checkpoint_path`: Location to load a trained model checkpoint from
- Training loop control:
  - `label_type`: Whether to use the weak or strong label output of the model to calculate the BCE loss; must match the dataset target given, and `strong` can only be used with compatible models (ones that have the `decision_level` parameter)
  - `batch_size`: Number of files used in one training iteration
  - `learning_rate`: Learning rate for the optimizer
  - `iter_max`: Number of training iterations to perform (an 'iteration' is the processing of one batch; we do not use epochs in this pipeline)
  - `num_workers`: Number of workers to pass to the DataLoader
  - `cuda`: Whether to use the GPU (flag)
Example of initiating training:
```bash
python -m panns.train --hdf5_files_path_train=$HDF5_FILES_PATH_TRAIN \
                      --hdf5_files_path_eval=$HDF5_FILES_PATH_EVAL \
                      --target_path_train=$TARGET_WEAK_PATH_TRAIN \
                      --target_path_eval=$TARGET_WEAK_PATH_EVAL \
                      --label_type='weak' \
                      --model_type='Cnn14' \
                      --classes_num=110 \
                      --decision_level='max' \
                      --spectrogram \
                      --win_length=1024 \
                      --hop_length=320 \
                      --sample_rate=32000 \
                      --f_min=50 \
                      --f_max=14000 \
                      --n_mels=64 \
                      --spec_aug \
                      --no_mixup_time \
                      --no_mixup_freq \
                      --dropout \
                      --no_wavegram \
                      --batch_size=32 \
                      --learning_rate=1e-3 \
                      --iter_max=600000 \
                      --num_workers=8 \
                      --cuda
```
## Inference

Using `panns.inference`, it is possible to produce a file with strong labels for a given dataset, inferred from the trained model, in the same format as the dataset metadata files.
For that, a checkpoint of the trained model is needed, as well as an hdf5 compression of the evaluation set.
The script accepts the following parameters:
- File parameters:
  - `hdf5_files_path`: Location of the hdf5 compression of the dataset
  - `dataset_path`: Location of the dataset tsv file
  - `output_path`: Filename to save the detected events to
  - `checkpoint_path`: Location of the checkpoint of the model to use
  - `logs_dir`: Directory to write logs into (optional)
- Model parameters: same as during the Training phase
- `batch_size`, `cuda`, `num_workers`: Control passing data to the model, similarly to the training phase
- Inference parameters:
  - `threshold`: Applied to the output of the model; only values greater than it are considered an 'event detected'
  - `minimum_event_gap`: In seconds; the minimum gap between two consecutive events for them to be considered separate. Events closer than this are merged together by filling the small gap
  - `minimum_event_length`: In seconds; events shorter than this are ignored (gaps are closed first, then short events are removed)
Example of inference:
```bash
# Fill in --checkpoint_path with a checkpoint from $CHECKPOINTS_DIR
python -m panns.inference --hdf5_files_path=$HDF5_FILES_PATH_EVAL \
                          --dataset_path=$DATASET_PATH_EVAL \
                          --checkpoint_path= \
                          --output_path='inference.tsv' \
                          --logs_dir=$LOGS_DIR \
                          --batch_size=32 \
                          --num_workers=8 \
                          --cuda \
                          --threshold=0.5 \
                          --minimum_event_gap=0.1 \
                          --minimum_event_length=0.1 \
                          --model_type='Cnn14' \
                          --classes_num=110 \
                          --decision_level='max' \
                          --spectrogram \
                          --win_length=1024 \
                          --hop_length=320 \
                          --sample_rate=32000 \
                          --f_min=50 \
                          --f_max=14000 \
                          --n_mels=64 \
                          --spec_aug \
                          --no_mixup_time \
                          --no_mixup_freq \
                          --dropout \
                          --no_wavegram
```
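To clarify how `threshold`, `minimum_event_gap` and `minimum_event_length` interact, below is a simplified sketch of the post-processing for a single class. It is an illustration only, not the repository's actual implementation, and the frame-to-seconds conversion assumes the hop length and sample rate used above.

```python
import numpy as np

def detect_events(probs, threshold=0.5, minimum_event_gap=0.1,
                  minimum_event_length=0.1, hop_length=320, sample_rate=32000):
    """Turn framewise probabilities for one class into (onset, offset) pairs in seconds.

    Simplified illustration of the post-processing described above; the
    actual panns.inference implementation may differ in details.
    """
    frame_seconds = hop_length / sample_rate
    active = np.asarray(probs) > threshold          # 'event detected' frames

    # Collect contiguous active runs as candidate events (in seconds).
    events, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            events.append([start * frame_seconds, i * frame_seconds])
            start = None
    if start is not None:
        events.append([start * frame_seconds, len(active) * frame_seconds])

    # Merge events separated by gaps shorter than minimum_event_gap.
    merged = []
    for onset, offset in events:
        if merged and onset - merged[-1][1] < minimum_event_gap:
            merged[-1][1] = offset
        else:
            merged.append([onset, offset])

    # Discard events shorter than minimum_event_length.
    return [(on, off) for on, off in merged if off - on >= minimum_event_length]
```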
## References

[1] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang and M. D. Plumbley, "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880-2894, 2020.

[2] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776-780, 2017.

[3] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold and M. Slaney, "CNN architectures for large-scale audio classification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131-135, 2017.