Combine deep feature models: create a basic script that solves any audio classification task using pyAudioAnalysis
In a new folder, create a feature extraction wrapper that uses pyAudioAnalysis (for the time being) to extract features from audio data (organized in folders). Input: a configuration file as a list of feature parameters, e.g.
{ 'feature_methods': { 'pyAudioAnalysis': {'mt': 1, 'st': 0.05, 'mt_step': 1, 'st_step': 0.01}, 'cnn_1': {'model_path': 'CNN1.pt', ...}, 'cnn_2': {'model_path': 'CNN2.pt', ...}, ... }, 'classifier': 'svm', ... }
Then, a training / evaluation script trains a model (SVM with RBF kernel) and returns (a) evaluation results and (b) a model saved in a file along with the input configuration (see above)
Third, write a tester that takes the model+configuration pickle and returns a prediction for a list of wav files or a single wav file
combine/feature_extraction.py
combine/trainer.py (calls feature_extraction.py)
combine/tester.py
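A minimal sketch of what tester.py could look like, assuming trainer.py pickles a (classifier, config) tuple; `extract_features` is a hypothetical wrapper from feature_extraction.py that honors the config's feature parameters:

```python
# combine/tester.py -- a minimal sketch, assuming trainer.py pickled a
# (classifier, config) tuple; extract_features is a hypothetical wrapper
# from feature_extraction.py that honors the config's feature parameters.
import pickle
import numpy as np
from feature_extraction import extract_features  # hypothetical

def predict(model_path, wav_files):
    with open(model_path, 'rb') as f:
        classifier, config = pickle.load(f)
    if isinstance(wav_files, str):  # accept a single wav file too
        wav_files = [wav_files]
    # re-extract features with the exact parameters used at training time
    X = np.array([extract_features(w, config['feature_methods'])
                  for w in wav_files])
    return classifier.predict(X)
```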
Use MidTermFeatures.mid_feature_extraction() to extract sequences of mid-term features for each input file, e.g. if mt = 1 second and the input file is 4 seconds long, this will return 4 x num_of_pyaudioanalysis_mid_features.
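For reference, a minimal sketch of that call (assuming the current pyAudioAnalysis API, where window/step sizes are given in samples, hence the multiplication by the sampling rate):

```python
import numpy as np
from pyAudioAnalysis import audioBasicIO, MidTermFeatures

def extract_midterm_sequence(wav_path, mt=1.0, st=0.05,
                             mt_step=1.0, st_step=0.01):
    sampling_rate, signal = audioBasicIO.read_audio_file(wav_path)
    signal = audioBasicIO.stereo_to_mono(signal)
    # pyAudioAnalysis expects window/step sizes in samples
    mt_feats, _, feature_names = MidTermFeatures.mid_feature_extraction(
        signal, sampling_rate,
        round(mt * sampling_rate), round(mt_step * sampling_rate),
        round(st * sampling_rate), round(st_step * sampling_rate))
    # transpose to (num_mid_windows x num_features), e.g. 4 x n
    # for a 4-second file with mt = mt_step = 1 sec
    return mt_feats.T, feature_names
```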
In the future the same needs to be done for the CNN feature extractors @lobracost
For the time being, the trainer will just average the feature vectors (just like pyAudioAnalysis does). In the future we may consider an LSTM.
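A minimal sketch of that averaging step feeding the RBF SVM (sklearn assumed; feature normalization omitted for brevity):

```python
import numpy as np
from sklearn.svm import SVC

def train_averaged(feature_sequences, labels):
    # collapse each (num_windows x num_features) sequence into one
    # per-file vector, exactly as pyAudioAnalysis-style averaging does
    X = np.array([seq.mean(axis=0) for seq in feature_sequences])
    clf = SVC(kernel='rbf', probability=True)
    clf.fit(X, labels)
    return clf
```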
That's a task we've discussed for @lobracost, but you can check the PR if you have time (that will take some days though, so I wouldn't expect it until 4 or 5 days from now).
@tyiannak I tested the options below:
- resizing the input images instead of zero padding
- using fused features (mel spectrogram + chromagram)

For all the models below, the data splitting and pytorch seeds were the same. That means that all the models were trained on the exact same train and validation sets, as well as with the same torch random variables. In fact, I checked that two training runs under the same parameters produced exactly the same output models, so the models below are directly comparable.
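For reference, here is a sketch of the two length-normalization options being compared, assuming spectrogram images shaped (time x mel_bins); scipy's `zoom` stands in for whatever resizing routine the code actually uses:

```python
import numpy as np
from scipy.ndimage import zoom

def zero_pad(spec, max_len):
    """Pad the time axis with zeros (truncate if longer than max_len)."""
    out = np.zeros((max_len, spec.shape[1]), dtype=spec.dtype)
    out[:min(max_len, spec.shape[0]), :] = spec[:max_len]
    return out

def resize(spec, new_len):
    """Rescale the time axis to new_len frames; the mel axis is unchanged."""
    return zoom(spec, (new_len / spec.shape[0], 1.0), order=1)
```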
Does resizing achieve the same performance as zero padding for fixed lengths?
For the 4class dataset, which has fixed audio length, I got the following results:
- Using zero padding (current code version): Accuracy: 73.21%
- Using resizing, with size equal to max_length (21x128): Accuracy: 72.32%

So we see that if we resize the image to max_length, the CNN has almost the same performance. Unfortunately, this dataset needs neither zero padding nor resizing, since every file has the same audio length; that means we cannot really compare zero padding against resizing on this data.
- Zero padding & fused features (mel spectrogram + chromagram): Accuracy: 76.79%
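For clarity, "fused features" here means stacking the two time-frequency representations into a single input image; a sketch of one way to do it (librosa is an assumption, the actual code may build the images differently):

```python
import numpy as np
import librosa

def fused_image(y, sr):
    # 128 mel bands + 12 chroma bins stacked along the frequency axis;
    # both use the same default hop length, so frame counts match
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    return np.vstack([mel, chroma])   # shape: (140 x num_frames)
```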
Which choice is better for datasets with different audio lengths?
For a subset of the movieSegment dataset, in which audio lengths vary between 0.a and 14 secs (the latter is what the max_length variable indicates), I got the following (without fused features), with max_length = 140:
- Using zero padding: Accuracy: 83.99%
- Using resizing:
  - 50x128 (~5 secs, close to the average audio length for this dataset): Accuracy: 88.26%
  - 35x128 (~3.5 secs): Accuracy: 88.26%
  - 20x128 (~2 secs): Accuracy: 80.78%
  - 65x128 (~6.5 secs): Accuracy: 87.90%
  - 80x128 (~8 secs): Accuracy: 81.14%
Conclusions:
- The fusion of the mel spectrogram with the chromagram gives an extra ~3.5% accuracy in the above examples.
- Image resizing gives us the ability to handle audio files of different lengths. Zero padding caused memory allocation errors when the max_length variable was large enough; this is not the case for resizing.
- For datasets with varying audio file lengths, we can achieve better performance than zero padding if we find a good sizing parameter. In this sense, the resizing parameter can be tuned per dataset.
- It seems that for resizing within the range A = [0.7 * average_audio_length, 1.3 * average_audio_length] we get the same performance regardless of the new size. For the time being this outcome holds only for this specific dataset, but similar properties may well hold for other datasets.
- If we resize to a length outside the range A, we get worse results.
- Overall, I think that resizing solves our problem and, if tuned right, can achieve better performance.
What do you think?
@lobracost that's great experimentation and discussion.
Let's do some more evaluation experiments on the chromagram/spectrogram fusion issue for several datasets.
The 2nd issue (how to handle different audio lengths) needs some more discussion. We should consider more factors before deciding, e.g. where will the CNNs be used? If they are to be used as end2end classifiers, then what you describe makes sense: we will select the setup that leads to the best performance, and it seems that resizing is indeed the most promising (though more experiments are needed there as well). But if the plan is to use CNNs as feature extractors on fixed-size segments, then we may also consider the following:
- cut each variable-sized segment into fixed-size, say 1-sec, segments
- let each fixed-size segment get the label of the larger segment it belongs to
- do the train/test split taking into consideration that the 1-sec segments from the same file must all be in either train or test (see the sketch below)
- the resulting CNN will be trained on fixed-size segments
In all cases we need more experimentation for more classification tasks
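A hedged sketch of that fixed-size-segment scheme (all names hypothetical): chop each file into 1-sec chunks that inherit the file's label, then split train/test at the file level so chunks from one file never end up on both sides:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def chop(signal, sampling_rate, seg_sec=1.0):
    """Cut a signal into non-overlapping fixed-size segments."""
    seg_len = int(seg_sec * sampling_rate)
    return [signal[i:i + seg_len]
            for i in range(0, len(signal) - seg_len + 1, seg_len)]

def file_level_split(labels, file_ids, test_size=0.2):
    """Split segments so chunks of one file stay on one side."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=0)
    # the dummy X only carries the sample count; groups drive the split
    return next(splitter.split(np.zeros((len(labels), 1)), labels,
                               groups=file_ids))
```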
I modified the code so that the image size is the average spectrogram size found in the training data. However, in my example this is not the optimal choice. I trained the CNN on a subset of the Moviesegment dataset.
[Histograms of the spectrogram size distributions for the train, validation and test data.]
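The target size was simply the mean time length over the training spectrograms, i.e. something like (a hypothetical one-liner, `train_spectrograms` assumed):

```python
import numpy as np

# hypothetical: target time length = average over the training spectrograms
avg_len = int(round(np.mean([spec.shape[0] for spec in train_spectrograms])))
```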
I trained the CNN for different resizing values + zero padding. All seeds were the same. The results are presented below:
----------------------- Zero padding -----------------------

Classification report:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.89      | 0.80   | 0.84     | 96      |
| 1            | 0.84      | 0.52   | 0.64     | 91      |
| 2            | 0.65      | 0.96   | 0.78     | 94      |
| accuracy     |           |        | 0.76     | 281     |
| macro avg    | 0.79      | 0.76   | 0.75     | 281     |
| weighted avg | 0.79      | 0.76   | 0.75     | 281     |
----------------------- Resizing -----------------------

- Image size: average spectrogram length = 58

Classification report:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.85      | 0.70   | 0.77     | 96      |
| 1            | 0.76      | 0.81   | 0.79     | 91      |
| 2            | 0.76      | 0.85   | 0.80     | 94      |
| accuracy     |           |        | 0.79     | 281     |
| macro avg    | 0.79      | 0.79   | 0.79     | 281     |
| weighted avg | 0.79      | 0.79   | 0.79     | 281     |

- Image size: 60

Classification report:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.85      | 0.75   | 0.80     | 96      |
| 1            | 0.75      | 0.90   | 0.82     | 91      |
| 2            | 0.88      | 0.81   | 0.84     | 94      |
| accuracy     |           |        | 0.82     | 281     |
| macro avg    | 0.83      | 0.82   | 0.82     | 281     |
| weighted avg | 0.83      | 0.82   | 0.82     | 281     |

- Image size: 50

Classification report:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.82      | 0.88   | 0.85     | 96      |
| 1            | 0.77      | 0.91   | 0.83     | 91      |
| 2            | 0.94      | 0.71   | 0.81     | 94      |
| accuracy     |           |        | 0.83     | 281     |
| macro avg    | 0.85      | 0.83   | 0.83     | 281     |
| weighted avg | 0.85      | 0.83   | 0.83     | 281     |

- Image size: 45

Classification report:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.86      | 0.82   | 0.84     | 96      |
| 1            | 0.70      | 0.91   | 0.79     | 91      |
| 2            | 0.93      | 0.69   | 0.79     | 94      |
| accuracy     |           |        | 0.81     | 281     |
| macro avg    | 0.83      | 0.81   | 0.81     | 281     |
| weighted avg | 0.83      | 0.81   | 0.81     | 281     |
Unfortunately, choosing the average size was the worst of the resizing options, but still better than zero padding.
@tyiannak do you have any idea which strategy to follow? Should we keep resizing to the average size for the time being?
@tyiannak I trained 2 models on the MovieSegment dataset: the first with average size 58 (classifier accuracy 79%) and the second with size = 51 (classifier accuracy 87%).
I then used these models as feature extractors to train an SVM, together with pyaudio features, on the emotions_music_in_folders/energy dataset.
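A sketch of this early-fusion setup as I understand it (sklearn assumed; the feature matrices come from hypothetical extractor functions); the PCA variant reported below reduces only the CNN block of the feature vector:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_and_train(cnn_feats, pyaudio_feats, labels, pca_components=None):
    """Early fusion: optionally PCA-reduce the CNN features, concatenate
    them with the pyaudio features, then fit an RBF SVM."""
    if pca_components is not None:
        cnn_feats = PCA(n_components=pca_components).fit_transform(cnn_feats)
    X = np.hstack([cnn_feats, pyaudio_feats])
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    clf.fit(X, labels)
    return clf
```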
--------------------------- No PCA ---------------------------

2-sec data (spectrogram time dimension = 41):
- Pretrained CNN on Moviesegment with size = 58 (average size): accuracy 0.54
- Pretrained CNN on Moviesegment with size = 51: accuracy 0.50
- pyaudio features alone: accuracy 0.63

5-sec data (spectrogram time dimension = 101):
- Pretrained CNN on Moviesegment with size = 58 (average size): accuracy 0.58
- Pretrained CNN on Moviesegment with size = 51: accuracy 0.48
- pyaudio features alone: accuracy 0.63

--------------------------- PCA of CNN features to 136 components ---------------------------

2-sec data (spectrogram time dimension = 41):
- Pretrained CNN on Moviesegment with size = 58 (average size): accuracy 0.48
- Pretrained CNN on Moviesegment with size = 51: accuracy 0.36

5-sec data (spectrogram time dimension = 101):
- Pretrained CNN on Moviesegment with size = 58 (average size): accuracy 0.45
- Pretrained CNN on Moviesegment with size = 51: accuracy 0.32
The good news is that the average size is better, regardless of the input image size of the meta-classifier. On the other hand, the SVM performs better when only pyaudio features are provided... Maybe that's because the CNNs are not powerful enough, since they were trained on a small dataset (a subset of Moviesegment) and no overlapping was used. Also, applying PCA to the CNN features gave worse performance.
About overlapping, I want to ask what to do when the CNN input size is greater than the input images. In such cases no overlapping is needed. Maybe resize?
Just to be sure: by "pretrained CNNs" you mean CNNs as features + SVM, right? Not tuning on the test dataset?
In that case, you should also test the CNN features + pyAudio features (early fusion).
About your question on overlapping: yes, the only solution I see in case of larger segments is resizing.
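To make the overlapping question concrete, a sketch of the windowing logic (names hypothetical, scipy's `zoom` as a stand-in resizer): slide a fixed CNN-sized window when the spectrogram is long enough, otherwise fall back to resizing:

```python
import numpy as np
from scipy.ndimage import zoom

def cnn_windows(spec, win_len, step):
    """Return fixed-size (possibly overlapping) chunks of the time axis;
    resize the whole spectrogram up when it is shorter than the window."""
    if spec.shape[0] < win_len:
        return [zoom(spec, (win_len / spec.shape[0], 1.0), order=1)]
    return [spec[i:i + win_len]
            for i in range(0, spec.shape[0] - win_len + 1, step)]
```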
Just to be sure: by "pretrained CNNs" you mean CNNs as features + SVM right? Not tuning on the test dataset?
In that case, u should also test the CNN features + pyAudio features (early fusion)About your question on overlapping: yes, the only solution i see in case of larger segments is resizing.
In every evaluation I used pretrained CNNs as feature extractors + pyAudio features + an SVM, and tuned them on the test dataset. As you can see, early fusion gave us worse performance than pyAudio features alone. Maybe I have to train more powerful CNNs.