Efficient PyTorch reimplementation of per-channel energy normalization with Mel spectrogram features.
Robustness to loudness differences in near- and far-field conditions is critical in high-quality speech recognition applications. Spectrogram energies obviously differ between, say, shouting at arm's length and whispering from a distance, and this can degrade model quality, since the model itself must then be robust across a wide range of input energies. The log-compression step in the popular log-Mel transform partially addresses this issue by reducing the dynamic range of the audio; however, it ignores per-channel energy differences and is static by definition.
Per-channel energy normalization (PCEN) is one solution to these problems. It provides a per-channel, trainable front-end in place of the log compression, greatly improving model robustness in keyword spotting systems -- all the while being resource-efficient and easy to implement.
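Concretely, PCEN replaces the pointwise log compression `log(E(t, f))` with an automatic gain control followed by root compression, `PCEN(t, f) = (E(t, f) / (eps + M(t, f))^alpha + delta)^r - delta^r`, where `M` is an exponential moving average of the Mel energies `E` over time. Below is a minimal non-streaming sketch in plain PyTorch to illustrate the computation; it is independent of this package's API, and the default parameter values follow Wang et al. (2017):

```python
import torch

def pcen(E, alpha=0.98, delta=2.0, r=0.5, s=0.025, eps=1e-6):
    """Plain (non-streaming) PCEN over a (batch, n_mels, frames) Mel energy tensor."""
    # Exponential moving average over time: M[t] = (1 - s) * M[t - 1] + s * E[t]
    M = torch.empty_like(E)
    M[..., 0] = E[..., 0]
    for t in range(1, E.size(-1)):
        M[..., t] = (1 - s) * M[..., t - 1] + s * E[..., t]
    # Per-channel automatic gain control, then stabilized root compression
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```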
- PyTorch and NumPy are required. LibROSA and matplotlib are required only for the example.
- To install via pip, run `pip install git+https://github.com/daemon/pytorch-pcen`. Otherwise, clone this repository and run `python setup.py install`.
- To run the example in the module, place a 16kHz WAV file named `yes.wav` in the current directory. Then, do `python -m pcen.pcen` (a script-level equivalent is sketched below).
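If you would rather call the transform from your own code than run the bundled example, the following sketch loads the same file with LibROSA. The `(1, n_samples)` input shape and constructor arguments mirror the streaming example further below; treat the rest as an illustrative assumption rather than a prescribed recipe:

```python
import librosa
import torch
import pcen

# Load the 16 kHz mono example file (LibROSA resamples if the file's rate differs)
audio, _ = librosa.load('yes.wav', sr=16000, mono=True)
audio = torch.from_numpy(audio).unsqueeze(0)  # shape: (1, n_samples)

transform = pcen.StreamingPCENTransform(n_mels=40, n_fft=480, hop_length=160)
features = transform(audio)  # PCEN-compressed Mel features
transform.reset()            # Clear the streaming state before reusing the transform
print(features.shape)
```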
The following is a self-contained example of using a streaming PCEN layer:
```python
import pcen
import torch

# 40-dimensional features, 30-millisecond window, 10-millisecond shift.
# trainable defaults to False, so we enable it explicitly here.
transform = pcen.StreamingPCENTransform(n_mels=40, n_fft=480, hop_length=160, trainable=True)
audio = torch.empty(1, 16000).normal_(0, 0.1)  # Gaussian noise
# 1600 is an arbitrary chunk size; this step is unnecessary but demonstrates the streaming nature
streaming_chunks = audio.split(1600, 1)
pcen_chunks = [transform(chunk) for chunk in streaming_chunks]  # Transform each chunk
transform.reset()  # Reset the persistent streaming state
pcen_ = torch.cat(pcen_chunks, 1)  # Concatenate the transformed chunks along the time dimension
```
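Because the transform above is constructed with `trainable=True`, its PCEN parameters can in principle be optimized jointly with a downstream model. A hedged sketch, assuming `StreamingPCENTransform` is a standard `torch.nn.Module` that registers its PCEN parameters, and assuming per-frame features with `n_mels` in the last dimension:

```python
import torch
import pcen

transform = pcen.StreamingPCENTransform(n_mels=40, n_fft=480, hop_length=160, trainable=True)
classifier = torch.nn.Linear(40, 2)  # toy downstream model over per-frame features
optimizer = torch.optim.Adam(list(transform.parameters()) + list(classifier.parameters()), lr=1e-3)

audio = torch.empty(4, 16000).normal_(0, 0.1)  # a batch of Gaussian-noise "utterances"
features = transform(audio)    # assumed shape: (batch, frames, n_mels)
transform.reset()
logits = classifier(features)  # per-frame logits
loss = logits.mean()           # placeholder loss for illustration only
loss.backward()                # gradients flow back into the PCEN parameters
optimizer.step()
```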
Wang, Yuxuan, Pascal Getreuer, Thad Hughes, Richard F. Lyon, and Rif A. Saurous. "Trainable Frontend for Robust and Far-Field Keyword Spotting." In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5670-5674. IEEE, 2017.
@inproceedings{wang2017trainable,
title={Trainable frontend for robust and far-field keyword spotting},
author={Wang, Yuxuan and Getreuer, Pascal and Hughes, Thad and Lyon, Richard F and Saurous, Rif A},
booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on},
pages={5670--5674},
year={2017},
organization={IEEE}
}