PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition (Gulati et al., 2020).
Clone this repository.
git clone https://github.com/jaketae/conformer.git
Navigate to the cloned directory. You can start using the model via
>>> from conformer import ConformerEncoder
>>> model = ConformerEncoder()
By default, the model comes with the following parameters:
ConformerEncoder(
    num_blocks=6,        # number of Conformer blocks
    d_model=256,         # model (encoder) dimension
    num_heads=4,         # number of attention heads
    max_len=512,         # maximum sequence length for positional encoding
    expansion_factor=4,  # feed-forward expansion factor
    kernel_size=31,      # depthwise convolution kernel size
    dropout=0.1,         # dropout rate
)
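As a quick sanity check, a dummy forward pass along the lines of the following should work. This is a minimal sketch: the expected input shape of (batch_size, seq_len, d_model) and the shape-preserving output are assumptions based on the defaults above and on the fact that this implementation contains no downsampling or input projection.
>>> import torch
>>> from conformer import ConformerEncoder
>>> model = ConformerEncoder()
>>> x = torch.randn(4, 128, 256)  # (batch_size, seq_len, d_model); shape is assumed
>>> out = model(x)
>>> out.shape  # expected to match the input shape
torch.Size([4, 128, 256])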
The Transformer (Vaswani et al., 2017) has proven to be immensely successful in various domains, such as machine translation, language modeling, and more recently, computer vision. An important reason behind the success of the transformer architecture is self-attention, which allows the model to attend to the entire input sequence to generate rich feature representations.
Convolutional neural networks, a more traditional architecture, have been widely used in the vision domain. Their sliding-kernel structure encodes meaningful inductive biases such as translation invariance, making them well-suited as local feature extractors.
The Conformer seeks to combine the best of both worlds: global features are extracted by self-attention, while local features are learned by the convolution module. The Conformer has proven effective in automatic speech recognition (ASR).
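To make this structure concrete, below is a rough sketch of a single Conformer block in PyTorch, following the layout described in the paper: macaron-style half-step feed-forward modules sandwiching self-attention and a convolution module. This is an illustration only, not the code in this repository; the class names are hypothetical, relative positional encoding and padding masks are omitted, and a standard nn.MultiheadAttention stands in for the relative-position attention used by the Conformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """Half-step feed-forward module: LayerNorm -> Linear -> Swish -> Linear."""
    def __init__(self, d_model, expansion_factor=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * expansion_factor),
            nn.SiLU(),  # Swish activation
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion_factor, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    """Convolution module: pointwise conv + GLU, depthwise conv, BatchNorm, Swish, pointwise conv."""
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                     # x: (batch, time, d_model)
        x = self.norm(x).transpose(1, 2)      # -> (batch, d_model, time)
        x = F.glu(self.pointwise1(x), dim=1)  # gated linear unit over channels
        x = F.silu(self.batch_norm(self.depthwise(x)))
        x = self.dropout(self.pointwise2(x))
        return x.transpose(1, 2)              # -> (batch, time, d_model)

class ConformerBlock(nn.Module):
    """One Conformer block: 0.5*FFN + self-attention + convolution + 0.5*FFN, all residual."""
    def __init__(self, d_model=256, num_heads=4, kernel_size=31, dropout=0.1):
        super().__init__()
        self.ff1 = FeedForward(d_model, dropout=dropout)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.conv = ConvModule(d_model, kernel_size, dropout)
        self.ff2 = FeedForward(d_model, dropout=dropout)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)             # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global context
        x = x + self.conv(x)                  # local features
        x = x + 0.5 * self.ff2(x)             # second half-step feed-forward
        return self.final_norm(x)
Stacking num_blocks of such blocks, with relative positional attention in place of the vanilla attention above, is essentially what the encoder does.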
While the original paper used the Conformer specifically for ASR, I implemented this model in the hope of applying it as a general feature encoder in audio generation tasks, such as speech prosody transfer, voice conversion, and singing voice synthesis. For this reason, the current implementation includes only the encoder portion of the Conformer architecture and omits components such as downsampling and SpecAugment.
This implementation was heavily influenced by Soohwan Kim's implementation of Conformer. The skewing logic employed in relative positional encoding was inspired by Prayag Chatha's implementation of Music Transformer (Huang et al., 2018).