
Minimalist Speech-to-Text toolkit for educational purposes

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Joey S2T


JoeyS2T is a JoeyNMT extension for Speech-to-Text tasks such as Automatic Speech Recognition (ASR) and end-to-end Speech Translation (ST). It inherits the core philosophy of JoeyNMT, a minimalist novice-friendly toolkit built on PyTorch, seeking simplicity and accessibility.

What's new

  • Upgraded to JoeyNMT v2.3.
  • Our paper has been accepted at EMNLP 2022 System Demo Track!

Features

JoeyS2T implements the following features:

  • Transformer Encoder-Decoder
  • 1d-Conv Subsampling
  • Cross-entropy and CTC joint objective
  • Mel filterbank spectrogram extraction
  • CMVN, SpecAugment
  • WER evaluation
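To make the WER metric in the list above concrete, here is a minimal sketch of a sentence-level word error rate: the word-level edit distance divided by the number of reference words. It is an illustration only, not JoeyS2T's own evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name), not part of the JoeyS2T API.
    Words are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Corpus-level WER is usually computed by summing the edit distances over all utterances and dividing by the total number of reference words, rather than averaging per-sentence scores.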

Furthermore, all the functionalities in JoeyNMT v2 are also available from JoeyS2T:

  • BLEU and ChrF evaluation
  • BPE tokenization (with BPE dropout option)
  • Beam search and greedy decoding (with repetition penalty, ngram blocker)
  • Customizable initialization
  • Attention visualization
  • Learning curve plotting
  • Scoring hypotheses and references
  • Multilingual translation with language tags
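As an illustration of the n-gram blocker mentioned in the decoding options above, the following sketch computes which tokens would repeat an n-gram that already occurs in a partial hypothesis. It shows the general technique only; the function name `banned_next_tokens` is hypothetical, and this is not JoeyNMT's implementation.

```python
from typing import Sequence, Set


def banned_next_tokens(prefix: Sequence[int], n: int) -> Set[int]:
    """Return the tokens that would complete an n-gram already seen in `prefix`.

    Illustrative helper (hypothetical name), not part of the JoeyNMT API.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if len(prefix) < n:
        # No complete n-gram has been generated yet, so nothing can repeat.
        return set()

    # The last n-1 tokens form the context of the next n-gram.
    context = tuple(prefix[len(prefix) - (n - 1):]) if n > 1 else ()

    banned = set()
    for start in range(len(prefix) - n + 1):
        # If an earlier n-gram begins with the same context,
        # emitting its final token would repeat that n-gram.
        if tuple(prefix[start:start + n - 1]) == context:
            banned.add(prefix[start + n - 1])
    return banned


# The hypothesis ends in (1, 2), and the trigram (1, 2, 3) was already generated.
print(banned_next_tokens([1, 2, 3, 1, 2], n=3))
```

During beam search or greedy decoding, a blocker of this kind would typically set the scores of the banned tokens to negative infinity before the next token is selected.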

Installation

JoeyS2T is built on PyTorch. Please make sure you have a compatible environment. We tested JoeyS2T v2.3 with:

  • python 3.11
  • torch 2.1.2
  • torchaudio 2.1.2
  • cuda 12.1

Clone this repository and install via pip:

$ git clone https://github.com/may-/joeys2t.git
$ cd joeys2t
$ python -m pip install -e .
$ python -m unittest

📝 Note: You may need to install extra dependencies for the torchaudio backends (ffmpeg, sox, soundfile, etc.). See the torchaudio installation instructions.
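After installation, it can help to compare the installed package versions with the tested ones listed above. The sketch below uses only the standard library, so it runs even if a package is missing. The helper names `installed_version` and `matches_tested` are hypothetical, and a mismatch does not necessarily mean that your environment is incompatible.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions this README reports as tested with JoeyS2T v2.3.
TESTED_VERSIONS = {"torch": "2.1.2", "torchaudio": "2.1.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_tested(installed, tested):
    """Compare versions, ignoring a local build suffix such as '+cu121'."""
    return installed is not None and installed.split("+")[0] == tested


for package, tested in TESTED_VERSIONS.items():
    found = installed_version(package)
    status = "ok" if matches_tested(found, tested) else "differs from tested setup"
    print(f"{package}: installed={found}, tested={tested} ({status})")
```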

Documentation & Tutorials

If you are not yet familiar with JoeyNMT, please check the JoeyNMT documentation first.

For details, follow the tutorials in the notebooks directory.

Benchmarks & Pretrained models

We provide benchmarks and pretrained models for Automatic Speech Recognition (ASR) and Speech Translation (ST) with JoeyS2T.

The models are also available via Torch Hub!

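import torch

# Load the pretrained 'mustc_v2_ende_st' model via Torch Hub
# (the repository is fetched on first use).
model = torch.hub.load('may-/joeys2t', 'mustc_v2_ende_st')

# Pass a list of audio file paths; one output string is returned per input.
translations = model.generate(['test.wav'])
print(translations[0])
# 'Hallo, Welt!'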

⚠️ Warning: The 1d-conv layer may raise an error if the audio input is too short. (An input with fewer frames than the kernel size cannot be convolved!)
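To see when an input is too short, the standard output-length formula for a 1d convolution can be applied layer by layer. The sketch below is an illustration under stated assumptions: the kernel sizes `(5, 5)`, stride `2`, and padding `0` are example values, not necessarily JoeyS2T's actual configuration, and the function names are hypothetical. Substitute the values from your own model config.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Standard 1d-convolution output length (same formula as torch.nn.Conv1d)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def subsampled_length(num_frames, kernel_sizes=(5, 5), stride=2, padding=0):
    """Number of frames left after a stack of strided 1d-conv layers.

    The default layer settings are illustrative assumptions only.
    Raises ValueError if any layer receives fewer (padded) frames than its kernel size.
    """
    length = num_frames
    for layer, kernel_size in enumerate(kernel_sizes, start=1):
        if length + 2 * padding < kernel_size:
            raise ValueError(
                f"conv layer {layer}: {length} frames (+ padding {padding} per side) "
                f"is shorter than kernel size {kernel_size}"
            )
        length = conv1d_output_length(length, kernel_size, stride, padding)
    return length


print(subsampled_length(100))  # 100 frames -> 48 -> 22 under the assumed settings

try:
    subsampled_length(12)      # the second layer would see only 4 frames
except ValueError as err:
    print("too short:", err)
```

A check like this can be run on the number of feature frames before decoding, so that very short clips are skipped or padded instead of causing a runtime error.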

Reference

If you use JoeyS2T in a publication or thesis, please cite the following paper:

@inproceedings{ohta-etal-2022-joeys2t,
    title = "{JoeyS2T}: Minimalistic Speech-to-Text Modeling with {JoeyNMT}",
    author = "Ohta, Mayumi  and
      Kreutzer, Julia  and
      Riezler, Stefan",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-demos.6",
    pages = "50--59",
}

Contact

Please open an issue if you find a bug in the code.

For general questions, email me at ohta <at> cl.uni-heidelberg.de.