The codebase is for EACL-finding paper Generating Synthetic Speech from SpokenVocab for Speech Translation. SpokenVocab is proposed as an alternative to TTS. It's a simple, fast and effective data augmentation that converts text to speech on the fly. Our experimental results show that synthetic speech generated via SpokenVocab works as well as TTS systems.
Required package:
pip install pydub numpy
SpokenVocab:
Stitching speech works for any languages as long as the language is supported by a TTS system (e.g, Google TTS, Azure TTS). You may also generate speeches for your vocabulary.
Required files (take English as an example):
- word vocabulary:
vocab/vocab.en.txt
- SpokenVocab bank: download speeches from this link for the above word vocabulary and unpact it under
voice
Structure of vocab
(you may prepare a similar structure for your language or vocabulary)
vocab
├── vocab.en.txt
├── vocab.be.txt
├── vocab.en-be.txt
├── audio
│ ├── en
│ │ ├── spk0 # spk# contains speeches for vocab.en.txt
│ │ ├── spk1
│ │ ├── spk2
│ │ └── ...
The output is saved under output/spk#/lang/crossfade
and wav files are named by timestamps.
Currently supported languages and speakers
- English
- Vocab size: 36765
- Ten speakers
- Bengali
- Vocab size: 33821
- One speaker
Note that in the case of out-of-vocabulary words, the default tokens ("a" for English and "অইচি" Bengali) are used.
You can use SpokenVocab directly in your python code, shown as below:
# you would want to put the file and vocab under the same directory
from generate_speech import SpokenVocab, generate_stitched_voice
# set arguments
voice_root = "vocab/audio"
lng = "en"
wrd_vocab_fpath = f"vocab/vocab.{lng}.txt"
num_spk = 10
chosen_spk = "spk0"
crossfade = 100 # use this to control the "smoonthess" of the stitched speech
save_audio = True # or False if you don't want to save the wav file
save_path = "output"
# input sentence
sentence = "Nice to meet you all!"
# convert the above sentence to speech with spk0
# save the wav to save_path/chosen_spk
spok_vocab = SpokenVocab(voice_root, wrd_vocab_fpath, lng=lng, num_spk=num_spk)
wav = generate_stitched_voice(sentence, spok_vocab, spk=chosen_spk, save_audio=save_audio,
save_path=save_path, crossfade=crossfade)
# set arguments
voice_root = "vocab/audio"
lngs = ["en", "be"]
wrd_vocab_fpaths = [f"vocab/vocab.{lngs[0]}.txt", f"vocab/vocab.{lngs[1]}.txt"]
num_spks = [1, 1]
chosen_spk = ["spk0", "spk0"]
save_audio = True
save_path = "output"
crossfade = 100
# input sentence
sentences = "Nice to সাক্ষাৎ করে ভালো লাগলো"
# convert code-switched text to speech
spok_vocab = MixSpokenVocab(voice_root, wrd_vocab_fpaths, lngs=lngs, num_spks=num_spks)
wav = generate_stitched_mix_voice(sentence, spok_vocab, spks=chosen_spk, save_audio=save_audio,
save_path=save_path, crossfade=crossfade)
You may also retrieve the path of a speech given a word and a speaker:
word = "Morning"
spc_path = spok_vocab.get_speech(wrd=word, spk=chosen_spk)
You can use the code directly from the terminal in case you want to generate speech per a text file (e.g.,input.txt
).
bash run_spok_vocab.sh input.txt
Please cite if you use our code
@article{zhao2022generating,
title={Generating Synthetic Speech from SpokenVocab for Speech Translation},
author={Zhao, Jinming and Haffar, Gholamreza and Shareghi, Ehsan},
journal={arXiv preprint arXiv:2210.08174},
year={2022}
}