Go from raw audio files to a speaker-separated text-audio dataset automatically.
This repo takes a directory of audio files and converts them into a text-audio dataset with a normalized distribution of audio lengths. See AnalyseDataset.ipynb for examples of the dataset distributions across audio and text length.
The output is a text-audio dataset that can be used for training speech-to-text or text-to-speech models. Currently the code only supports single-speaker audio files.
The dataset structure is as follows:

```
/dataset
├── metadata.txt
└── wavs/
    ├── audio1.wav
    └── audio2.wav
```
Example metadata.txt:

```
peters_0.wav|Beautiful is better than ugly.
peters_1.wav|Explicit is better than implicit.
```
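Each line of metadata.txt pairs a wav filename with its transcription, separated by a pipe. As a minimal sketch of how a downstream training script might consume it (the function name and paths here are illustrative, not part of this repo):

```python
from pathlib import Path

def load_metadata(dataset_dir: str) -> list[tuple[Path, str]]:
    """Parse metadata.txt into (wav_path, transcription) pairs."""
    dataset = Path(dataset_dir)
    pairs = []
    for line in (dataset / "metadata.txt").read_text().splitlines():
        wav_name, text = line.split("|", maxsplit=1)  # filename|transcription
        pairs.append((dataset / "wavs" / wav_name, text))
    return pairs

# load_metadata("dataset")[0]
# -> (Path("dataset/wavs/peters_0.wav"), "Beautiful is better than ugly.")
```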
- Audio files are automatically split by speaker
- Speakers are auto-labeled across files
- Audio is split on silences
- Audio splitting is configurable (see the sketch after this list)
- Dataset creation aims for a Gaussian-like distribution of clip lengths, which in turn can lead to Gaussian-like distributions of the other dataset statistics. Of course, this is highly dependent on your audio sources.
- Leverages the GPUs available on your machine. GPUs can also be set explicitly if you only want to use some of them.
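Silence-based splitting along these lines can be prototyped with pydub; the thresholds below are illustrative placeholders, not the repo's actual defaults (those live in config.py):

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("data/audio_files_wav/example.wav")

# Split wherever at least 500 ms of audio stays below -40 dBFS,
# keeping 200 ms of padding so words aren't clipped at chunk edges.
chunks = split_on_silence(
    audio,
    min_silence_len=500,  # ms of quiet that triggers a split
    silence_thresh=-40,   # dBFS; lower means stricter "silence"
    keep_silence=200,     # ms kept on each side of a chunk
)

for i, chunk in enumerate(chunks):
    chunk.export(f"clip_{i}.wav", format="wav")
```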
- Clone the repo

```
git clone https://github.com/miguelvalente/whisperer.git
```

- Install the dependencies (requires Poetry)

```
cd whisperer
poetry install
poetry shell
```
- Create a data folder and move your audio files into it

```
mkdir data
mkdir data/audio_files
```
Commands can be called individually or sequentially:
- Convert

```
python -m main convert
```
- Diarize (requires convert to be called first)

```
python -m diarize
```
- Auto-Label (requires diarize to be called first)

```
python -m auto-label number_of_speakers_present_in_your_audio_file
```
- Transcribe (requires convert to be called first)

```
python -m transcribe your_dataset_name
```
- Convert & Diarize & Auto-Label & Transcribe

```
python main.py convert diarize auto-label 6 transcribe your_dataset_name
```
- Use the AnalyseDataset.ipynb notebook to visualize the distributions of the dataset
- Use the AnalyseSilence.ipynb notebook to experiment with the silence-detection configuration
The code automatically detects how many GPUs are available and distributes the audio files in data/audio_files_wav evenly across them. The detection is done through nvidia-smi. You can make the available GPUs explicit by setting the environment variable CUDA_AVAILABLE_DEVICES.
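As a rough sketch of how this kind of detection can work (illustrative only, not the repo's actual implementation), nvidia-smi can be queried for GPU indices, with the environment variable taking precedence:

```python
import os
import subprocess

def available_gpus() -> list[int]:
    """List GPU indices, honoring an explicit CUDA_AVAILABLE_DEVICES override."""
    override = os.environ.get("CUDA_AVAILABLE_DEVICES")
    if override:
        return [int(i) for i in override.split(",")]
    # This query makes nvidia-smi print one GPU index per line.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [int(line) for line in result.stdout.splitlines()]

# Files can then be assigned round-robin, e.g.:
# gpus = available_gpus()
# for i, wav in enumerate(wav_files):
#     assignments[gpus[i % len(gpus)]].append(wav)
```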
Modify the config.py file to change the parameters of dataset creation.
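The authoritative parameter names are in config.py itself; the snippet below is only a hypothetical illustration of the kind of knobs (silence detection, clip-length targets) such a config typically exposes:

```python
# Hypothetical config.py values; check the real file for the actual names.
SILENCE_THRESH = -40     # dBFS level treated as silence
MIN_SILENCE_LEN = 500    # ms of silence required to split a clip
KEEP_SILENCE = 200       # ms of padding kept around each clip
MIN_CLIP_LENGTH = 2.0    # seconds; shorter clips are discarded
MAX_CLIP_LENGTH = 12.0   # seconds; longer clips are split again
```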
- Speech Diarization