Transcribe, diarize, annotate and subtitle audio and video files with Whisper ... fast!
whisply combines faster-whisper, insanely-fast-whisper and batch processing of files (with mixed languages). It also enables speaker detection and annotation via pyannote.
Supported output formats:
- .json
- .txt
- .srt
- .rttm
Requirements:
- FFmpeg
- python3.11
If you want to use a GPU:
- nvidia GPU (CUDA)
- Metal Performance Shaders (MPS) → Mac M1-M3
If you want to activate speaker detection / diarization:
- HuggingFace access token
1. Install ffmpeg
--- macOS ---
brew install ffmpeg
--- linux ---
sudo apt-get update
sudo apt-get install ffmpeg
--- Windows ---
https://ffmpeg.org/download.html
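If you use a package manager on Windows, FFmpeg can also be installed from the command line; for example, assuming Chocolatey is already set up:
choco install ffmpeg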
2. Clone this repository and change to project folder
git clone https://github.com/th-schmidt/whisply.git
cd whisply
3. Create a Python virtual environment and activate it
python3.11 -m venv venv
source venv/bin/activate
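On Windows (cmd), the same virtual environment is activated with:
venv\Scripts\activate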
4. Install dependencies with pip
pip install -r requirements.txt
>>> python whisply_cli.py --help
Usage: whisply_cli.py [OPTIONS]
WHISPLY processes audio and video files for transcription, optionally
enabling speaker diarization and generating .srt subtitles or saving
transcriptions in .txt format. Default output is a .json file for each input
file that saves timestamps and transcripts.
Options:
--files PATH Path to file, folder, URL or .list to process.
[required]
--output_dir DIRECTORY Folder where transcripts should be saved. DEFAULT:
"./transcriptions"
--device [cpu|gpu|mps] Select the computation device: CPU, GPU (nvidia
CUDA), or MPS (Metal Performance Shaders).
--lang TEXT Specify the language of the audio for transcription
(en, de, fr ...). DEFAULT: None (= auto-detection)
--detect_speakers Enable speaker diarization to identify and separate
different speakers.
--hf_token TEXT HuggingFace Access token required for speaker
diarization.
--srt Create .srt subtitles from the transcription.
--txt Create .txt with the transcription.
--config FILE Path to configuration file.
--help Show this message and exit.
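As an illustration, a minimal run that transcribes a single file on the CPU and additionally writes .srt subtitles could look like this (the file name is just a placeholder):
python whisply_cli.py --files ./interview.mp4 --device cpu --srt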
To use --detect_speakers, you need to provide a valid HuggingFace access token via the --hf_token flag. In addition, you have to accept the pyannote user conditions for both version 3.0 and 3.1 of the segmentation model. Follow the instructions in the Requirements section of the pyannote model page on HuggingFace.
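Once the user conditions are accepted, a diarization run might look like this (file name and token value are placeholders):
python whisply_cli.py --files ./interview.mp4 --detect_speakers --hf_token hf_xxxxxxxxxxxx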
You can provide a .json config file by using the --config option, which makes processing more user-friendly. An example config looks like this:
{
"files": "path/to/files",
"output_dir": "./transcriptions",
"device": "cpu",
"lang": null,
"detect_speakers": false,
"hf_token": "Hugging Face Access Token",
"txt": true,
"srt": false
}
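Assuming the config above is saved as my_config.json (the file name is just an example), you would start processing with:
python whisply_cli.py --config my_config.json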
Instead of providing a file, folder or URL via the --files option, you can pass a .list file containing a mix of files, folders and URLs for processing. Example:
cat my_files.list
video_01.mp4
video_02.mp4
./my_files/
https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
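The .list file is then passed like any other input (the file name matches the example above):
python whisply_cli.py --files my_files.list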
If you are transcribing multiple files, whisply will first detect the language of each file.
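If all files share one language, you can skip auto-detection by specifying it explicitly; for example, for German audio (the language code and folder are placeholders):
python whisply_cli.py --files ./my_files/ --lang de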