Transcribe, diarize, annotate and subtitle audio and video files with Whisper ... fast!
whisply combines faster-whisper, insanely-fast-whisper and batch processing of files (with mixed languages). It also enables speaker detection and annotation via pyannote.
Supported output formats:
- .json
- .txt
- .srt
- .rttm
Requirements:
- FFmpeg
- python3.11
If you want to use a GPU:
- nvidia GPU (CUDA)
- Metal Performance Shaders (MPS) → Mac M1-M3
If you want to activate speaker detection / diarization:
- HuggingFace access token
1. Install ffmpeg
--- macOS ---
brew install ffmpeg
--- linux ---
sudo apt-get update
sudo apt-get install ffmpeg
--- Windows ---
https://ffmpeg.org/download.html
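If you use a package manager on Windows, FFmpeg can also be installed from the command line; for example, assuming Chocolatey is already set up:
choco install ffmpeg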
2. Clone this repository and change to project folder
git clone https://github.com/th-schmidt/whisply.git
cd whisply
3. Create a Python virtual environment and activate it
python3.11 -m venv venv
source venv/bin/activate
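On Windows (cmd), the same virtual environment is activated with:
venv\Scripts\activate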
4. Install dependencies with pip
pip install -r requirements.txt
>>> python whisply_cli.py --help
Usage: whisply_cli.py [OPTIONS]
WHISPLY processes audio and video files for transcription, optionally
enabling speaker diarization and generating .srt subtitles or saving
transcriptions in .txt format. Default output is a .json file for each input
file that saves timestamps and transcripts.
Options:
--files PATH Path to file, folder, URL or .list to process.
[required]
--output_dir DIRECTORY Folder where transcripts should be saved. DEFAULT:
"./transcriptions"
--device [cpu|gpu|mps] Select the computation device: CPU, GPU (nvidia
CUDA), or MPS (Metal Performance Shaders).
--lang TEXT Specify the language of the audio for transcription
(en, de, fr ...). DEFAULT: None (= auto-detection)
--detect_speakers Enable speaker diarization to identify and separate
different speakers.
--hf_token TEXT HuggingFace Access token required for speaker
diarization.
--srt Create .srt subtitles from the transcription.
--txt Create .txt with the transcription.
--config FILE Path to configuration file.
--help Show this message and exit.
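As an illustration, a minimal run that transcribes a single file on the CPU and additionally writes .srt subtitles could look like this (the file name is just a placeholder):
python whisply_cli.py --files ./interview.mp4 --device cpu --srt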
To use --detect_speakers, you need to provide a valid HuggingFace access token via the --hf_token flag. In addition, you have to accept the pyannote user conditions for both version 3.0 and 3.1 of the segmentation model. Follow the instructions in the Requirements section of the pyannote model page on HuggingFace.
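Once the user conditions are accepted, a diarization run might look like this (file name and token value are placeholders):
python whisply_cli.py --files ./interview.mp4 --detect_speakers --hf_token hf_xxxxxxxxxxxx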
You can provide a .json config file by using the --config option, which makes processing more user-friendly. An example config looks like this:
{
"files": "path/to/files",
"output_dir": "./transcriptions",
"device": "cpu",
"lang": null,
"detect_speakers": false,
"hf_token": "Hugging Face Access Token",
"txt": true,
"srt": false
}
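Assuming the config above is saved as my_config.json (the file name is just an example), you would start processing with:
python whisply_cli.py --config my_config.json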
Instead of providing a file, folder or URL via the --files option, you can pass a .list file containing a mix of files, folders and URLs for processing. Example:
cat my_files.list
video_01.mp4
video_02.mp4
./my_files/
https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
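The .list file is then passed like any other input (the file name matches the example above):
python whisply_cli.py --files my_files.list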
If you are transcribing multiple files, whisply will first detect the language of each file.
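If all files share one language, you can skip auto-detection by specifying it explicitly; for example, for German audio (the language code and folder are placeholders):
python whisply_cli.py --files ./my_files/ --lang de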