
Cog implementation of a transcription + diarization pipeline with Whisper & Pyannote


Cog Whisper Diarization

Audio transcription + speaker diarization pipeline.

Models used

  • Whisper Large v3 (CTranslate2 version, via faster-whisper)
  • Pyannote audio 3.1.1

Usage

  • Used at Audiogest
  • Or try at Replicate
  • Or deploy it yourself on Replicate (make sure to add your own Hugging Face API key and accept the terms of use of the pyannote models used)

Input

  • file_string: str: A Base64-encoded audio file.
  • file_url: str: Alternatively, a direct audio file URL.
  • file: Path: Alternatively, an audio file.
  • group_segments: bool: Group consecutive segments from the same speaker that are less than 2 seconds apart. Default is True.
  • num_speakers: int: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
  • language: str: Language of the spoken words as a language code like 'en'. Leave empty to autodetect the language.
  • prompt: str: Vocabulary: provide names, acronyms, and loanwords in a list. Use punctuation for best accuracy.
  • offset_seconds: int: Offset in seconds, used for chunked inputs. Default is 0.
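
As an illustration of the inputs above, here is a minimal sketch of assembling an input payload in Python. The helper name build_input is hypothetical and not part of this repository. Requiring exactly one audio source is an assumption of this sketch rather than documented behavior. Only the field names, defaults, and the 1 to 50 range for num_speakers come from the list above.

```python
import base64
from typing import Any, Dict, Optional


def build_input(
    audio_bytes: Optional[bytes] = None,
    file_url: Optional[str] = None,
    *,
    num_speakers: Optional[int] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    group_segments: bool = True,
    offset_seconds: int = 0,
) -> Dict[str, Any]:
    """Hypothetical helper: build a dict using the documented input names."""
    # Assumption of this sketch: exactly one audio source is supplied.
    if (audio_bytes is None) == (file_url is None):
        raise ValueError("provide exactly one of audio_bytes or file_url")
    # Documented constraint: num_speakers must be between 1 and 50.
    if num_speakers is not None and not 1 <= num_speakers <= 50:
        raise ValueError("num_speakers must be between 1 and 50")

    payload: Dict[str, Any] = {
        "group_segments": group_segments,
        "offset_seconds": offset_seconds,
    }
    if audio_bytes is not None:
        # file_string expects the audio content as Base64 text.
        payload["file_string"] = base64.b64encode(audio_bytes).decode("ascii")
    else:
        payload["file_url"] = file_url

    # Leave optional fields out entirely so autodetection stays enabled.
    if num_speakers is not None:
        payload["num_speakers"] = num_speakers
    if language:
        payload["language"] = language
    if prompt:
        payload["prompt"] = prompt
    return payload
```

For example, build_input(file_url="https://example.com/meeting.mp3", language="en") yields a dict with file_url, language, and the two defaults. The URL here is a placeholder.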

Output

  • segments: List[Dict]: List of segments with speaker, start and end time.
  • num_speakers: int: Number of speakers (detected, unless specified in input).
  • language: str: Language of the spoken words as a language code like 'en' (detected, unless specified in input).
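
To make the shape of the output concrete, the sketch below post-processes a list of segments in Python. It is not this repository's implementation. The dictionary keys "speaker", "start", "end", and "text", the speaker labels, and the sample data are all assumptions for illustration. The merge function shows one plausible reading of what group_segments does: consecutive segments from the same speaker are combined when the gap between them is under 2 seconds.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]


def merge_close_segments(segments: List[Segment], max_gap: float = 2.0) -> List[Segment]:
    """Merge consecutive same-speaker segments separated by less than max_gap seconds."""
    merged: List[Segment] = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["speaker"] == seg["speaker"]
            and seg["start"] - prev["end"] < max_gap
        ):
            # Extend the previous segment instead of starting a new one.
            prev["end"] = seg["end"]
            prev["text"] = (prev.get("text", "") + " " + seg.get("text", "")).strip()
        else:
            merged.append(dict(seg))  # copy so the input list is not mutated
    return merged


def format_transcript(segments: List[Segment]) -> str:
    """Render one line per segment: [start-end] speaker: text."""
    return "\n".join(
        f"[{s['start']:.1f}-{s['end']:.1f}] {s['speaker']}: {s.get('text', '')}"
        for s in segments
    )


# Made-up sample data using the assumed keys.
sample = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.5, "text": "Hello"},
    {"speaker": "SPEAKER_00", "start": 2.0, "end": 3.0, "text": "there."},
    {"speaker": "SPEAKER_01", "start": 3.2, "end": 4.0, "text": "Hi."},
    {"speaker": "SPEAKER_00", "start": 8.0, "end": 9.0, "text": "Bye."},
]
grouped = merge_close_segments(sample)
transcript = format_transcript(grouped)
```

With this sample, the first two segments are merged because they share a speaker and are 0.5 seconds apart, while the last segment stays separate because another speaker talks in between.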

Thanks to