- Have Conda and Yarn on your device
- Clone or fork this repository
- Install the backend and frontend environments
sh install_playground.sh
- Review config.py to make sure the transcription device and compute type match your setup
- Run the backend
cd backend && python server.py
- In a different terminal, run the React frontend
cd interface && yarn start
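As a rough guide for the config.py review step above, the following sketch picks a transcription device and compute type based on whether a CUDA GPU is available. The function name `pick_device_and_compute_type` and the value strings (`"cuda"`, `"cpu"`, `"float16"`, `"int8"`) are illustrative assumptions, not the actual names or options in this repository's config.py. Check the file itself for the values it accepts.

```python
def pick_device_and_compute_type(has_cuda_gpu: bool) -> tuple[str, str]:
    """Return a (device, compute_type) pair. All values here are hypothetical."""
    if has_cuda_gpu:
        # Half precision is a common choice for GPU inference.
        return "cuda", "float16"
    # 8-bit integer quantization is a common choice for CPU inference.
    return "cpu", "int8"


if __name__ == "__main__":
    device, compute_type = pick_device_and_compute_type(has_cuda_gpu=False)
    print(f"device={device}, compute_type={compute_type}")
```

If the values in config.py do not match the hardware (for example, a GPU device on a machine without one), the backend will likely fail when loading the model, so it is worth checking before the first run.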
This repository uses libraries based on pyannote.audio models, which are hosted on the Hugging Face Hub. You must accept each model's terms of use before you can download it. Note: You need a Hugging Face account to use pyannote.
- Accept terms for the pyannote/segmentation model
- Accept terms for the pyannote/embedding model
- Accept terms for the pyannote/speaker-diarization model
- Install huggingface-cli and log in with your user access token (can be found in Settings -> Access Tokens)
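As an alternative to the interactive CLI login, the token can also be supplied from Python through the `huggingface_hub` package (the same package that provides huggingface-cli). The sketch below is a minimal example under the assumption that the token is stored in the `HF_TOKEN` environment variable. The helper names `read_hf_token` and `log_in_to_hub` are hypothetical and not part of this repository.

```python
import os


def read_hf_token(env=None) -> str:
    """Return the access token stored in the HF_TOKEN environment variable."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set HF_TOKEN to your Hugging Face user access token.")
    return token


def log_in_to_hub(token: str) -> None:
    """Log in so that gated models such as the pyannote ones can be downloaded."""
    # Imported lazily so read_hf_token works even if the package is missing.
    from huggingface_hub import login

    login(token=token)


if __name__ == "__main__":
    log_in_to_hub(read_hf_token())
```

Logging in only authenticates the download. The terms for each pyannote model still have to be accepted on the model's page while signed in to the same account.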
- Model Size: Choose the model size, from tiny to large-v2.
- Language: Select the language you will be speaking in.
- Transcription Timeout: Set the number of seconds the application will wait before transcribing the current audio data.
- Beam Size: Adjust the number of candidate transcriptions generated and considered. Larger values can improve accuracy but increase transcription time.
- Transcription Method: Choose "real-time" for real-time diarization and transcriptions, or "sequential" for periodic transcriptions with more context.
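The parameters above can be thought of as one settings object that is validated before being sent to the backend. The sketch below is illustrative only: the class name `TranscriptionSettings`, the field names, the default values, and the intermediate model sizes listed in `MODEL_SIZES` are assumptions, not this repository's actual API.

```python
from dataclasses import dataclass

# Assumed list of sizes; the README only states "from tiny to large-v2".
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v2")
METHODS = ("real-time", "sequential")


@dataclass(frozen=True)
class TranscriptionSettings:
    """Hypothetical container for the playground parameters."""

    model_size: str = "small"
    language: str = "en"                # language code, assumed format
    transcription_timeout: float = 5.0  # seconds to wait before transcribing
    beam_size: int = 5                  # candidates kept during decoding
    method: str = "real-time"

    def __post_init__(self):
        if self.model_size not in MODEL_SIZES:
            raise ValueError(f"unknown model size: {self.model_size!r}")
        if self.method not in METHODS:
            raise ValueError(f"unknown transcription method: {self.method!r}")
        if self.transcription_timeout <= 0:
            raise ValueError("transcription_timeout must be positive")
        if self.beam_size < 1:
            raise ValueError("beam_size must be at least 1")


if __name__ == "__main__":
    print(TranscriptionSettings(model_size="tiny", method="sequential"))
```

Validating up front keeps a mistyped value (for example, an unsupported model size) from surfacing later as a harder-to-read model loading error.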
- On macOS, if building the wheel for safetensors fails, install Rust
brew install rust
and try again.
- In sequential mode, speaker labels may swap unpredictably.
- In real-time mode, audio that does not reach the transcription timeout won't be transcribed.
- Batches that contain no speech may cause the model to hallucinate text.
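To illustrate the real-time limitation, here is one plausible way a timeout-based buffer could behave. This is a simplified sketch and not the repository's implementation: the class `TimeoutBuffer` and its methods are hypothetical. It shows why audio left over when a stream stops would never reach the transcriber.

```python
class TimeoutBuffer:
    """Collect audio chunks until `timeout` seconds have accumulated."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._chunks = []
        self._buffered = 0.0  # seconds of audio currently held

    def add(self, chunk, duration: float):
        """Add a chunk; return the full batch once the timeout is reached, else None."""
        self._chunks.append(chunk)
        self._buffered += duration
        if self._buffered < self.timeout:
            return None
        batch = self._chunks
        self._chunks, self._buffered = [], 0.0
        return batch

    def pending_seconds(self) -> float:
        """Seconds of audio that have not yet been released for transcription."""
        return self._buffered


if __name__ == "__main__":
    buffer = TimeoutBuffer(timeout=3.0)
    for name in ("a", "b", "c", "d"):
        batch = buffer.add(name, duration=1.0)
        if batch is not None:
            print("transcribe:", batch)
    print("left untranscribed (seconds):", buffer.pending_seconds())
```

In this sketch the fourth chunk stays in the buffer because it never reaches the three-second timeout. A shorter timeout would reduce the amount of audio lost this way, at the cost of giving the model less context per batch.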
This repository hasn't been tested for all languages; please create an issue if you encounter any problems.
This repository and the code and model weights of Whisper are released under the MIT License.