Transform your spoken words into text in real time with our Live Transcription Script, powered by Hugging Face Transformers and the state-of-the-art `openai/whisper-large-v3` model. Whether you're transcribing interviews, capturing lectures, or adding accessibility features, this Python script provides a seamless and efficient solution for live speech-to-text conversion.
- High-Accuracy Transcription: Leverages the `openai/whisper-large-v3` model for precise and reliable speech recognition.
- Real-Time Processing: Captures and transcribes audio live, displaying results instantly in the console.
- Flexible Logging:
  - Verbose Mode: Activate detailed DEBUG logs for in-depth monitoring.
  - Log File Support: Save logs to a specified file for later review.
- Device Management:
  - List Audio Devices: Easily view all available audio input devices.
  - Select Preferred Device: Choose your desired microphone or audio input source.
- Advanced Audio Handling:
  - Stereo to Mono Conversion: Automatically converts multi-channel audio to mono for consistency.
  - Resampling: Adjusts audio sampling rates to match model requirements.
  - Audio Buffering: Efficiently captures complete speech segments by buffering audio chunks.
  - Debug Audio Saving: Optionally save audio chunks for troubleshooting and analysis.
- Transcription Output:
  - Console Display: View transcriptions in real time directly in your terminal.
  - File Export: Automatically save transcriptions to a designated text file.
- Training Data Collection (Optional):
  - Save Audio for Training: Store audio files alongside their transcriptions to build custom datasets.
  - JSONL Mapping: Maintain a structured mapping of audio files to their corresponding transcriptions (see the sketch after this list).
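As a point of reference, appending one audio-to-transcription record in JSON Lines format can be as simple as the sketch below. The field names `audio_path` and `transcription` are illustrative; the script's actual schema may differ.

```python
import json

def append_mapping(mapping_file, audio_path, transcription):
    """Append one audio/transcription pair as a JSON Lines record."""
    record = {"audio_path": audio_path, "transcription": transcription}
    with open(mapping_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

append_mapping("mappings.jsonl", "training_audio/chunk_0001.wav", "hello world")
```

Because each line is an independent JSON object, the mapping file can be appended to safely during a live session and streamed later when building a dataset.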
Whether you're building applications that require live speech recognition, enhancing accessibility for users, or collecting training data for custom models, this script offers a robust and extensible foundation. Its integration with Hugging Face Transformers ensures you're utilizing cutting-edge technology that's both scalable and adaptable to various use cases.
- Python 3.7 or higher: Ensure you have Python installed. You can download it from the official website.
Install the necessary Python packages using `pip`:

```bash
pip install transformers sounddevice numpy scipy torch
```
Note: The `sounddevice` library requires appropriate backend support for audio capture on your operating system. Ensure your microphone is properly connected and recognized.
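To confirm that `sounddevice` can see your microphone before running the script, a quick check with its standard `query_devices()` API:

```python
import sounddevice as sd

# Input-capable devices report max_input_channels > 0.
for idx, dev in enumerate(sd.query_devices()):
    if dev["max_input_channels"] > 0:
        print(f"{idx}: {dev['name']} ({dev['max_input_channels']} input channels)")
```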
Run the script using Python:

```bash
python live_transcription.py [OPTIONS]
```
| Argument | Description | Default |
|---|---|---|
| `--verbose` | Enable verbose (DEBUG) logging for detailed insight during execution. | `False` |
| `--log_file LOG_FILE` | Specify a file path to save logs for future reference and debugging. | None |
| `--model_id MODEL_ID` | Choose a Hugging Face model ID for transcription. | `openai/whisper-large-v3` |
| `--output_file OUTPUT` | Define the path to save the transcription text output. | `live_transcription.txt` |
| `--debug_audio_dir DIR` | Set a directory to save audio chunks for debugging purposes. If not provided, audio chunks won't be saved. | None |
| `--training_data_dir DIR` | Designate a directory to save audio files for training custom models. | None |
| `--mapping_file FILE` | Provide a path to save audio-transcription mappings in JSON Lines format. Requires `--training_data_dir` to be set. | None |
| `--samplerate RATE` | Set the sampling rate for audio capture. 16000 Hz is recommended for Whisper models. | `16000` |
| `--block_duration SEC` | Specify the duration of each audio block in seconds for processing. | `0.5` |
| `--threshold THRESH` | Define the amplitude threshold for detecting speech. Adjust based on ambient noise levels. | `0.02` |
| `--max_silence_duration SEC` | Set the duration of silence (in seconds) that marks the end of a speech segment. | `1.0` |
| `--device DEVICE` | Select the audio input device by ID or name substring. Use `--list_devices` to view available options. | None (default device) |
| `--list_devices` | Display all available audio input devices and exit. | `False` |
| `--target_samplerate RATE` | Specify the target sampling rate for transcription. 16000 Hz is recommended for optimal compatibility with Whisper models. | `16000` |
| `--enable_diarization` | Enable speaker diarization to identify multiple speakers during transcription, using `pyannote/speaker-diarization-3.1`. | `False` |
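For context, the heavy lifting behind `--model_id` reduces to the standard Transformers automatic-speech-recognition pipeline. A minimal sketch, where a placeholder silence array stands in for a captured speech segment (this is illustrative, not the script's exact code):

```python
import numpy as np
import torch
from transformers import pipeline

# Use the first CUDA device if available, otherwise fall back to CPU.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0 if torch.cuda.is_available() else -1,
)

# One second of silence as a stand-in for a buffered speech segment:
# a mono float32 array at Whisper's expected 16 kHz rate.
audio = np.zeros(16000, dtype=np.float32)
result = asr({"raw": audio, "sampling_rate": 16000})
print(result["text"])
```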
- Basic Live Transcription: Transcribe live audio and save the output to `live_transcription.txt`.

  ```bash
  python live_transcription.py
  ```

- Enable Detailed Logging and Save Logs: Activate verbose logging and save logs to `transcription.log`.

  ```bash
  python live_transcription.py --verbose --log_file transcription.log
  ```

- List Available Audio Input Devices: View all audio input devices connected to your system.

  ```bash
  python live_transcription.py --list_devices
  ```

- Select a Specific Audio Input Device: After identifying your device ID (e.g., `2`), select it for transcription.

  ```bash
  python live_transcription.py --device 2
  ```

- Save Debug Audio Chunks: Store audio chunks in the `./debug_audio` directory for troubleshooting.

  ```bash
  python live_transcription.py --debug_audio_dir ./debug_audio
  ```

- Collect Training Data with Audio-Transcription Mapping: Save audio files for training and maintain a mapping in `mappings.jsonl`.

  ```bash
  python live_transcription.py --training_data_dir ./training_audio --mapping_file ./mappings.jsonl
  ```

- Enable Speaker Diarization: Identify multiple speakers during transcription using the `--enable_diarization` option (a wiring sketch follows this list).

  ```bash
  python live_transcription.py --enable_diarization
  ```
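If you're curious how the diarization pass can be wired up, a rough sketch with `pyannote.audio`. Note that `pyannote/speaker-diarization-3.1` is a gated model on the Hugging Face Hub, so a valid access token is required; the audio file path is a placeholder:

```python
from pyannote.audio import Pipeline

# pyannote/speaker-diarization-3.1 is gated: accept its terms on the Hub
# and supply a valid access token in place of "hf_...".
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",
)

# Run diarization on a saved audio chunk (placeholder path).
for turn, _, speaker in diarizer("chunk.wav").itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```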
- Audio Device Selection: Use the `--list_devices` option to identify the correct device ID or name substring for your audio input device.
- Sampling Rate: The default sampling rate is `16000` Hz, which is ideal for Whisper models; adjust it only if your hardware requires a different rate (see the resampling sketch after this list).
- Interrupting the Script: Press `Ctrl+C` at any time to gracefully stop the live transcription session.
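Conceptually, the stereo-to-mono conversion and resampling the script performs look like this sketch using NumPy and `scipy` (the helper name `to_mono_16k` and the example rates are illustrative, not the script's actual code):

```python
import numpy as np
from scipy.signal import resample_poly

def to_mono_16k(audio, samplerate, target=16000):
    """Average channels down to mono, then resample to the target rate."""
    if audio.ndim > 1:                       # (frames, channels) -> (frames,)
        audio = audio.mean(axis=1)
    if samplerate != target:
        audio = resample_poly(audio, target, samplerate)
    return audio.astype(np.float32)

# Example: 48 kHz stereo capture converted to 16 kHz mono for Whisper.
stereo_48k = np.random.randn(48000, 2).astype(np.float32)
mono_16k = to_mono_16k(stereo_48k, 48000)    # 16000 samples
```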
- No Audio Detected:
  - Ensure the correct audio input device is selected.
  - Verify that your microphone is properly connected and functioning.
- High CPU/GPU Usage:
  - Transcription models are resource-intensive. Ensure your system meets the necessary hardware requirements.
  - Consider using a GPU for enhanced performance and reduced processing time.
- Dependency Issues:
  - Confirm all required packages are installed.
  - Use a virtual environment to manage dependencies and avoid conflicts.
- Audio Quality Issues:
  - Adjust the `--threshold` and `--samplerate` parameters to better suit your environment and hardware setup (see the sketch after this list).
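To tune `--threshold` and `--max_silence_duration`, it helps to see the logic they control. A simplified sketch of amplitude-based speech detection; the real script buffers blocks along similar lines, and the constants are illustrative:

```python
import numpy as np

THRESHOLD = 0.02        # --threshold: peak amplitude that counts as speech
MAX_SILENCE_SEC = 1.0   # --max_silence_duration: silence that closes a segment
BLOCK_SEC = 0.5         # --block_duration: length of each captured block

buffer, silence = [], 0.0

def process_block(block):
    """Buffer speech blocks; return a finished segment after enough silence."""
    global silence
    if np.abs(block).max() >= THRESHOLD:
        buffer.append(block)
        silence = 0.0
    elif buffer:
        silence += BLOCK_SEC
        if silence >= MAX_SILENCE_SEC:
            segment = np.concatenate(buffer)  # ready for transcription
            buffer.clear()
            silence = 0.0
            return segment
    return None
```

Raising `THRESHOLD` makes the detector less sensitive to background noise; raising `MAX_SILENCE_SEC` tolerates longer pauses before a segment is cut.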
Contributions are welcome! Whether it's reporting bugs, suggesting features, or improving documentation, your input helps make this project better.
- Fork the Repository
- Create a Feature Branch: `git checkout -b feature/YourFeature`
- Commit Your Changes: `git commit -m 'Add some feature'`
- Push to the Branch: `git push origin feature/YourFeature`
- Open a Pull Request
This project is licensed under the MIT License. You are free to use, modify, and distribute it as per the license terms.
- Hugging Face Transformers for providing powerful NLP tools.
- OpenAI Whisper for advanced speech recognition capabilities.
- SoundDevice for seamless audio capture in Python.
- Scipy and NumPy for scientific computing essentials.
For any inquiries, suggestions, or support, feel free to open an issue on GitHub.
Embark on a seamless journey from speech to text with our Live Transcription Script, harnessing the latest in machine learning and audio processing technologies!