
Live Transcription with Hugging Face Transformers

Transform your spoken words into text in real time with our Live Transcription Script, powered by Hugging Face Transformers and the state-of-the-art openai/whisper-large-v3 model. Whether you're conducting interviews, recording lectures, or building accessibility features, this Python script provides a seamless and efficient solution for live speech-to-text conversion.

🚀 Features

  • High-Accuracy Transcription: Leverages the openai/whisper-large-v3 model for precise and reliable speech recognition.
  • Real-Time Processing: Captures and transcribes audio live, displaying results instantly in the console.
  • Flexible Logging:
    • Verbose Mode: Activate detailed DEBUG logs for in-depth monitoring.
    • Log File Support: Save logs to a specified file for later review.
  • Device Management:
    • List Audio Devices: Easily view all available audio input devices.
    • Select Preferred Device: Choose your desired microphone or audio input source.
  • Advanced Audio Handling (see the sketch after this feature list):
    • Stereo to Mono Conversion: Automatically converts multi-channel audio to mono for consistency.
    • Resampling: Adjusts audio sampling rates to match model requirements.
    • Audio Buffering: Efficiently captures complete speech segments by buffering audio chunks.
    • Debug Audio Saving: Optionally save audio chunks for troubleshooting and analysis.
  • Transcription Output:
    • Console Display: View transcriptions in real-time directly in your terminal.
    • File Export: Automatically save transcriptions to a designated text file.
  • Training Data Collection (Optional):
    • Save Audio for Training: Store audio files alongside their transcriptions to build custom datasets.
    • JSONL Mapping: Maintain a structured mapping of audio files to their corresponding transcriptions.
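
The audio-handling steps above come down to a few lines of NumPy/SciPy. The sketch below is a minimal illustration of the general technique under the defaults documented here (16000 Hz target, 0.02 threshold); the function names are illustrative, not the script's actual API.

```python
import numpy as np
from scipy.signal import resample


def prepare_chunk(chunk: np.ndarray, samplerate: int, target_samplerate: int = 16000) -> np.ndarray:
    """Convert a captured audio block to mono float32 at the model's sampling rate."""
    # Stereo-to-mono conversion: average across channels.
    if chunk.ndim > 1:
        chunk = chunk.mean(axis=1)
    # Resampling: stretch or shrink the block to the target rate.
    if samplerate != target_samplerate:
        chunk = resample(chunk, int(len(chunk) * target_samplerate / samplerate))
    return chunk.astype(np.float32)


def is_speech(chunk: np.ndarray, threshold: float = 0.02) -> bool:
    """Amplitude gate: treat a block as speech if its peak exceeds the threshold."""
    return float(np.abs(chunk).max()) > threshold
```

Buffering then amounts to appending blocks that pass the gate and flushing the accumulated buffer to the model once silence has lasted longer than --max_silence_duration.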

🌟 Why Choose This Script?

Whether you're building applications that require live speech recognition, enhancing accessibility for users, or collecting training data for custom models, this script offers a robust and extensible foundation. Its integration with Hugging Face Transformers ensures you're utilizing cutting-edge technology that's both scalable and adaptable to various use cases.

📦 Installation

🛠 Prerequisites

  • Python 3.7 or higher: Ensure you have Python installed. You can download it from python.org.

📥 Dependencies

Install the necessary Python packages using pip:

pip install transformers sounddevice numpy scipy torch

Note: The sounddevice library is a Python binding for PortAudio. The pip wheels bundle PortAudio on Windows and macOS; on Linux you may need to install it via your package manager (e.g., libportaudio2). Ensure your microphone is properly connected and recognized by the operating system.
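
To see how these packages fit together, here is a minimal, non-streaming sketch that transcribes a NumPy audio buffer with the Transformers pipeline API. It assumes 16 kHz mono float32 audio and stands in for the full live script:

```python
import numpy as np
import torch
from transformers import pipeline

# Load the ASR pipeline once; use a GPU if one is available.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0 if torch.cuda.is_available() else -1,
)

# `audio` is a mono float32 NumPy array captured elsewhere;
# a second of silence serves as a placeholder here.
audio = np.zeros(16000, dtype=np.float32)
result = asr({"raw": audio, "sampling_rate": 16000})
print(result["text"])
```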

📝 Usage

Run the script using Python:

python live_transcription.py [OPTIONS]

🎛 Command-Line Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --verbose | Enable verbose (DEBUG) logging for detailed insight during execution. | False |
| --log_file LOG_FILE | File path for saving logs for later review and debugging. | None |
| --model_id MODEL_ID | Hugging Face model ID to use for transcription. | openai/whisper-large-v3 |
| --output_file OUTPUT | Path for saving the transcription text output. | live_transcription.txt |
| --debug_audio_dir DIR | Directory for saving audio chunks for debugging. If not provided, audio chunks are not saved. | None |
| --training_data_dir DIR | Directory for saving audio files for training custom models. | None |
| --mapping_file FILE | Path for saving audio-transcription mappings in JSON Lines format. Requires --training_data_dir. | None |
| --samplerate RATE | Sampling rate for audio capture. 16000 Hz is recommended for Whisper models. | 16000 |
| --block_duration SEC | Duration of each audio block in seconds. | 0.5 |
| --threshold THRESH | Amplitude threshold for detecting speech; adjust for ambient noise levels. | 0.02 |
| --max_silence_duration SEC | Duration of silence (in seconds) that marks the end of a speech segment. | 1.0 |
| --device DEVICE | Audio input device, selected by ID or name substring. Use --list_devices to view options. | None (default device) |
| --list_devices | Display all available audio input devices and exit. | False |
| --target_samplerate RATE | Target sampling rate for transcription. 16000 Hz is recommended for Whisper models. | 16000 |
| --enable_diarization | Enable speaker diarization to label multiple speakers, using pyannote/speaker-diarization-3.1. | None |
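
For reference, these flags map naturally onto Python's argparse. The sketch below wires up a representative subset; it is an illustration of the interface, not the script's exact parser.

```python
import argparse

parser = argparse.ArgumentParser(
    description="Live transcription with Hugging Face Transformers"
)
parser.add_argument("--verbose", action="store_true", help="Enable DEBUG logging")
parser.add_argument("--model_id", default="openai/whisper-large-v3", help="Hugging Face model ID")
parser.add_argument("--output_file", default="live_transcription.txt", help="Transcription output path")
parser.add_argument("--samplerate", type=int, default=16000, help="Capture sampling rate in Hz")
parser.add_argument("--block_duration", type=float, default=0.5, help="Audio block length in seconds")
parser.add_argument("--threshold", type=float, default=0.02, help="Speech amplitude threshold")
parser.add_argument("--max_silence_duration", type=float, default=1.0, help="Silence (s) that ends a segment")
parser.add_argument("--list_devices", action="store_true", help="List audio input devices and exit")
args = parser.parse_args()
```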

📚 Examples

  1. Basic Live Transcription:

    Transcribe live audio and save the output to live_transcription.txt.

    python live_transcription.py
  2. Enable Detailed Logging and Save Logs:

    Activate verbose logging and save logs to transcription.log.

    python live_transcription.py --verbose --log_file transcription.log
  3. List Available Audio Input Devices:

    View all audio input devices connected to your system.

    python live_transcription.py --list_devices
  4. Select a Specific Audio Input Device:

    After identifying your device ID (e.g., 2), select it for transcription.

    python live_transcription.py --device 2
  5. Save Debug Audio Chunks:

    Store audio chunks in the ./debug_audio directory for troubleshooting.

    python live_transcription.py --debug_audio_dir ./debug_audio
  6. Collect Training Data with Audio-Transcription Mapping:

    Save audio files for training and maintain a mapping in mappings.jsonl (a format sketch follows these examples).

    python live_transcription.py --training_data_dir ./training_audio --mapping_file ./mappings.jsonl
  7. Enable Speaker Diarization:

    Identify multiple speakers during transcription using the --enable_diarization option.

    python live_transcription.py --enable_diarization
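
For example 7, the script handles the integration internally; as background, a standalone diarization pass with pyannote.audio looks roughly like the following (the token and file name are placeholders, and the model's terms must be accepted on Hugging Face first):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder: supply your own token
)
diarization = pipeline("segment.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")
```

For example 6, the JSON Lines mapping pairs each saved audio file with its transcription, one JSON object per line. The exact field names are defined by the script; the record below is a hypothetical illustration of the format:

```python
import json

# Hypothetical record layout; the script's actual schema may differ.
record = {"audio": "training_audio/segment_0001.wav", "text": "hello world"}
with open("mappings.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```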

📌 Notes

  • Audio Device Selection: Use the --list_devices option to identify the correct device ID or name substring for your audio input device.

  • Sampling Rate: The default sampling rate is set to 16000 Hz, which is ideal for Whisper models. Adjust only if necessary based on your specific requirements.

  • Interrupting the Script: Press Ctrl+C at any time to gracefully stop the live transcription session.
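
Putting these notes together, a stripped-down capture loop with sounddevice might look like the following. It only reports each block's peak level and exits cleanly on Ctrl+C, standing in for the real transcription path:

```python
import numpy as np
import sounddevice as sd

SAMPLERATE = 16000
BLOCK_DURATION = 0.5  # seconds

print(sd.query_devices())  # what --list_devices shows


def callback(indata, frames, time, status):
    if status:
        print(status)
    # Placeholder for buffering/transcription: report the block's peak amplitude.
    print(f"peak: {np.abs(indata).max():.3f}")


try:
    with sd.InputStream(
        samplerate=SAMPLERATE,
        channels=1,
        blocksize=int(SAMPLERATE * BLOCK_DURATION),
        callback=callback,
    ):
        print("Listening... press Ctrl+C to stop.")
        while True:
            sd.sleep(1000)
except KeyboardInterrupt:
    print("\nStopped.")
```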

🛠 Troubleshooting

  • No Audio Detected:

    • Ensure the correct audio input device is selected.
    • Verify that your microphone is properly connected and functioning.
  • High CPU/GPU Usage:

    • Transcription with openai/whisper-large-v3 (roughly 1.5 billion parameters) is resource-intensive; expect heavy CPU load or several gigabytes of GPU memory.
    • Consider using a GPU for lower latency, or pass a smaller checkpoint (e.g., --model_id openai/whisper-small) on CPU-only machines.
  • Dependency Issues:

    • Confirm all required packages are installed.
    • Use a virtual environment to manage dependencies and avoid conflicts (see the commands at the end of this section).
  • Audio Quality Issues:

    • Adjust the --threshold and --samplerate parameters to better suit your environment and hardware setup.
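
For example, isolating the dependencies with Python's built-in venv module (commands shown for Linux/macOS):

python -m venv .venv
source .venv/bin/activate
pip install transformers sounddevice numpy scipy torch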

📖 Contributing

Contributions are welcome! Whether it's reporting bugs, suggesting features, or improving documentation, your input helps make this project better.

  1. Fork the Repository
  2. Create a Feature Branch: git checkout -b feature/YourFeature
  3. Commit Your Changes: git commit -m 'Add some feature'
  4. Push to the Branch: git push origin feature/YourFeature
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License. You are free to use, modify, and distribute it as per the license terms.

🙏 Acknowledgements

This project builds on Hugging Face Transformers, OpenAI's Whisper models (openai/whisper-large-v3), pyannote.audio for speaker diarization, and python-sounddevice for audio capture.

📫 Contact

For any inquiries, suggestions, or support, feel free to open an issue on GitHub.


Embark on a seamless journey from speech to text with our Live Transcription Script, harnessing the latest in machine learning and audio processing technologies!