Linux Dictation

⚠️ Important Notice: This repository is provided as-is, without active maintenance or support. While the code is functional, I cannot provide fixes or updates. Users are welcome to fork the repository and make their own modifications. Pull requests are welcome but must be thoroughly tested and documented.

A real-time speech-to-text dictation tool for Linux, powered by Whisper models (local or cloud), with support for ElevenLabs TTS and Ollama LLM integration. It has primarily been tested on Fedora 40 (the distribution used by Linus Torvalds himself), but it should work on any Linux distribution that provides the required dependencies.

Demo

demo-2024-12-15.14-57-18.mp4

Watch the demo above to see Linux Dictation in action, featuring real-time speech-to-text, voice commands, and AI-powered text improvements.

Features

Real-time speech-to-text conversion using Whisper
Text-to-Speech capabilities via ElevenLabs
LLM-powered chat mode and text improvement using Ollama
Support for multiple Whisper models
Voice activity detection for improved accuracy
Automatic text insertion into active window using ydotool or xdotool
Configurable voice commands and ignored phrases
Multiple operation modes: dictation, chat, and proofreading

Requirements

Linux (primarily tested on Fedora 40)
Python 3.11 or higher
PortAudio development libraries
ydotool or xdotool for text input
NVIDIA GPU (optional, for GPU acceleration)
Ollama (optional, for LLM features)
ElevenLabs API key (optional, for TTS features)
Whisper instance (can be local Docker container, OpenAI API, or any compatible endpoint)

Installation

Clone the repository:

```
git clone https://github.com/mysticaltech/linux_dictation.git
cd linux_dictation
```

Install system dependencies:

```
sudo dnf install python3-pip python3-devel portaudio-devel ydotool xdotool
```

Install Poetry (if not already installed):

```
curl -sSL https://install.python-poetry.org | python3 -
```

Install project dependencies:
```
poetry install
```

Configure environment variables:

```
cp .env.example .env
# Edit .env with your API keys and preferences
```

Configuration

Environment Variables (.env)

WHISPER_MODEL: Whisper model to use (default: "faster-distil-whisper-large-v3")
WHISPER_BASE_URL: Whisper API endpoint (can be local Docker container, OpenAI API, or any compatible service)
ELEVENLABS_API_KEY: Your ElevenLabs API key
ELEVENLABS_VOICE_ID: Your chosen ElevenLabs voice ID
OLLAMA_API_URL: Ollama API endpoint
OLLAMA_MODEL: Your chosen Ollama model
OLLAMA_TIMEOUT: API timeout in seconds
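
As a rough illustration of how these variables might be read at startup, here is a minimal sketch. It is not the project's actual loader: the `Settings` class and `load_settings` function are hypothetical names, the 30-second timeout fallback is an assumption, and the code assumes the values from `.env` have already been exported into the process environment. Only the `WHISPER_MODEL` default comes from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables documented above."""
    whisper_model: str
    whisper_base_url: Optional[str]
    elevenlabs_api_key: Optional[str]
    elevenlabs_voice_id: Optional[str]
    ollama_api_url: Optional[str]
    ollama_model: Optional[str]
    ollama_timeout: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment-like mapping."""

    def optional(name: str) -> Optional[str]:
        # Treat unset and blank values the same way: the setting is absent.
        value = env.get(name, "").strip()
        return value or None

    return Settings(
        # This default is the one documented above.
        whisper_model=optional("WHISPER_MODEL") or "faster-distil-whisper-large-v3",
        whisper_base_url=optional("WHISPER_BASE_URL"),
        elevenlabs_api_key=optional("ELEVENLABS_API_KEY"),
        elevenlabs_voice_id=optional("ELEVENLABS_VOICE_ID"),
        ollama_api_url=optional("OLLAMA_API_URL"),
        ollama_model=optional("OLLAMA_MODEL"),
        # Assumed fallback of 30 seconds when OLLAMA_TIMEOUT is unset.
        ollama_timeout=float(optional("OLLAMA_TIMEOUT") or 30),
    )
```

Leaving the optional keys as `None` when they are unset would let the TTS and LLM features be skipped cleanly, which matches their optional status in the requirements.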

Usage

Start the dictation service:
```
./start.sh
```
Or manually:
```
poetry run python main.py
```
Available voice commands:
- "pause dictation" - Pause transcription
- "resume dictation" - Resume transcription
- "chat mode" - Switch to interactive LLM chat mode
- "dictation mode" - Switch to standard dictation mode
- "read aloud" - Read the selected text aloud using TTS
- "make awesome" - Improve selected text using LLM
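
The voice commands listed above could be recognized with a simple lookup on the transcribed text. The sketch below is an assumption about one possible approach, not the project's implementation: `VOICE_COMMANDS`, `normalize`, `match_command`, and the action names are all hypothetical. It normalizes the transcript first because speech-to-text output often carries capitalization and trailing punctuation, and it only matches when the whole utterance is a command, so ordinary dictated sentences that happen to contain a command phrase are left alone.

```python
import string
from typing import Optional

# Spoken phrase -> hypothetical internal action name.
VOICE_COMMANDS = {
    "pause dictation": "pause",
    "resume dictation": "resume",
    "chat mode": "enter_chat_mode",
    "dictation mode": "enter_dictation_mode",
    "read aloud": "read_selection_aloud",
    "make awesome": "improve_selection",
}


def normalize(phrase: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    cleaned = phrase.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def match_command(transcript: str) -> Optional[str]:
    """Return the action if the whole transcript is a command, else None."""
    return VOICE_COMMANDS.get(normalize(transcript))
```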

Operation Modes:
- Dictation Mode: Standard speech-to-text
- Chat Mode: Interactive conversations with LLM
- Proofreading Mode: Text improvement and suggestions

Press Ctrl+C in the terminal to stop the application.

Advanced Features

Text-to-Speech (TTS)

Requires ElevenLabs API key
Supports reading selected text aloud
Configurable voice and model settings

LLM Integration

Requires Ollama installation
Supports chat mode for interactive conversations
Text improvement and proofreading capabilities

Input Methods

Primary: ydotool for Wayland support
Fallback: xdotool for X11 compatibility
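
The primary/fallback order above can be sketched as follows. This is an illustrative assumption rather than the project's code: `build_type_command` and `type_text` are hypothetical names, and the check only looks for the binaries on `PATH`. It does not verify that the ydotool daemon is running or which display server is in use.

```python
import shutil
import subprocess
from typing import Callable, List, Optional


def build_type_command(
    text: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the argv that would type `text`, preferring ydotool over xdotool."""
    for tool in ("ydotool", "xdotool"):
        if which(tool):
            # Both tools provide a `type` subcommand that takes the text to enter.
            return [tool, "type", text]
    raise RuntimeError("Neither ydotool nor xdotool was found on PATH")


def type_text(text: str) -> None:
    """Type `text` into the focused window using the first available tool."""
    subprocess.run(build_type_command(text), check=True)
```

Separating command construction from execution keeps the selection logic easy to check without actually sending keystrokes to the desktop.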

Troubleshooting

Audio Input Issues:
- Check microphone settings in system settings
- Verify microphone permissions
- Test microphone with pavucontrol

Text Input Problems:
- Check ydotool service status
- Verify xdotool installation
- Check input method compatibility

LLM/TTS Issues:
- Verify API keys in .env
- Check Ollama service status
- Confirm network connectivity
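
For the Ollama check in particular, a quick connectivity probe might look like the sketch below. It assumes that `OLLAMA_API_URL` holds the server's base URL (Ollama listens on port 11434 by default) and uses Ollama's `/api/tags` endpoint, which lists locally available models. The function name `ollama_is_reachable` is hypothetical and not part of this project.

```python
import http.client
import urllib.request


def ollama_is_reachable(base_url: str = "http://localhost:11434", timeout: float = 3.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    url = base_url.rstrip("/") + "/api/tags"  # endpoint that lists local models
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException, ValueError):
        # URLError, HTTPError, and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False
```

If this returns `False`, the Ollama service is probably not running or the configured URL is wrong. If it returns `True` but chat mode still fails, the configured model name is a more likely culprit.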

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.
