
Transform your Linux system into a powerful voice-controlled workstation with real-time dictation, AI-powered chat, and text enhancement capabilities.


Linux Dictation

⚠️ Important Notice: This repository is provided as-is, without active maintenance or support. While the code is functional, I cannot provide fixes or updates. Users are welcome to fork the repository and make their own modifications. Pull requests are welcome but must be thoroughly tested and documented.

A real-time speech-to-text dictation tool for Linux, powered by Whisper models (local or cloud), with support for ElevenLabs TTS and Ollama LLM integration. While primarily tested on Fedora 40 (the distribution used by Linus Torvalds himself), it should theoretically work on any Linux distribution with the proper dependencies installed.

Demo

demo-2024-12-15.14-57-18.mp4

Watch the demo above to see Linux Dictation in action, featuring real-time speech-to-text, voice commands, and AI-powered text improvements.

Features

  • Real-time speech-to-text conversion using Whisper
  • Text-to-Speech capabilities via ElevenLabs
  • LLM-powered chat mode and text improvement using Ollama
  • Support for multiple Whisper models
  • Voice activity detection for improved accuracy
  • Automatic text insertion into active window using ydotool or xdotool
  • Configurable voice commands and ignored phrases
  • Multiple operation modes: dictation, chat, and proofreading

Requirements

  • Linux (primarily tested on Fedora 40)
  • Python 3.11 or higher
  • PortAudio development libraries
  • ydotool or xdotool for text input
  • NVIDIA GPU (optional, for GPU acceleration)
  • Ollama (optional, for LLM features)
  • ElevenLabs API key (optional, for TTS features)
  • Whisper instance (can be local Docker container, OpenAI API, or any compatible endpoint)

Installation

  1. Clone the repository:

    git clone https://github.com/mysticaltech/linux_dictation.git
    cd linux_dictation
  2. Install system dependencies:

    sudo dnf install python3-pip python3-devel portaudio-devel ydotool xdotool
  3. Install Poetry (if not already installed):

    curl -sSL https://install.python-poetry.org | python3 -
  4. Install project dependencies:

    poetry install
  5. Configure environment variables:

    cp .env.example .env
    # Edit .env with your API keys and preferences

Configuration

Environment Variables (.env)

  • WHISPER_MODEL: Choose the Whisper model (default: "faster-distil-whisper-large-v3")
  • WHISPER_BASE_URL: Whisper API endpoint (can be local Docker container, OpenAI API, or any compatible service)
  • ELEVENLABS_API_KEY: Your ElevenLabs API key
  • ELEVENLABS_VOICE_ID: Your chosen ElevenLabs voice ID
  • OLLAMA_API_URL: Ollama API endpoint
  • OLLAMA_MODEL: Your chosen Ollama model
  • OLLAMA_TIMEOUT: Timeout for Ollama API requests, in seconds

Usage

  1. Start the dictation service:

    ./start.sh

    Or manually:

    poetry run python main.py
  2. Available voice commands:

    • "pause dictation" - Pause transcription
    • "resume dictation" - Resume transcription
    • "chat mode" - Switch to interactive LLM chat mode
    • "dictation mode" - Switch to standard dictation mode
    • "read aloud" - TTS reading of selected text
    • "make awesome" - Improve selected text using LLM
  3. Operation Modes:

    • Dictation Mode: Standard speech-to-text
    • Chat Mode: Interactive conversations with LLM
    • Proofreading Mode: Text improvement and suggestions
  4. Press Ctrl+C in the terminal to stop the application.

Advanced Features

Text-to-Speech (TTS)

  • Requires ElevenLabs API key
  • Supports reading selected text aloud
  • Configurable voice and model settings

LLM Integration

  • Requires Ollama installation
  • Supports chat mode for interactive conversations
  • Text improvement and proofreading capabilities

Input Methods

  • Primary: ydotool for Wayland support
  • Fallback: xdotool for X11 compatibility
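
The primary/fallback order described above can be sketched as follows. This is an illustrative assumption about how the selection could work, not the project's actual logic; `pick_typing_command` is a hypothetical helper. It relies only on `shutil.which` and on the `type` subcommand that both tools provide.

```python
import shutil


def pick_typing_command(text: str, which=shutil.which) -> list[str]:
    """Return the argv that would type `text` into the active window.

    Prefers ydotool (works under Wayland) and falls back to xdotool (X11).
    `which` is injectable so the choice can be tested without either tool.
    """
    if which("ydotool"):
        return ["ydotool", "type", text]
    if which("xdotool"):
        return ["xdotool", "type", text]
    raise RuntimeError("Neither ydotool nor xdotool was found on PATH")
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Using an argument list rather than a shell string avoids quoting problems with dictated text.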

Troubleshooting

  • Audio Input Issues:

    • Check microphone settings in system settings
    • Verify microphone permissions
    • Test microphone with pavucontrol
  • Text Input Problems:

    • Check that the ydotoold daemon is running (ydotool needs it to send input)
    • Verify xdotool installation
    • Check input method compatibility
  • LLM/TTS Issues:

    • Verify API keys in .env
    • Check Ollama service status
    • Confirm network connectivity
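
As a quick way to check the Ollama service, the sketch below asks the server for its model list over HTTP. It assumes the base URL of your Ollama instance (http://localhost:11434 is Ollama's usual default) and uses Ollama's `/api/tags` endpoint. The `ollama_reachable` helper is hypothetical and not part of this project.

```python
import urllib.request


def ollama_reachable(base_url: str = "http://localhost:11434",
                     timeout: float = 3.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    url = base_url.rstrip("/") + "/api/tags"  # lists locally available models
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a malformed URL.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_reachable())
```

If this prints `False` while Ollama is installed, the service is probably not running, or OLLAMA_API_URL points at a different host or port.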

Contributing

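Contributions are welcome. Pull requests must be thoroughly tested and documented. Issues can be submitted, but since the repository is not actively maintained, fixes are not guaranteed.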

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements