🗣 ⇢ `TalkSee` ⇢ 👀

Software Design Document (SDD)

🗣 ⇢ `Table o'Contents` ⇢ 👀

Introduction
System Overview
Dependencies
Installation
Features
Functionality
Future Enhancements

Introduction

🗣 ⇢ TalkSee ⇢ 👀 is a speech-to-text application that allows users to transcribe audio files or microphone input using the WhisperAI ASR models.

System Overview

GUI

The graphical user interface is powered by Streamlit.

Model Selection

Provides a GUI control to select a WhisperAI ASR model.
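As a rough illustration of what could sit behind the model selector, the sketch below stores the VRAM estimates from the model table in the Features section and suggests a default model. The names `MODEL_VRAM_GB` and `largest_model_that_fits` are hypothetical and are not taken from the TalkSee source.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the model table in the Features section.
# Ordered from smallest to largest model.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
}


def largest_model_that_fits(available_vram_gb: float) -> Optional[str]:
    """Return the largest listed model whose VRAM estimate fits, or None."""
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= available_vram_gb]
    # Dict order runs from smallest to largest, so the last fit is the biggest.
    return fitting[-1] if fitting else None
```

In a Streamlit app the keys of this dictionary could be passed as the options of `st.selectbox`, with the suggested model as the preselected entry. That wiring is an assumption, not a description of how TalkSee builds its selector.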

Audio Input

Supports two modes of audio input: microphone input and file upload.
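Both input modes can be reduced to the same thing: a block of audio bytes held in memory. This assumes the recorder widget returns the recording as bytes and that the uploaded file exposes its content as bytes. A minimal sketch of handing those bytes to a file-based transcriber follows. The helper name `save_audio_to_temp` is hypothetical.

```python
import tempfile


def save_audio_to_temp(audio_bytes: bytes, suffix: str = ".wav") -> str:
    """Write raw audio bytes to a temporary file and return its path."""
    if not audio_bytes:
        raise ValueError("no audio data received")
    # delete=False keeps the file on disk after the handle closes,
    # so the caller is responsible for removing it when done.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(audio_bytes)
        return tmp.name
```

Writing to a temporary file keeps the two modes on a single code path. The caller should delete the file after transcription.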

Speech Recognition/Transcription

Employs a WhisperAI ASR model to transcribe the user audio input into text.
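The transcription step might look roughly like the sketch below, assuming the "WhisperAI ASR" dependency is the `openai-whisper` package with its `whisper.load_model` and `model.transcribe` calls. The function name `transcribe_file` and the injectable `load_model` parameter are hypothetical and exist only so the sketch can be tried without downloading a model.

```python
from typing import Any, Callable, Optional


def transcribe_file(
    audio_path: str,
    model_name: str = "base",
    models_path: str = "models",
    device: str = "cpu",
    load_model: Optional[Callable[..., Any]] = None,
) -> str:
    """Transcribe one audio file and return the recognized text."""
    if load_model is None:
        # Imported lazily so this module loads even without Whisper installed.
        import whisper  # assumption: provided by the openai-whisper package

        load_model = whisper.load_model
    # download_root tells the loader where model weights are cached.
    model = load_model(model_name, device=device, download_root=models_path)
    result = model.transcribe(audio_path)
    return result["text"].strip()
```

Loading a model is slow, so a real app would probably cache the loaded model between runs instead of reloading it on every transcription.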

Text Output

Displays the transcribed text to the user.

Dependencies

The 🗣 ⇢ TalkSee ⇢ 👀 web app relies on the following external libraries and resources:

Python 3.x
os: Provides the operating system interface (paths and directories).
time: Provides time-related functions.
io: Provides in-memory input/output streams.

Streamlit: Provides the user interface framework.
audio_recorder_streamlit: Provides the microphone audio input.
PyTorch: Provides the neural network library, with optional GPU processing.
WhisperAI ASR: Provides the speech recognition functionality.

Installation

Clone the repository:

gh repo clone PedroZappa/TalkSee

Change the current directory to the cloned repository:

cd TalkSee

Install the required packages from the requirements.txt file:

pip install -r requirements.txt

Create a .streamlit/secrets.toml file and set the MODELS_PATH variable to the directory where models should be stored:

mkdir -p .streamlit && echo 'MODELS_PATH="models"' >> .streamlit/secrets.toml

Run the Streamlit application:

streamlit run main.py
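The MODELS_PATH value written to .streamlit/secrets.toml in the configuration step has to be read somewhere at startup. The following is a minimal sketch of one way to do that, assuming the secrets are available as a mapping (Streamlit exposes them as `st.secrets`). The helper name `resolve_models_path` and the "models" fallback are hypothetical.

```python
import os
from typing import Mapping


def resolve_models_path(secrets: Mapping[str, str], default: str = "models") -> str:
    """Return an absolute model directory, creating it if it is missing."""
    path = secrets.get("MODELS_PATH", default)
    # exist_ok=True makes repeated app reruns harmless.
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

Inside the app this would presumably be called as `resolve_models_path(st.secrets)`. A plain dictionary works equally well for a quick check outside Streamlit.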

Features

Streamlit-based user interface for easy interaction.
Select WhisperAI ASR model from the list of available models:

Size	Parameters	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny`	~1 GB	~32x
base	74 M	`base`	~1 GB	~16x
small	244 M	`small`	~2 GB	~6x
medium	769 M	`medium`	~5 GB	~2x
large	1550 M	`large`	~10 GB	1x

Automatic device selection: uses CUDA for GPU processing when it is available, otherwise runs on the CPU.
Support for both microphone input and audio file upload.
Display of the transcribed text to the user.
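The CUDA check mentioned in the feature list can be sketched with PyTorch's `torch.cuda.is_available()`. The function name `pick_device` is hypothetical, and the fallback to "cpu" when PyTorch is not installed is an added assumption so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so assume CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The returned string can be passed on as the device when loading a model.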

Functionality

Select WhisperAI ASR model from the available options.
Choose an input mode (Mic or File).
- If using the Mic, click the "microphone-icon" button to start recording audio. The recording will stop automatically after 2 seconds of silence.
- If using File, upload an audio file in .wav, .mp3 or .m4a format.
Click the Transcribe button to transcribe the recorded or uploaded audio.
The transcribed text is displayed in the "Transcription" section.
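The file-upload step accepts only .wav, .mp3 and .m4a files. A small guard like the one below could enforce that before transcription starts. It is a sketch with hypothetical names (`SUPPORTED_EXTENSIONS`, `is_supported_audio`), not code from the application.

```python
import os

# File types listed in the Functionality section.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a"}


def is_supported_audio(filename: str) -> bool:
    """Check the file extension, ignoring letter case."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in SUPPORTED_EXTENSIONS
```

Streamlit's `st.file_uploader` also takes a `type` list of allowed extensions, so the same restriction can be applied in the widget itself. The explicit check is still useful for clear error messages.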

Future Enhancements

Some possible future enhancements for 🗣 ⇢ TalkSee ⇢ 👀 include:

Support for mobile devices.
Support for additional speech recognition models.
Real-time transcription of live audio input.
Integration with cloud storage services for seamless file upload and storage.
Improved error handling and user feedback.
Image generation using the transcribed text as a prompt.