
TalkTales


Table of Contents

  1. Introduction
  2. Motivation
  3. Technologies
  4. Getting Started
  5. Usage
  6. Roadmap
  7. Authors
  8. License

Introduction

TalkTales is a specialized tool designed to assist deaf and hard-of-hearing individuals in their daily interactions. Unlike conventional speech-to-text services, TalkTales goes a step further by providing contextual understanding through speaker differentiation.

Motivation

The primary objective of this project is to develop a specialized tool aimed at assisting deaf individuals in their day-to-day interactions. While the overarching goal is to convert spoken language into text, the unique aspect of this project lies in its approach to speaker differentiation. By highlighting changes in the speaker's voice within the transcribed text, our tool aims to offer an enhanced contextual understanding, a feature often missing in traditional speech-to-text services.

Technologies

We leverage the open-source Vosk model for the core speech-to-text translation. However, our methodology diverges from mainstream solutions, as we are intent on reducing our dependence on machine learning algorithms. The goal is not merely to create a functional tool but to deepen our understanding of sound and voice phenomena. Most of the concepts are explained in detail inside the docs directory.

Note: Development of this repository is currently on hold due to university workload.

Getting Started

Prerequisites

  • Python 3.10+ installed
  • git installed

Installation

First, clone the repository to your local machine:

git clone https://github.com/kryczkal/TalkTales.git && cd TalkTales

Before installing the dependencies, it is highly recommended to set up a virtual environment inside the project directory:

python -m venv myPythonEnv
source myPythonEnv/bin/activate

Then install the Python dependencies with pip:

pip install -r requirements.txt

You are ready to go.

Usage

Main Application

Run the application with

python main.py

or, on Linux:

./main.py

Utilities

Various utilities are also provided alongside the main app. These include:

DiarizationTester

A program for invoking the Diarizer components without the application frontend. It either loads an audio file or connects to a live stream, then writes speaker changes to stdout. If plotting is enabled (via the Settings file), it will also plot speakers over time.

Usage
python DiarizationTester.py [optional: filename]

or

./DiarizationTester.py [optional: filename]
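As an illustration of the kind of per-frame signal analysis a diarizer builds on, the sketch below flags speaker-change candidates by comparing the spectral centroid of consecutive audio frames. This is a simplified, hypothetical stand-in for the project's actual pipeline — the function names, threshold, and synthetic "two speakers" (two pure tones) are invented for the example.

```python
import math

def spectral_centroid(frame, sample_rate=16000):
    """Spectral centroid of one frame via a naive DFT (illustrative, O(n^2))."""
    n = len(frame)
    mags, freqs = [], []
    for k in range(1, n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
        freqs.append(k * sample_rate / n)
    total = sum(mags) or 1.0
    return sum(f * m for f, m in zip(freqs, mags)) / total

def detect_changes(samples, frame_len=256, threshold=200.0):
    """Indices of frames whose centroid jumps relative to the previous frame."""
    centroids = [spectral_centroid(samples[i:i + frame_len])
                 for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [i for i in range(1, len(centroids))
            if abs(centroids[i] - centroids[i - 1]) > threshold]

# Synthetic demo: a 250 Hz tone followed by a 1250 Hz tone ("two speakers").
sr = 16000
def tone(freq, n):
    return [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]

audio = tone(250, 1024) + tone(1250, 1024)
print(detect_changes(audio))  # the tone switch is flagged at frame 4: [4]
```

A real diarizer would use richer features (e.g. MFCCs or speaker embeddings) and statistical change detection rather than a fixed threshold, but the frame-by-frame comparison structure is the same.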

Suggestions Library

The "Suggestions" folder contains an experimental solution to the artifacts produced by speech-to-text algorithms. These algorithms, while impressive, are not flawless, and even minor errors can significantly impede the flow of a conversation. To mitigate this, we implemented a secondary layer that scans the transcribed sentences for anomalies. When it identifies an 'unlikely' word or phrase, it flags it as such and offers a more probable alternative. This enhancement leverages the HerBERT language model. Unfortunately, this approach has a limitation: it is too slow for real-time use.

It can be tested with:

python src/suggestions/testing.py

or

./src/suggestions/testing.py

The script will prompt for input sentences, search them for improbable utterances, and suggest improvements.
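The flag-and-suggest pipeline described above can be sketched with a toy unigram vocabulary standing in for HerBERT's masked-language-model scores. Everything below — the vocabulary, the pseudo-frequencies, and the function names — is invented for illustration, not taken from the project's code.

```python
import difflib

# Toy vocabulary with pseudo-frequencies -- a crude stand-in for the
# language-model scores that HerBERT provides in the real pipeline.
VOCAB_FREQ = {
    "the": 1000, "cat": 50, "sat": 40, "on": 900, "mat": 30,
    "dog": 60, "ran": 45, "hat": 25,
}

def flag_unlikely(sentence, min_freq=10):
    """Return (word, suggestion) pairs for words deemed unlikely."""
    flags = []
    for word in sentence.lower().split():
        if VOCAB_FREQ.get(word, 0) >= min_freq:
            continue  # word is likely enough; leave it alone
        # Offer the closest in-vocabulary word as a more probable alternative.
        close = difflib.get_close_matches(word, list(VOCAB_FREQ), n=1)
        flags.append((word, close[0] if close else None))
    return flags

print(flag_unlikely("the cat sat on the mta"))  # [('mta', 'mat')]
```

A contextual model like HerBERT scores each word given its neighbors, which catches errors a frequency table cannot — at the cost of the inference latency noted above.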

Matlab Folder

The "Matlab" folder contains a suite of streamlined scripts designed specifically for the acquisition of sound samples across diverse settings. These scripts are integral to our broader project, which aims to analyze human speech in various environments and build a speech diarization tool. To this end, we generated a comprehensive range of auditory visualizations, including wave plots, spectrograms, and mel spectrograms. During the initial phases of our research and development, these visualizations functioned as foundational references, enhancing our understanding of the acoustics and patterns of human speech.
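For reference, the mel scale underlying those mel spectrograms can be computed directly. The sketch below uses the common HTK formula, mel = 2595 · log10(1 + f/700), and lays out the band edges of a small triangular mel filterbank; the function names and band count are illustrative, not the project's API.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: compresses high frequencies like human hearing."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min=0.0, f_max=8000.0):
    """Edge frequencies (Hz) for n_bands triangular filters, evenly
    spaced on the mel scale between f_min and f_max."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

print(round(hz_to_mel(1000.0), 1))        # 1000.0 by construction of the scale
print([round(f) for f in mel_band_edges(4)])
```

Note how the edges cluster toward low frequencies: equal mel spacing means finer resolution where human hearing is most sensitive, which is why mel spectrograms are preferred over plain spectrograms for speech analysis.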

Roadmap

  • Simple diarization model
  • Multithreaded backend design
  • Diarization model tuning and upgrades
  • Multi-language support
  • More detailed examples
  • Android application front-end

Authors

License

Distributed under the MIT License. See LICENSE.txt for more information.