The-Speako

Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and keyword extraction


🚀 About Project


Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to process human speech into a written format. The Speako is a Natural Language Processing project built on top of a stack of technologies to transcribe English audio files in any accent. It further lets the user generate an Urdu translation of the transcribed text, and it extracts keywords from that text.

Features:


  • 🤩 Lets users transcribe their .flac or .wav audio files.
  • 🥳 Translates the transcription into Urdu.
  • 😋 Extracts keywords from the transcribed text (a minimal end-to-end sketch follows).
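
A minimal end-to-end sketch of that chain, assuming the models named later in this README (the file name is a placeholder, and the ASR pipeline needs ffmpeg available to decode audio files):

```python
# End-to-end sketch: transcribe -> translate -> extract keywords.
from transformers import pipeline
from keybert import KeyBERT

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-large-960h-lv60-self")
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ur")
kw_model = KeyBERT()

text = asr("sample.wav")["text"]                     # speech-to-text
urdu = translator(text)[0]["translation_text"]       # Urdu translation
keywords = kw_model.extract_keywords(text, top_n=5)  # keyword extraction
```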

📂 Project Organization


```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```

📃 Transcription


Model Selection Steps:

We selected four candidate transcription models and evaluated them to pick the best performer (an inference sketch for one candidate follows the list).

  • facebook/wav2vec2-large-960h-lv60-self
  • facebook/wav2vec2-lv60-base
  • PyTorch Transformers model
  • DeepSpeech model by Mozilla
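
For illustration, a hedged inference sketch for the first candidate (the audio path is a placeholder; torchaudio is assumed for I/O):

```python
# Transcribe one clip with facebook/wav2vec2-large-960h-lv60-self.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "facebook/wav2vec2-large-960h-lv60-self"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

waveform, sr = torchaudio.load("sample.flac")
if sr != 16_000:                                   # the model expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze().numpy(),
                   sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)                 # greedy CTC decoding
print(processor.batch_decode(ids)[0])
```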

📈 Dataset:

  • Audio clips of different English accents were collected from several online resources.

🔩 Preprocessing Steps:

  • Resampled the audio files to 16 kHz
  • Removed distortion and background noise from the audio (both steps are sketched below)
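
A sketch of these two steps, assuming librosa for resampling and the noisereduce package for spectral-gating denoising; the project's actual tooling may differ:

```python
# Resample to 16 kHz and denoise one clip.
import librosa
import noisereduce as nr
import soundfile as sf

audio, sr = librosa.load("clip.wav", sr=16_000)  # resample to 16 kHz on load
clean = nr.reduce_noise(y=audio, sr=sr)          # spectral-gating noise reduction
sf.write("clip_16k_clean.wav", clean, sr)
```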

💭 Evaluation:

The following evaluation metrics were considered to select the best-working model (a jiwer-based sketch follows the list):

  • Word Error Rate for each model
  • Match Error Rate for each model
  • Word Information Loss for each model
  • Word Error Rate for each accent
  • Match Error Rate for each accent
  • Word Information Loss for each accent
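
All three metrics are implemented by the jiwer package; a minimal sketch with a toy reference/hypothesis pair (per-model and per-accent scores come from repeating this over the clips grouped accordingly):

```python
# WER, MER, and WIL for one reference/hypothesis pair.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))  # word error rate
print("MER:", jiwer.mer(reference, hypothesis))  # match error rate
print("WIL:", jiwer.wil(reference, hypothesis))  # word information lost
```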

All the evaluation results and metadata are logged in Neptune AI, as in the sketch below.
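
A hedged sketch of the logging step, using the neptune>=1.0 client; the project name and metric values are placeholders, not real results:

```python
# Log evaluation metrics to Neptune.
import neptune

run = neptune.init_run(project="workspace/speako")  # reads NEPTUNE_API_TOKEN from the environment
run["eval/model"] = "facebook/wav2vec2-large-960h-lv60-self"
run["eval/wer"] = 0.08   # placeholder values, not real results
run["eval/mer"] = 0.07
run["eval/wil"] = 0.11
run.stop()
```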

Fine-Tuning Steps:

📈 Dataset:

  • TIMIT: a corpus of phonemically and lexically transcribed speech of American English speakers of different genders and dialects.

🔩 Preprocessing Steps:

  • Removed irrelevant features from the dataset: phonetic_detail, word_detail, dialect_region, sentence_type, speaker_id
  • Removed punctuation characters such as `,` `?` `.` `!` `-` `;` `:` `"` from the transcripts
  • Resampled the audio files to 16 kHz (all three steps are mirrored in the sketch below)
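
A sketch of these steps with the Hugging Face datasets library, assuming the hub's timit_asr loader (TIMIT itself is licensed and must be obtained separately; the data_dir path is a placeholder):

```python
# Drop unused columns, strip punctuation, and cast audio to 16 kHz.
import re
from datasets import load_dataset, Audio

timit = load_dataset("timit_asr", data_dir="/path/to/TIMIT", split="train")
timit = timit.remove_columns(
    ["phonetic_detail", "word_detail", "dialect_region",
     "sentence_type", "speaker_id"])

chars_to_remove = r'[\,\?\.\!\-\;\:\"]'

def clean_text(example):
    example["text"] = re.sub(chars_to_remove, "", example["text"]).lower()
    return example

timit = timit.map(clean_text)
timit = timit.cast_column("audio", Audio(sampling_rate=16_000))
```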

💭 Evaluation:

  • Used WER (Word Error Rate), computed as in the jiwer sketch above

🎶 Training:
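
A hedged sketch of the fine-tuning loop in the spirit of the standard Wav2Vec2-on-TIMIT recipe with transformers' Trainer. Hyperparameters are illustrative, `timit_train`/`timit_test` are assumed to be the splits above already mapped through the processor into `input_values`/`labels`, and the CTC padding data collator the recipe needs is omitted for brevity:

```python
# Fine-tune Wav2Vec2 on TIMIT with a WER-based eval metric.
import numpy as np
import jiwer
from transformers import (Trainer, TrainingArguments,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_encoder()  # keep the convolutional feature extractor frozen

def compute_metrics(pred):
    # WER between greedy-decoded predictions and references.
    pred_ids = np.argmax(pred.predictions, axis=-1)
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    return {"wer": jiwer.wer(label_str, pred_str)}

args = TrainingArguments(
    output_dir="wav2vec2-timit",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=30,
    evaluation_strategy="steps",
    save_steps=500,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=timit_train,   # assumed: preprocessed split from above
    eval_dataset=timit_test,     # assumed: preprocessed split from above
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
trainer.train()
```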

🔤 Translation


Supported Language:

URDU

  • Translates the transcription into Urdu.
  • Model Used: Helsinki-NLP/opus-mt-en-ur (see the sketch below)
  • In the future, we plan to add a pipeline stage that preprocesses the text and generates translations directly
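
A sketch of the translation step using the Marian tokenizer/model API directly (the sample sentence is a placeholder):

```python
# Translate an English sentence to Urdu with opus-mt-en-ur.
from transformers import MarianMTModel, MarianTokenizer

model_id = "Helsinki-NLP/opus-mt-en-ur"
tokenizer = MarianTokenizer.from_pretrained(model_id)
model = MarianMTModel.from_pretrained(model_id)

text = "The weather is pleasant today."
batch = tokenizer([text], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```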

📌 Keyword Extraction


  • A text analysis feature that automatically extracts the most important words and phrases from a transcription. It helps summarize the content of texts and recognize the main topics discussed (see the sketch below).
  • Model Used: KeyBERT
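
A sketch of the extraction call; the n-gram range and top_n values are illustrative defaults, not the project's tuned settings:

```python
# Extract keywords and keyphrases with KeyBERT.
from keybert import KeyBERT

kw_model = KeyBERT()  # loads a sentence-transformers model under the hood
keywords = kw_model.extract_keywords(
    "The Speako transcribes English audio and translates it to Urdu.",
    keyphrase_ngram_range=(1, 2),  # single words and bigrams
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (keyword, relevance score) pairs
```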

🔮 User Interface


  • The UI of the project is built using Streamlit.
  • It provides a responsive GUI presentation of the project and its respective model results (a minimal sketch follows).
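
A minimal runnable sketch of such a front end; the real app.py may differ, and the models are the ones named above:

```python
# Streamlit front end: upload audio, show transcription and translation.
import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_models():
    asr = pipeline("automatic-speech-recognition",
                   model="facebook/wav2vec2-large-960h-lv60-self")
    mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ur")
    return asr, mt

st.title("The Speako")
uploaded = st.file_uploader("Upload a .wav or .flac file", type=["wav", "flac"])
if uploaded is not None:
    st.audio(uploaded)                      # playback widget
    asr, mt = load_models()
    text = asr(uploaded.read())["text"]     # the ASR pipeline accepts raw bytes
    st.subheader("Transcription")
    st.write(text)
    st.subheader("Urdu translation")
    st.write(mt(text)[0]["translation_text"])
```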

๐Ÿญ Project Pipelining


  • Inference pipeline using ZenML (sketched below)
  • Fine-tuning pipeline for English ASR with Transformers
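
A hedged sketch of the inference pipeline with ZenML's decorator API (assuming zenml >= 0.40); the step bodies are placeholders for the components shown above:

```python
# A three-step ZenML inference pipeline: load -> transcribe -> translate.
from zenml import pipeline, step

@step
def load_audio(path: str) -> str:
    return path                    # placeholder: load / validate the audio file

@step
def transcribe_step(path: str) -> str:
    return "transcribed text"      # placeholder: run Wav2Vec2 here

@step
def translate_step(text: str) -> str:
    return "urdu translation"      # placeholder: run opus-mt-en-ur here

@pipeline
def speako_inference_pipeline(path: str = "sample.wav"):
    audio = load_audio(path)
    text = transcribe_step(audio)
    translate_step(text)

if __name__ == "__main__":
    speako_inference_pipeline()    # runs the pipeline on the active stack
```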

๐Ÿก Developer Setup Guide


โฎ๏ธ Prerequisites

You can run the Docker image on your local system:

`docker pull taserx/speako:latest`

`docker run -p 8501:8501 taserx/speako:latest`

Then, inside the running container, install the audio backend:

`docker exec -it <container_name> bash`

`apt-get update && apt-get install libsndfile1`

To run the Python app directly:

`python app.py`

โš’๏ธ Built Upon


- Python
- facebook/wav2vec2-large-960h-lv60-self
- Helsinki-NLP/opus-mt-en-ur
- KeyBERT
- Streamlit
- Docker

🔧 Tools Used


- Visual Studio Code
- Google Colaboratory
- Google Drive Mount
- Neptune AI

👥 Collaborators


📋 License


This project is licensed under the MIT License - see the LICENSE file for details.