/Speech_to_text_data_collector

This repository contains a data engineering project: a speech-to-text data collection pipeline built with Kafka, Airflow, and Spark.

Primary language: Jupyter Notebook. License: MIT.

Speech-to-Text Data Collection

African language Speech Recognition - Speech-to-Text

Introduction

The purpose of this project is to build a data engineering pipeline that can record millions of Amharic and Swahili speakers reading digital texts on in-app and web platforms. For this project, the Amharic news text classification dataset with baseline performance is used. The aim is to produce a deployable tool that posts text and audio files to a data lake and receives them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

Pipeline

The diagram below shows the project pipeline, which will be used to record millions of Amharic and Swahili speakers reading digital texts on in-app and web platforms.
[Figure: Speech-to-text data collection pipeline]
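
As a rough illustration of the posting-and-receiving step, the sketch below shows one way a text prompt and the matching audio recording could be packaged as JSON payloads before being handed to a message broker such as Kafka. Everything in it is an assumption made for illustration: the topic names, the record fields, and the helper functions are hypothetical and are not taken from this repository. It uses only the Python standard library, so no broker is needed to try it.

```python
import base64
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Tuple

# Hypothetical topic names; the actual project may use different ones.
TEXT_TOPIC = "text-prompts"
AUDIO_TOPIC = "audio-submissions"


@dataclass
class TextPrompt:
    prompt_id: str
    language: str  # e.g. "am" for Amharic, "sw" for Swahili
    text: str


@dataclass
class AudioSubmission:
    prompt_id: str   # links the recording back to the text that was read
    speaker_id: str
    audio_b64: str   # raw audio bytes, base64-encoded so they fit in JSON


def make_prompt(text: str, language: str) -> TextPrompt:
    """Create a prompt with a fresh unique identifier."""
    return TextPrompt(prompt_id=str(uuid.uuid4()), language=language, text=text)


def make_submission(prompt: TextPrompt, speaker_id: str, audio: bytes) -> AudioSubmission:
    """Pair a speaker's recording with the prompt it answers."""
    encoded = base64.b64encode(audio).decode("ascii")
    return AudioSubmission(prompt.prompt_id, speaker_id, encoded)


def serialize(record) -> bytes:
    """Encode a record as UTF-8 JSON bytes, a typical message value format."""
    return json.dumps(asdict(record), ensure_ascii=False).encode("utf-8")


def deserialize_submission(payload: bytes) -> Tuple[AudioSubmission, bytes]:
    """Decode a submission payload and recover the original audio bytes."""
    submission = AudioSubmission(**json.loads(payload.decode("utf-8")))
    return submission, base64.b64decode(submission.audio_b64)
```

In a real deployment, the bytes returned by `serialize` would be the value sent by a Kafka producer to a topic such as the hypothetical `AUDIO_TOPIC`, and a consumer would call `deserialize_submission` before writing the audio to the data lake. Base64 is used here only to keep the example in plain JSON; a binary format may be a better fit for large audio files.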

Project Structure

The repository contains several kinds of files, including Python scripts, Jupyter notebooks, and text files.

Installation

git clone https://github.com/STT-Data-Engineering/Speech_to_text

Contributors

contributors list
