STT-data-collection

The purpose of this challenge is to build a data engineering pipeline that can record millions of Amharic speakers reading digital texts through in-app and web platforms.

Table of contents

Introduction
Installation
Folders
Technologies
Contributors

Introduction

There are many text corpora for Amharic and Swahili. Our client, 10 Academy, wants to gather a vast amount of quality audio data from different applications by displaying sentences from a text corpus and recording users as they read the displayed text. The goal is to build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file.
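The "post a sentence, receive an audio file" exchange described above can be sketched as two message formats plus a small producer helper. This is only an illustration under stated assumptions, not the project's actual implementation: the topic names, the JSON field names, and the helper functions are hypothetical, and the producer part assumes the kafka-python package and a broker reachable at localhost:9092.

```python
import base64
import json
import uuid
from typing import Tuple

# Hypothetical topic names; the project does not specify them.
TEXT_TOPIC = "text-corpus"
AUDIO_TOPIC = "audio-recordings"


def encode_sentence(sentence: str, language: str = "am") -> Tuple[str, bytes]:
    """Build the message that posts one sentence for a user to read."""
    sentence_id = str(uuid.uuid4())
    payload = {"id": sentence_id, "language": language, "text": sentence}
    # ensure_ascii=False keeps Amharic script readable inside the payload.
    return sentence_id, json.dumps(payload, ensure_ascii=False).encode("utf-8")


def encode_audio(sentence_id: str, audio: bytes) -> bytes:
    """Build the reply message that carries the recording for a sentence."""
    payload = {
        "id": sentence_id,
        # Raw audio bytes are base64-encoded so they fit in a JSON document.
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    }
    return json.dumps(payload).encode("utf-8")


def decode_audio(message: bytes) -> Tuple[str, bytes]:
    """Recover the sentence id and the raw audio bytes from a reply message."""
    payload = json.loads(message.decode("utf-8"))
    return payload["id"], base64.b64decode(payload["audio_b64"])


def post_sentence(sentence: str, bootstrap_servers: str = "localhost:9092") -> str:
    """Send one sentence to Kafka (assumes kafka-python and a running broker)."""
    # Imported lazily so the encoding helpers above work without Kafka installed.
    from kafka import KafkaProducer

    sentence_id, value = encode_sentence(sentence)
    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    # The sentence id is used as the key so a reply can be matched to its request.
    producer.send(TEXT_TOPIC, key=sentence_id.encode("utf-8"), value=value)
    producer.flush()
    producer.close()
    return sentence_id
```

A recording client would publish the output of encode_audio to the audio topic, and a downstream consumer would call decode_audio to pair each recording with its sentence. Kafka brokers cap message size by default (roughly 1 MB), so for longer recordings one possible design is to store the file elsewhere and send only a reference to it.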

Installation

Kafka installation
Airflow installation
Spark installation

Folders

data :
notebooks :
scripts :
tests :

Technologies

Apache Kafka :
Apache Airflow :
Apache Spark :

Contributors

Milky Bekele
Bethelhem Sisay
Natnael Sisay
Chimdessa Tesfaye
Harriet Sibitenda
Luel
Michael Tekle
Mizan Abaynew