A data engineering pipeline that allows recording millions of Amharic and Swahili speakers reading digital texts on app and web platforms.

Primary language: Jupyter Notebook

STT-data-collection

The purpose of this challenge is to build a data engineering pipeline that allows recording millions of Amharic speakers reading digital texts on in-app and web platforms.

Table of contents

Introduction

Many text corpora exist for Amharic and Swahili. Our client, 10 Academy, wants to gather a vast amount of quality audio data from different applications by displaying sentences from a text corpus and recording users as they read the displayed text. The goal is to build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file.
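The "post a sentence, receive an audio file" exchange could be sketched roughly as below. This is a minimal illustration and not the project's actual code: the topic names, the JSON message layout, and the helper functions are all assumptions made for this example. The `producer` argument is assumed to be any object with a `send(topic, key=..., value=...)` method, such as `KafkaProducer` from the `kafka-python` package.

```python
import base64
import json
import uuid

# Hypothetical topic names; the repository does not specify them.
TEXT_TOPIC = "text-corpus"
AUDIO_TOPIC = "audio-recordings"


def encode_sentence(sentence, language):
    """Serialize one sentence to display to a reader (assumed message layout)."""
    sentence_id = str(uuid.uuid4())
    payload = {"id": sentence_id, "language": language, "text": sentence}
    # ensure_ascii=False keeps Amharic script readable inside the JSON payload.
    return sentence_id, json.dumps(payload, ensure_ascii=False).encode("utf-8")


def encode_audio(sentence_id, audio_bytes):
    """Serialize a recording, linked back to the sentence that was read."""
    payload = {
        "sentence_id": sentence_id,
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return json.dumps(payload).encode("utf-8")


def decode_audio(message):
    """Inverse of encode_audio: return (sentence_id, raw audio bytes)."""
    payload = json.loads(message.decode("utf-8"))
    return payload["sentence_id"], base64.b64decode(payload["audio_b64"])


def post_sentence(producer, sentence, language):
    """Send a sentence to the text topic and return its generated id."""
    sentence_id, value = encode_sentence(sentence, language)
    # Keying by sentence id keeps all messages for one sentence on one partition.
    producer.send(TEXT_TOPIC, key=sentence_id.encode("utf-8"), value=value)
    return sentence_id


def post_audio(producer, sentence_id, audio_bytes):
    """Send a recording for a previously posted sentence to the audio topic."""
    producer.send(
        AUDIO_TOPIC,
        key=sentence_id.encode("utf-8"),
        value=encode_audio(sentence_id, audio_bytes),
    )
```

With `kafka-python` installed and a broker running, a producer could be created with `KafkaProducer(bootstrap_servers="localhost:9092")` (the address is an assumption) and passed to `post_sentence`. Kafka brokers limit message size (about 1 MB by default), so a production pipeline would likely store long recordings in object storage and send only a reference through Kafka.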

Installation

  • Kafka installation
  • Airflow installation
  • Spark installation

Folders

  • data :
  • notebooks :
  • scripts :
  • tests :

Technologies

Contributors