kafka-airflow-spark-pipeline

Text-to-speech data collection with Kafka, Airflow, Spark, and an S3 bucket.

Table of Contents

Project overview

This project designs and builds a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file. It also produces a deployable tool that posts and receives text and audio files to and from a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

[Figure: workflow]
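
As a rough illustration of the "post a sentence, receive an audio file" exchange described above, here is a minimal sketch of the records that might travel through the Kafka topics. The topic names, field names, and the example S3 URI are all assumptions made for illustration; the README does not specify them.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical topic names; the project README does not define them.
TEXT_TOPIC = "text-corpus"
AUDIO_TOPIC = "audio-submissions"


def _utc_now() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def build_sentence_message(sentence: str) -> dict:
    """Wrap a sentence in a record with an id so a recording can be matched to it."""
    return {
        "sentence_id": str(uuid.uuid4()),
        "text": sentence.strip(),
        "created_at": _utc_now(),
    }


def build_audio_message(sentence_id: str, audio_uri: str) -> dict:
    """Reference the recording by its data lake location instead of embedding bytes."""
    return {
        "sentence_id": sentence_id,
        "audio_uri": audio_uri,
        "received_at": _utc_now(),
    }


def serialize(message: dict) -> bytes:
    """Encode a record as UTF-8 JSON, keeping non-ASCII text unescaped."""
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Decode a UTF-8 JSON payload back into a record."""
    return json.loads(payload.decode("utf-8"))
```

With a client such as kafka-python, `serialize` could be passed as the `value_serializer` of a `KafkaProducer`, and records sent with `producer.send(TEXT_TOPIC, message)`. Keeping only a URI in the audio record is one possible design choice: it keeps Kafka messages small and leaves the audio itself in the data lake.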

Data

The project will use a number of large text corpora, but for testing the backend you can use the recently released Amharic news text classification dataset, which is published with baseline performance:

IsraelAbebe/An-Amharic-News-Text-classification-Dataset: An Amharic News Text classification Dataset (github.com)

Alternative data: ready-made Amharic data collected from different sources is available here.
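
Since the pipeline posts individual sentences while the test dataset consists of news articles, the text probably needs to be split into sentences first. The sketch below is one possible way to do that, assuming sentences end with the Ethiopic full stop "።" (U+1362), "!" or "?". The function name and the `min_chars` threshold are hypothetical and not part of the project.

```python
import re
from typing import List

# A run of non-terminator characters followed by any trailing terminators.
# Terminators: Ethiopic full stop "።" (U+1362), "!" and "?" (an assumption).
_SENTENCE = re.compile(r"[^።!?]+[።!?]*")


def split_sentences(article: str, min_chars: int = 2) -> List[str]:
    """Split article text into trimmed sentences, dropping very short fragments."""
    sentences = (match.strip() for match in _SENTENCE.findall(article))
    return [s for s in sentences if len(s) >= min_chars]
```

For example, `split_sentences(article_text)` returns a list of sentences that could each be posted as a separate message. Real corpora may need additional cleaning, such as removing URLs or handling abbreviations.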

Frontend

[Image: frontend]

Installation Guide

License

MIT

Contributors