KAFKA CLUSTERS

An ETL data pipeline that collects and extracts voice data, transforms it, and loads it into an S3 bucket using Kafka clusters, Airflow, and Spark, for a text-to-speech conversion project.

Project details

Table of contents

Introduction
Overview
Objective
Data
Requirements
Install
Using the application
Frontend
Backend
Screenshots
Notebooks
Scripts
Tests
Authors

Introduction

Data is everywhere. To get the best out of it, one needs to extract it from several sources, apply the required transformations, and load it into a data warehouse for further analysis and exploration. This is where ETL data pipelines come in.

ETL stands for Extract, Transform and Load. An ETL tool extracts data from different RDBMS source systems, real-time user interactions, and several other sorts of transactions. The extracted data is then transformed using transformations that are almost always specific to the goal of the project, such as applying calculations, concatenating, analyzing, or aggregating. Finally, the data is loaded into the data warehouse (DW) system in the form of dimension and fact tables, which serve as the basis on which business analysts, business intelligence officers, and machine learning teams can continue to work.
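The three stages can be illustrated with a minimal sketch. It is not part of this project's code: the sample rows, the table names `dim_category` and `fact_news`, and the in-memory SQLite database standing in for a data warehouse are all assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical raw records standing in for an extracted source.
RAW_ROWS = [
    {"headline": " Rain expected ", "category": "Weather", "views": "120"},
    {"headline": "Cup final tonight", "category": "sport", "views": "340"},
    {"headline": "League results", "category": "Sport", "views": "95"},
]


def extract():
    """Extract: return raw records (a real pipeline would query a source system)."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: trim text, normalise category case, cast views to int."""
    return [
        {
            "headline": row["headline"].strip(),
            "category": row["category"].strip().lower(),
            "views": int(row["views"]),
        }
        for row in rows
    ]


def load(rows, conn):
    """Load: write one dimension table (categories) and one fact table (news)."""
    conn.execute("CREATE TABLE dim_category (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.execute(
        "CREATE TABLE fact_news (headline TEXT, views INTEGER, "
        "category_id INTEGER REFERENCES dim_category(id))"
    )
    for row in rows:
        conn.execute("INSERT OR IGNORE INTO dim_category (name) VALUES (?)", (row["category"],))
        (category_id,) = conn.execute(
            "SELECT id FROM dim_category WHERE name = ?", (row["category"],)
        ).fetchone()
        conn.execute(
            "INSERT INTO fact_news VALUES (?, ?, ?)",
            (row["headline"], row["views"], category_id),
        )
    conn.commit()


def views_per_category(conn):
    """Example analysis query joining the fact table to its dimension."""
    query = (
        "SELECT c.name, SUM(f.views) FROM fact_news f "
        "JOIN dim_category c ON c.id = f.category_id GROUP BY c.name ORDER BY c.name"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load(transform(extract()), connection)
    print(views_per_category(connection))
```

Keeping each stage as its own function means any one of them can be replaced (for example, swapping the in-memory source for a real database query) without touching the others.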

Overview

Our client, 10 Academy, recognizes the value of large data sets for speech-to-text models and sees an opportunity in the many text corpora available for the Amharic and Swahili languages. Understanding also that complex data engineering skills are valuable to our profiles for employers, the client wants us to design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file.

A tool that can be deployed to post and receive text and audio files to and from a Kafka topic, apply transformations in a distributed manner, and load the results into an S3 bucket in a format suitable for training a speech-to-text model would do the required job.
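As an illustration of what a transformation step and a training-friendly output format could look like, here is a minimal sketch using only the Python standard library. It assumes recordings arrive as WAV files and that each one is described by a JSON line pairing the audio with its sentence. The field names (`audio_key`, `text`, `duration`) and the helper `make_silent_wav` are hypothetical, and the Spark and S3 wiring used by the project is not shown.

```python
import io
import json
import wave


def wav_duration_seconds(wav_bytes):
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def manifest_line(object_key, sentence, wav_bytes):
    """Build one JSON line describing a recording and its transcript."""
    record = {
        "audio_key": object_key,  # hypothetical: where the audio would be stored
        "text": sentence,
        "duration": round(wav_duration_seconds(wav_bytes), 3),
    }
    # ensure_ascii=False keeps Amharic text readable in the output.
    return json.dumps(record, ensure_ascii=False)


def make_silent_wav(seconds=1.0, rate=16000):
    """Build a mono 16-bit silent WAV, used here only as sample input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(manifest_line("audio/example.wav", "ሰላም", make_silent_wav(2.0)))
```

Recording the duration alongside the transcript makes it easy to filter out empty or implausibly long clips before training.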

Objective

The main objective of this week’s project is to build a data engineering pipeline that allows recording millions of Amharic and Swahili speakers reading digital texts in-app and on web platforms.

This can be achieved by building an end-to-end ETL data pipeline that uses Apache Kafka, Apache Spark, and Apache Airflow to receive user voice audio files, transform them, and load them into an S3 bucket that will later be used for a text-to-speech conversion machine learning project.

Users will be prompted with several different sentences and they will provide their corresponding audio by recording using the front-end user interface that is provided.
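A recording and its prompt sentence could travel through Kafka as a single message. The sketch below shows one possible encoding and how it might be published with kafka-python. The message layout, the topic name `audio-recordings`, and the broker address `localhost:9092` are assumptions for illustration, not this project's actual configuration.

```python
import base64
import json


def encode_recording(sentence_id, sentence, audio_bytes):
    """Serialise one recording plus its prompt text into a JSON message."""
    payload = {
        "sentence_id": sentence_id,
        "sentence": sentence,
        # Raw audio is binary, so it is base64-encoded to fit inside JSON.
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def decode_recording(message_bytes):
    """Inverse of encode_recording, as a consumer would apply it."""
    payload = json.loads(message_bytes.decode("utf-8"))
    payload["audio"] = base64.b64decode(payload.pop("audio_b64"))
    return payload


def publish_recording(message_bytes, key, topic="audio-recordings",
                      bootstrap_servers="localhost:9092"):
    """Send one message to Kafka; needs kafka-python and a reachable broker."""
    # Imported lazily so the helpers above work without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    producer.send(topic, key=key.encode("utf-8"), value=message_bytes)
    producer.flush()  # block until buffered messages have been sent
    producer.close()


if __name__ == "__main__":
    message = encode_recording("s-001", "ሰላም", b"\x00\x01\x02")
    print(decode_recording(message)["sentence_id"])
```

Note that Kafka's default message size limit is roughly 1 MB, so longer recordings would need either a raised limit or a design in which the audio is stored elsewhere and only a reference is sent through the topic.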

Data

The main data for this task is a text corpus drawn from news content. It consists mostly of news sentences written in the Amharic language.

This data is publicly available and can be found here as a CSV file.

You can also read a brief description of the data here.

The data set initially contained slightly over 51,400 records and has six features:

Headline: The headline of the news
Category: The category of the news
Date: The date the news was aired
Views: Total number of views of the news
Article: The main body of the news
Link: The link where the news was found
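To turn the articles into prompts that users can read aloud, the Article column has to be split into sentences. The following is a minimal sketch of that step, assuming the CSV header uses the feature names listed above and that sentences end with the Ethiopic full stop (።), `!`, or `?`. The function name `iter_sentences` and the `min_words` threshold are hypothetical choices.

```python
import csv
import io
import re

# Ethiopic full stop (U+1362) plus "!" and "?" as sentence terminators.
SENTENCE_END = re.compile(r"[።!?]+")


def iter_sentences(csv_file, column="Article", min_words=3):
    """Yield cleaned sentences from one text column of a CSV file object."""
    for row in csv.DictReader(csv_file):
        for part in SENTENCE_END.split(row.get(column) or ""):
            sentence = " ".join(part.split())  # collapse runs of whitespace
            if len(sentence.split()) >= min_words:  # skip very short fragments
                yield sentence


if __name__ == "__main__":
    # Made-up sample row; the real data is the news CSV described above.
    sample = io.StringIO(
        "Headline,Category,Date,Views,Article,Link\n"
        "h1,news,2020-01-01,10,አንድ ሁለት ሶስት አራት። አምስት። ስድስት ሰባት ስምንት።,http://example.com\n"
    )
    for sentence in iter_sentences(sample):
        print(sentence)
```

Working with a file object rather than a path keeps the function easy to test and lets the same code read from a local file or from a downloaded stream.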

Requirements

Pip

FastAPI

ZooKeeper

kafka-python

Apache Kafka

Apache Spark

React (Node.js)

Apache Airflow

Python 3.5 or above

Docker and Docker Compose

You can find the full list of requirements in the requirements.txt file.

Install

We highly recommend that you create a new virtual environment and install all the required modules and libraries in it.

Installing this application

You can install the application by running the following commands in the terminal:

git clone https://github.com/TenAcademy/Data-Engineering_text-to-speech_data-collection.git
cd Data-Engineering_text-to-speech_data-collection
pip install -r requirements.txt

Examples

Using this application

One can start using the application by first running the front and back ends.
You can run the front end by running the following commands in the terminal.
More detailed instructions for the front end can be found in the frontend/readme.md file.

cd frontend
npm run start

You can run the back end by running the following commands in the terminal:

cd api
uvicorn app:app --reload

Interacting with the front end

After running the front end, open a browser and go to http://localhost:3000, or click this link.
A page similar to this will appear.

Users will then click on the get text button to get a text.
Users will record themselves speaking the generated text out loud using the recording interface provided.
Finally, users will upload the recording by clicking on the upload button.

Frontend

The front end application can be found here in the frontend folder.

Backend

The back end application can be found here in the backend folder.

Screenshots

The detailed use and implementation of the pipelines using Apache Airflow, the pipeline summary and interaction, the Kafka clusters, interaction with the topics on the Kafka clusters, and front-end images and usage can all be found in the screenshots folder as image files.

Notebooks

All the notebooks that are used in this project including EDA, data cleaning and summarization are found here in the Notebooks folder.

Scripts

All the scripts and modules used in this project for interacting with Kafka, Airflow, Spark, and other frameworks, along with the default parameters and values used, are found here in the scripts folder.

Tests

All the unit and integration tests are found here in the tests folder.

Authors

👤 Birhanu Gebisa

Email, GitHub, LinkedIn

👤 Ekubazgi Gebremariam

Email, GitHub, LinkedIn

👤 Emtinan Salaheldin

Email, GitHub, LinkedIn

👤 Fisseha Estifanos

Email, GitHub, LinkedIn, Twitter

👤 Natnael Masresha

Email, GitHub, LinkedIn, Twitter

👤 Niyomukiza Thamar

Email, GitHub, LinkedIn

Show us your support

Give us a ⭐ if you like this project, and also feel free to contact us at any moment.
