Data pipeline - CSV to API
The aim of this project is to simulate reading large CSV files in chunks, processing them, and serving the results as an API.
Below are the tools used:
- Data frame processing: Dask, Pandas
- Data testing: Great Expectations
- File types: Parquet, SQLite, CSV
- API serving: FastAPI, Uvicorn
- API testing: pytest
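The chunked read-process-store flow described above can be sketched with the standard library alone. This is an illustrative sketch, not the project's actual code: the real pipeline uses Dask/Pandas, and the table schema and column names here are assumptions.

```python
import csv
import sqlite3

def process_csv_in_chunks(csv_path: str, db_path: str, chunk_size: int = 1000) -> int:
    """Read a CSV in fixed-size chunks and store the rows in SQLite.

    Mimics the CSV -> process -> store stage of the pipeline; the
    two-column schema (name, value) is a placeholder.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, value REAL)")
    total = 0
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            # Per-row "processing": cast the value column to float.
            chunk.append((row["name"], float(row["value"])))
            if len(chunk) >= chunk_size:
                conn.executemany("INSERT INTO records VALUES (?, ?)", chunk)
                total += len(chunk)
                chunk = []
        if chunk:  # flush the final partial chunk
            conn.executemany("INSERT INTO records VALUES (?, ?)", chunk)
            total += len(chunk)
    conn.commit()
    conn.close()
    return total
```

With Dask the chunking is handled for you (e.g. `dask.dataframe.read_csv` splits the file into partitions), but the shape of the work is the same: read a bounded piece, transform it, persist it.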
To run the project, execute sh run.sh
Further Improvements:
- Adding a data orchestrator
- Wrapping with a container
Input: the book_train.parquet folder, downloaded from here, should be placed in the 'data' folder.