Data pipeline - CSV to API
The aim of this project is to simulate reading large CSV files in chunks, processing them, and serving the results as an API.
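As a minimal sketch of the chunked-ingest step (the file path and column names here are illustrative assumptions, not the project's actual schema):

```python
import dask.dataframe as dd

# Read a large CSV lazily; Dask splits it into ~64 MB partitions
# instead of loading the whole file into memory at once.
ddf = dd.read_csv("data/input.csv", blocksize="64MB")

# Hypothetical processing step: a per-key aggregate computed out of core.
summary = ddf.groupby("stock_id")["price"].mean().compute()

# Persist the processed frame as Parquet for fast downstream reads.
ddf.to_parquet("data/processed.parquet", write_index=False)
```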

Below are the tools used:

Data frame processing: Dask, Pandas
Data testing: Great Expectations
File formats: Parquet, SQLite, CSV
API serving: FastAPI, Uvicorn
API testing: Pytest
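To illustrate how these pieces fit together, here is a sketch of a validation check using Great Expectations' classic Pandas API (newer releases use a different entry point; the column name is an assumption):

```python
import great_expectations as ge
import pandas as pd

# Wrap a Pandas frame so expectation methods become available (legacy v0.x API).
df = ge.from_pandas(pd.read_parquet("data/processed.parquet"))
result = df.expect_column_values_to_not_be_null("stock_id")
assert result.success, "validation failed: null stock_id values found"
```

And a minimal FastAPI endpoint served by Uvicorn (the route, module name, and SQLite schema are hypothetical, not the project's actual design):

```python
import sqlite3
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/stats/{stock_id}")
def get_stats(stock_id: int):
    # Hypothetical lookup against a SQLite file produced by the pipeline.
    conn = sqlite3.connect("data/pipeline.db")
    row = conn.execute(
        "SELECT avg_price FROM stats WHERE stock_id = ?", (stock_id,)
    ).fetchone()
    conn.close()
    if row is None:
        raise HTTPException(status_code=404, detail="stock_id not found")
    return {"stock_id": stock_id, "avg_price": row[0]}
```

Run locally with `uvicorn main:app --reload` (assuming the app lives in main.py). A corresponding Pytest check can exercise the route in-process via FastAPI's TestClient:

```python
from fastapi.testclient import TestClient
from main import app  # hypothetical module name

client = TestClient(app)

def test_stats_returns_json():
    response = client.get("/stats/1")
    assert response.status_code in (200, 404)
    assert response.headers["content-type"].startswith("application/json")
```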

To run the project, execute `sh run.sh`.

Further Improvements:

  • Adding a data orchestrator
  • Containerizing the project

The input is the book_train.parquet folder, downloaded from here and placed in the 'data' folder.
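Because book_train.parquet is a partitioned folder rather than a single file, it can be loaded lazily in one call (a sketch; no assumptions are made about its columns):

```python
import dask.dataframe as dd

# A partitioned Parquet folder loads as one lazy dataframe;
# each file inside it becomes a Dask partition.
book = dd.read_parquet("data/book_train.parquet")
print(book.npartitions, list(book.columns))
```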