Data pipeline - CSV to API
The aim of this project is to simulate reading large CSV files in chunks, processing them, and serving the results as an API.
Below are the tools used:
- Data frame processing: Dask, Pandas
- Data testing: Great Expectations
- File types: Parquet, SQLite, CSV
- API serving: FastAPI, Uvicorn
- API testing: pytest
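The chunked read-process-store flow described above can be sketched with the standard library alone. This is an illustrative sketch, not the project's actual code: the real pipeline uses Dask/Pandas, and the table schema and column names here are assumptions.

```python
import csv
import sqlite3

def process_csv_in_chunks(csv_path: str, db_path: str, chunk_size: int = 1000) -> int:
    """Read a CSV in fixed-size chunks and store the rows in SQLite.

    Mimics the CSV -> process -> store stage of the pipeline; the
    two-column schema (name, value) is a placeholder.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, value REAL)")
    total = 0
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            # Per-row "processing": cast the value column to float.
            chunk.append((row["name"], float(row["value"])))
            if len(chunk) >= chunk_size:
                conn.executemany("INSERT INTO records VALUES (?, ?)", chunk)
                total += len(chunk)
                chunk = []
        if chunk:  # flush the final partial chunk
            conn.executemany("INSERT INTO records VALUES (?, ?)", chunk)
            total += len(chunk)
    conn.commit()
    conn.close()
    return total
```

With Dask the chunking is handled for you (e.g. `dask.dataframe.read_csv` splits the file into partitions), but the shape of the work is the same: read a bounded piece, transform it, persist it.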
To run the project, execute sh run.sh
Further Improvements:
- Adding a data orchestrator
- Wrapping with a container
Input: the book_train.parquet folder, downloaded from here, should be placed in the 'data' folder.