This repository contains my solution to the "llama" challenge, hosted on Kaggle.
First, download the code (and its submodules):
git clone --recurse-submodules git@github.com:IamGianluca/...
For reproducibility, the repository includes the Docker image used to develop and test the application. The machine learning pipeline is defined in DVC, a version control system for machine learning projects.
Copy your personal kaggle.json file to the project's root directory. This file is used to authenticate with the Kaggle API and to download the competition data from inside the Docker container.
cp ~/.kaggle/kaggle.json .
Build the Docker image and start the Docker container.
make build && make start
Start an interactive bash shell in the container.
make attach
Reproduce the DVC pipeline.
dvc repro
Here is a brief description of what each folder contains:
ckpt: model checkpoints
data: input and pre-processed data
mtrc: metrics
nbs: notebooks for exploratory analyses
pipe: Python scripts for each step in the DVC pipeline
pred: predictions
blazingai: source code for the companion library
Other important files are:
dvc.yaml: lists the inputs, outputs, and parameters used by each DVC step
params.yaml: parameters used by the DVC steps
The companion library (blazingai) is installed in editable mode, which means you don't need to rebuild the Docker container every time you make a change to it.
When contributing to this repository, please label your commit messages using the following convention:
BUG: bug fixing
DEV: development environment (e.g., Docker, TensorBoard, system dependencies)
DOC: documentation
EDA: exploratory data analysis
ML: modeling, feature engineering
MAINT: maintenance (e.g., refactoring)
OPS: MLOps (e.g., download, unzip, pre- and post-process data)
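The labels above could be checked automatically, for example from a Git commit-msg hook. The following is a minimal sketch, not something this repository ships: check_commit_label is a hypothetical helper, and it assumes the label is followed by a colon and a space, a separator the convention above does not specify.

```shell
# Hypothetical helper, not part of this repository.
# Returns 0 if the commit subject starts with a known label followed by ": "
# (the ": " separator is an assumption) and some text; returns 1 otherwise.
check_commit_label() {
    case "$1" in
        "BUG: "?* | "DEV: "?* | "DOC: "?* | "EDA: "?* | "ML: "?* | "MAINT: "?* | "OPS: "?*)
            return 0
            ;;
        *)
            echo "Commit subject should start with one of: BUG, DEV, DOC, EDA, ML, MAINT, OPS" >&2
            return 1
            ;;
    esac
}
```

For instance, check_commit_label "DOC: update README" succeeds, while an unlabeled subject fails and prints a reminder to stderr.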