- Clone this repository
- Activate venv and run `pip install -r requirements.txt` to install dependencies
- Use `pip install 'dvc[gdrive]'` to install DVC with Google Drive remote support
- Use `dvc pull` and `dvc repro` to reproduce the pipeline (ask me for access to the data on the remote)
- Use `dvc metrics show` to get the metrics of the models
Data governance for ML (DVC)
- Details:
- Use the dataset from the previous task. Make the initial setup using the Data Version Control (DVC) tool.
- Define 2-3 pipelines that preprocess the data in different ways (basic cleaning, scaling, aggregations, etc.). Each pipeline should be reproducible using `dvc repro`.
- Use an existing solution for your dataset, run an experiment on the data using the development environment from the previous step, and save the metrics using `dvc metrics`.
- Criteria:
- Pipelines defined in a simple, reproducible manner
- Following DVC best practices
- Code style / code quality tools used
- There is an existing remote from which one could pull data (use free tier of AWS/GCP, Google Drive, or any other that would be easy to share)
- Materials:
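
To make the preprocessing and metrics requirements under Details more concrete, here is a minimal sketch of what one pipeline stage script might look like. It is not this repository's actual code: the function names (`clean`, `scale`, `write_metrics`), the toy rows, and the `metrics.json` file name are all assumptions chosen for illustration. It uses only the Python standard library.

```python
import json
import statistics
from pathlib import Path


def clean(rows):
    """Basic cleaning: drop every row that contains a missing (None) value."""
    return [row for row in rows if all(value is not None for value in row)]


def scale(rows):
    """Standard scaling: shift each column to mean 0 and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def write_metrics(metrics, path="metrics.json"):
    """Write a flat JSON file of the kind DVC can track as a metrics file."""
    Path(path).write_text(json.dumps(metrics, indent=2))


def run_stage(rows, metrics_path="metrics.json"):
    """Hypothetical stage: clean, scale, and record simple row counts."""
    cleaned = clean(rows)
    scaled = scale(cleaned)
    write_metrics({"rows_in": len(rows), "rows_out": len(scaled)}, metrics_path)
    return scaled
```

A script along these lines would typically be registered as a stage in `dvc.yaml`, with the JSON file listed under that stage's `metrics`, so that `dvc repro` reruns it when its inputs change and `dvc metrics show` displays the recorded values. In a real pipeline the row counts would be replaced by the model's evaluation metrics.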