[Versioning]: Explore Kedro + DVC for versioning
Description
At the current stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one can retrieve the full project state, including data, at any point in time.
The goal is to check whether we can use DVC to map a single version number to code, parameters, and I/O data within Kedro, and how it aligns with Kedro's workflow.
As a result, we expect a working example of a Kedro project used with DVC for versioning, and some observations on:
- whether it solves the main task and what the constraints are;
- how easy it is to set up;
- what the workflow looks like;
- whether any changes are required on the Kedro side;
- what data formats are supported;
- how easy it is to work with local/remote storage;
- how demanding it is in terms of dependencies.
Context
Relates to: #2691 (comment)
If I am not wrong, what we are asking for here is support in Kedro for something similar to:

    dvc stage add -n train -d train.py -d data -o model.weights.h5

Ref: https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial#automating-capturing
Kedro + DVC integration
Repo link: https://github.com/ankatiyar/space-dvc
I tried to integrate DVC into my Spaceflights Kedro project to check the extent of the versioning capabilities. The steps I followed are:
- Initialise the Kedro project as a git repository: `git init`
- Initialise the Kedro project as a DVC repository: `dvc init`
Versioning data with `.dvc` files
Suppose I have a dataset in my project:

    companies:
      type: pandas.CSVDataset
      filepath: data/01_raw/companies.csv

- `dvc add data/01_raw/companies.csv` -> this generates the `companies.csv.dvc` file, which can be committed to git. Note: You would need to update the `.gitignore` file provided by Kedro to make this work.
- `git commit -m "First commit"`
- `kedro run`
  - Intermediate and output datasets are generated
- `dvc add <paths of datasets that were generated>`
- `git add .`
- `git commit -s -m "First run"`
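The manual `dvc add <paths of datasets that were generated>` step could be scripted. The following is a minimal sketch, assuming that generated datasets live under `data/` and that a file counts as untracked when it has no sibling `.dvc` pointer file. The helper names are hypothetical and are not part of Kedro or DVC.

```python
from pathlib import Path
from typing import List


def untracked_data_files(data_dir: Path) -> List[Path]:
    """Return files under data_dir that have no sibling `<name>.dvc` file."""
    files = []
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file():
            continue
        # Skip DVC pointer files and housekeeping files (assumed names).
        if path.suffix == ".dvc" or path.name in {".gitignore", ".gitkeep"}:
            continue
        pointer = path.with_name(path.name + ".dvc")
        if not pointer.exists():
            files.append(path)
    return files


def dvc_add_command(paths: List[Path]) -> List[str]:
    """Build the argv for one `dvc add` call covering all given paths."""
    return ["dvc", "add"] + [p.as_posix() for p in paths]
```

The returned list can be passed to `subprocess.run(..., check=True)` after a `kedro run`. This only applies to the `.dvc`-file workflow; outputs declared in a `dvc.yaml` stage are tracked differently.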
Scenario 1: I want to go back to a previous version of the data
- Check which git commit I want to go back to
- `git checkout HEAD~2 data/01_raw/companies.csv.dvc`
- `dvc checkout`
Scenario 2: I want to store my data remotely
- `dvc remote add <name> <url>`: https://dvc.org/doc/user-guide/data-management/remote-storage
- `kedro run`: generates new data
- `git add .` and `git commit -m "Update"`
- `dvc push`
Scenario 3: I want to go back to a previous version of the data, stored remotely
- `git checkout <>`
- `dvc checkout`
- `dvc pull` (downloads the data from the remote when it is not already in the local cache)
Versioning with DVC Data Pipelines
The previous approach provides enough functionality to version datasets, but there are still the following limitations to consider:
- Users must add all intermediate and output datasets to DVC manually
- Parameter and code changes are not explicitly tracked
- Artefacts and metrics cannot be tracked
Define Kedro pipelines as DVC stages
In `dvc.yaml`, we can define Kedro pipelines as follows:

    stages:
      data_processing:
        cmd: kedro run --pipeline data_processing
        deps:
          - data/01_raw/companies.csv
          - data/01_raw/reviews.csv
          - data/01_raw/shuttles.xlsx
        outs:
          - data/02_intermediate/preprocessed_companies.pq
          - data/02_intermediate/preprocessed_shuttles.pq
          - data/03_primary/model_input_table.pq
      data_science:
        cmd: kedro run --pipeline data_science
        deps:
          - data/03_primary/model_input_table.pq
        outs:
          - data/06_models/regressor.pickle

Run the pipeline with `dvc repro`.
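Since each stage simply wraps `kedro run --pipeline <name>`, a file with this shape could in principle be generated rather than written by hand. The sketch below is an illustration only: `render_dvc_yaml` is a hypothetical helper, it assumes paths contain no characters that need YAML quoting, and it covers only the `cmd`, `deps`, and `outs` keys used above.

```python
from typing import Dict, List


def render_dvc_yaml(stages: Dict[str, Dict[str, List[str]]]) -> str:
    """Render a minimal dvc.yaml with one `kedro run --pipeline` stage per entry."""
    lines = ["stages:"]
    for name, spec in stages.items():
        lines.append(f"  {name}:")
        lines.append(f"    cmd: kedro run --pipeline {name}")
        for key in ("deps", "outs"):
            paths = spec.get(key, [])
            if paths:
                lines.append(f"    {key}:")
                lines.extend(f"      - {path}" for path in paths)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dvc_yaml({
        "data_science": {
            "deps": ["data/03_primary/model_input_table.pq"],
            "outs": ["data/06_models/regressor.pickle"],
        }
    }))
```

How the `deps` and `outs` lists would be derived from the Kedro catalog is left open here.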
Scenario: One of the datasets is updated
- `dvc repro` will rerun only the stages (here, Kedro pipelines) whose dependencies or outputs have changed
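DVC makes this decision by comparing content hashes of each stage's dependencies and outputs with the hashes recorded in `dvc.lock`. The snippet below is a simplified illustration of that idea, not DVC's actual implementation; the function names and the `recorded` mapping (standing in for `dvc.lock`) are assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_deps(deps: List[str], recorded: Dict[str, str]) -> List[str]:
    """List the dependencies whose current hash differs from the recorded one."""
    return [dep for dep in deps if recorded.get(dep) != file_md5(Path(dep))]
```

A stage would be considered stale, and therefore rerun, when `changed_deps` returns a non-empty list for it.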
Scenario: I want to track code changes

Add code files to the `deps` in `dvc.yaml`:

    stages:
      data_processing:
        cmd: kedro run --pipeline data_processing
        deps:
          - data/01_raw/companies.csv
          - data/01_raw/reviews.csv
          - data/01_raw/shuttles.xlsx
          - src/space_dvc/pipelines/data_processing/nodes.py
          - src/space_dvc/pipelines/data_processing/pipeline.py
        outs:
          - data/02_intermediate/preprocessed_companies.pq
          - data/02_intermediate/preprocessed_shuttles.pq
          - data/03_primary/model_input_table.pq

Then:
- `dvc repro`
- `dvc push`
Scenario: I want to track parameters

Add parameters under `params` in `dvc.yaml`:

    data_science:
      cmd: kedro run --pipeline data_science
      deps:
        - data/03_primary/model_input_table.pq
        - src/space_dvc/pipelines/data_science/nodes.py
        - src/space_dvc/pipelines/data_science/pipeline.py
      params:
        - conf/base/parameters_data_science.yml:
            - model_options
      outs:
        - data/06_models/regressor.pickle

Then:
- `dvc repro`
- `dvc push`
Scenario: I want to run the experiment with different parameters
- Change a parameter in `parameters.yml`
- `dvc repro`: This will only run the `data_science` pipeline
- `dvc params diff`: This will show you the comparison of parameters between runs
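To make the idea of a parameter comparison concrete, here is a small sketch that flattens two nested parameter dictionaries and reports the keys whose values differ. It only illustrates the concept; it is not how `dvc params diff` is implemented, and the function names and sample values are made up for the example.

```python
from typing import Any, Dict, Tuple


def flatten(params: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. model_options.test_size."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def params_diff(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_key: (old_value, new_value)} for every changed key."""
    old_flat, new_flat = flatten(old), flatten(new)
    keys = sorted(old_flat.keys() | new_flat.keys())
    return {
        key: (old_flat.get(key), new_flat.get(key))
        for key in keys
        if old_flat.get(key) != new_flat.get(key)
    }
```

For instance, changing a hypothetical `model_options.test_size` from 0.2 to 0.3 would yield a single entry for that key, with all unchanged parameters omitted.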
Experiment tracking with DVCLive
Model output can be versioned with `outs` in the data pipelines through the YAML, and parameters can be versioned with `params`. DVC offers DVCLive, which is similar to MLflow, to track experiments.
Metrics and plots can be logged alongside all of the above following example code from their tutorial: https://github.com/iterative/example-get-started/blob/main/src/evaluate.py
This logging is not straightforward at the moment but could be simplified with a plugin like `kedro-mlflow` for DVCLive.
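As a rough idea of what such logging could look like inside a Kedro node, the sketch below wraps DVCLive's `log_params` and `log_metric` calls in a small helper. The helper name `log_run` and the parameter and metric names are hypothetical; the helper accepts any object exposing those two methods, so it can be exercised without DVCLive installed.

```python
from typing import Any, Dict


def log_run(live: Any, params: Dict[str, Any], metrics: Dict[str, float]) -> None:
    """Log parameters once and each metric by name on a DVCLive-style object."""
    live.log_params(params)
    for name, value in metrics.items():
        live.log_metric(name, value)


# Intended usage with DVCLive (requires `pip install dvclive`):
#
#     from dvclive import Live
#
#     with Live() as live:
#         log_run(live, {"test_size": 0.2}, {"r2_score": score})
```

Where exactly this call would sit (a node, a hook, or a dedicated plugin) is an open design question in this thread.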
This is cool! Would it make sense to automate some of these commands with pre/post hooks? Also, wdyt about some commands to generate/sync the `dvc.yaml`?
It's a good idea to automate the major steps so that DVC command calls are hidden from users and they can work with Kedro as usual. We were thinking of doing this via plugins, so that the DVC dependency is not mandatory and one can switch between different versioning solutions (DVC, Iceberg, etc.). In our case, plugins can simply extend the basic hooks, as you suggest.
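A hook-based plugin along these lines might look roughly like the sketch below. It is an assumption-laden illustration, not an existing plugin: the class name `DVCAddHook`, its constructor arguments, and the choice to call `dvc add` after each run are all hypothetical. Only `hook_impl` and the `after_pipeline_run` hook name come from Kedro; the fallback decorator exists solely so the sketch runs without Kedro installed.

```python
import subprocess
from typing import Any, Callable, Dict, List, Optional, Sequence

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch can run without Kedro
    def hook_impl(func):
        return func


class DVCAddHook:
    """Run `dvc add` on a fixed list of paths after every pipeline run."""

    def __init__(
        self,
        tracked_paths: Sequence[str],
        runner: Optional[Callable[[List[str]], Any]] = None,
    ) -> None:
        self._tracked_paths = list(tracked_paths)
        # The default runner shells out to the DVC CLI; tests can inject a fake.
        self._runner = runner or (lambda argv: subprocess.run(argv, check=True))

    @hook_impl
    def after_pipeline_run(self, run_params: Dict[str, Any]) -> None:
        if self._tracked_paths:
            self._runner(["dvc", "add"] + self._tracked_paths)
```

In a Kedro project, such a hook would be registered through `HOOKS` in `settings.py`, for example `HOOKS = (DVCAddHook(["data/06_models/regressor.pickle"]),)`. Committing the resulting `.dvc` files to git would still be a separate step.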
To answer some of the questions mentioned in the ticket description:
> whether it solves the main task and what the constraints are;

DVC does solve most of the versioning-related issues arising from the research. With the current state of things, it's possible to:
- Version datasets and revert to previous versions
- Version code and parameters alongside datasets, and therefore reproduce previous runs
- Do basic experiment tracking, i.e. log plots and metrics with DVCLive. This support could be enhanced in the future with a plugin similar to `kedro-mlflow`.
Challenges:
- The user is responsible for maintaining the versioning data by committing any changes to VCS.
- If the user is using data pipelines, they need to be familiar with DVC commands because, at this point, you run DVC CLI commands like `dvc repro` and `dvc pull/push`, and not `kedro run`.
  - Any arguments to `kedro run` will need to be updated with the `dvc stage` command or in the YAML file directly.
- Since interaction with remote storage is handled by DVC, we're restricted to the platforms it supports: link
> how easy it is to set up;

Pretty easy and their documentation is great!

> what the workflow looks like;

Explained in the comment above.

> whether any changes are required on the Kedro side;

Not really.

> what data formats are supported;

DVC is agnostic to data formats, so all of them are supported.

> how easy it is to work with local/remote storage;

Fairly simple; however, as I mentioned, we would likely be constrained by the remote storage types supported by DVC.

> how demanding it is in terms of dependencies.
Dependency tree for DVC
    dvc==3.56.0
    ├── attrs [required: >=22.2.0, installed: 24.2.0]
    ├── celery [required: Any, installed: 5.4.0]
    │   ├── billiard [required: >=4.2.0,<5.0, installed: 4.2.1]
    │   ├── click [required: >=8.1.2,<9.0, installed: 8.1.7]
    │   ├── click-didyoumean [required: >=0.3.0, installed: 0.3.1]
    │   │   └── click [required: >=7, installed: 8.1.7]
    │   ├── click-plugins [required: >=1.1.1, installed: 1.1.1]
    │   │   └── click [required: >=4.0, installed: 8.1.7]
    │   ├── click-repl [required: >=0.2.0, installed: 0.3.0]
    │   │   ├── click [required: >=7.0, installed: 8.1.7]
    │   │   └── prompt_toolkit [required: >=3.0.36, installed: 3.0.48]
    │   │       └── wcwidth [required: Any, installed: 0.2.13]
    │   ├── kombu [required: >=5.3.4,<6.0, installed: 5.4.2]
    │   │   ├── amqp [required: >=5.1.1,<6.0.0, installed: 5.2.0]
    │   │   │   └── vine [required: >=5.0.0,<6.0.0, installed: 5.1.0]
    │   │   ├── tzdata [required: Any, installed: 2024.2]
    │   │   └── vine [required: ==5.1.0, installed: 5.1.0]
    │   ├── python-dateutil [required: >=2.8.2, installed: 2.9.0.post0]
    │   │   └── six [required: >=1.5, installed: 1.16.0]
    │   ├── tzdata [required: >=2022.7, installed: 2024.2]
    │   └── vine [required: >=5.1.0,<6.0, installed: 5.1.0]
    ├── colorama [required: >=0.3.9, installed: 0.4.6]
    ├── configobj [required: >=5.0.6, installed: 5.0.9]
    ├── distro [required: >=1.3, installed: 1.9.0]
    ├── dpath [required: >=2.1.0,<3, installed: 2.2.0]
    ├── dulwich [required: Any, installed: 0.22.3]
    │   └── urllib3 [required: >=1.25, installed: 2.2.3]
    ├── dvc-data [required: >=3.16.2,<3.17, installed: 3.16.6]
    │   ├── attrs [required: >=21.3.0, installed: 24.2.0]
    │   ├── dictdiffer [required: >=0.8.1, installed: 0.9.0]
    │   ├── diskcache [required: >=5.2.1, installed: 5.6.3]
    │   ├── dvc-objects [required: >=4.0.1,<6, installed: 5.1.0]
    │   │   ├── fsspec [required: >=2024.2.0, installed: 2024.10.0]
    │   │   └── funcy [required: >=1.14, installed: 2.0]
    │   ├── fsspec [required: >=2024.2.0, installed: 2024.10.0]
    │   ├── funcy [required: >=1.14, installed: 2.0]
    │   ├── orjson [required: >=3,<4, installed: 3.10.9]
    │   ├── pygtrie [required: >=2.3.2, installed: 2.5.0]
    │   ├── sqltrie [required: >=0.11.0,<1, installed: 0.11.1]
    │   │   ├── attrs [required: >=22.2.0, installed: 24.2.0]
    │   │   ├── orjson [required: Any, installed: 3.10.9]
    │   │   └── pygtrie [required: Any, installed: 2.5.0]
    │   └── tqdm [required: >=4.63.1,<5, installed: 4.66.5]
    ├── dvc-http [required: >=2.29.0, installed: 2.32.0]
    │   ├── aiohttp-retry [required: >=2.5.0, installed: 2.8.3]
    │   │   └── aiohttp [required: Any, installed: 3.10.10]
    │   │       ├── aiohappyeyeballs [required: >=2.3.0, installed: 2.4.3]
    │   │       ├── aiosignal [required: >=1.1.2, installed: 1.3.1]
    │   │       │   └── frozenlist [required: >=1.1.0, installed: 1.4.1]
    │   │       ├── attrs [required: >=17.3.0, installed: 24.2.0]
    │   │       ├── frozenlist [required: >=1.1.1, installed: 1.4.1]
    │   │       ├── multidict [required: >=4.5,<7.0, installed: 6.1.0]
    │   │       └── yarl [required: >=1.12.0,<2.0, installed: 1.16.0]
    │   │           ├── idna [required: >=2.0, installed: 3.10]
    │   │           ├── multidict [required: >=4.0, installed: 6.1.0]
    │   │           └── propcache [required: >=0.2.0, installed: 0.2.0]
    │   └── fsspec [required: Any, installed: 2024.10.0]
    ├── dvc-objects [required: Any, installed: 5.1.0]
    │   ├── fsspec [required: >=2024.2.0, installed: 2024.10.0]
    │   └── funcy [required: >=1.14, installed: 2.0]
    ├── dvc-render [required: >=1.0.1,<2, installed: 1.0.2]
    ├── dvc-studio-client [required: >=0.21,<1, installed: 0.21.0]
    │   ├── dulwich [required: Any, installed: 0.22.3]
    │   │   └── urllib3 [required: >=1.25, installed: 2.2.3]
    │   ├── requests [required: Any, installed: 2.32.3]
    │   │   ├── certifi [required: >=2017.4.17, installed: 2024.8.30]
    │   │   ├── charset-normalizer [required: >=2,<4, installed: 3.4.0]
    │   │   ├── idna [required: >=2.5,<4, installed: 3.10]
    │   │   └── urllib3 [required: >=1.21.1,<3, installed: 2.2.3]
    │   └── voluptuous [required: Any, installed: 0.15.2]
    ├── dvc-task [required: >=0.3.0,<1, installed: 0.40.2]
    │   ├── celery [required: >=5.3.0,<6, installed: 5.4.0]
    │   │   ├── billiard [required: >=4.2.0,<5.0, installed: 4.2.1]
    │   │   ├── click [required: >=8.1.2,<9.0, installed: 8.1.7]
    │   │   ├── click-didyoumean [required: >=0.3.0, installed: 0.3.1]
    │   │   │   └── click [required: >=7, installed: 8.1.7]
    │   │   ├── click-plugins [required: >=1.1.1, installed: 1.1.1]
    │   │   │   └── click [required: >=4.0, installed: 8.1.7]
    │   │   ├── click-repl [required: >=0.2.0, installed: 0.3.0]
    │   │   │   ├── click [required: >=7.0, installed: 8.1.7]
    │   │   │   └── prompt_toolkit [required: >=3.0.36, installed: 3.0.48]
    │   │   │       └── wcwidth [required: Any, installed: 0.2.13]
    │   │   ├── kombu [required: >=5.3.4,<6.0, installed: 5.4.2]
    │   │   │   ├── amqp [required: >=5.1.1,<6.0.0, installed: 5.2.0]
    │   │   │   │   └── vine [required: >=5.0.0,<6.0.0, installed: 5.1.0]
    │   │   │   ├── tzdata [required: Any, installed: 2024.2]
    │   │   │   └── vine [required: ==5.1.0, installed: 5.1.0]
    │   │   ├── python-dateutil [required: >=2.8.2, installed: 2.9.0.post0]
    │   │   │   └── six [required: >=1.5, installed: 1.16.0]
    │   │   ├── tzdata [required: >=2022.7, installed: 2024.2]
    │   │   └── vine [required: >=5.1.0,<6.0, installed: 5.1.0]
    │   ├── funcy [required: >=1.17, installed: 2.0]
    │   ├── kombu [required: >=5.3.0,<6, installed: 5.4.2]
    │   │   ├── amqp [required: >=5.1.1,<6.0.0, installed: 5.2.0]
    │   │   │   └── vine [required: >=5.0.0,<6.0.0, installed: 5.1.0]
    │   │   ├── tzdata [required: Any, installed: 2024.2]
    │   │   └── vine [required: ==5.1.0, installed: 5.1.0]
    │   └── shortuuid [required: >=1.0.8, installed: 1.0.13]
    ├── flatten-dict [required: >=0.4.1,<1, installed: 0.4.2]
    │   └── six [required: >=1.12,<2.0, installed: 1.16.0]
    ├── flufl.lock [required: >=8.1.0,<9, installed: 8.1.0]
    │   ├── atpublic [required: Any, installed: 5.0]
    │   └── psutil [required: Any, installed: 6.1.0]
    ├── fsspec [required: >=2024.2.0, installed: 2024.10.0]
    ├── funcy [required: >=1.14, installed: 2.0]
    ├── grandalf [required: >=0.7,<1, installed: 0.8]
    │   └── pyparsing [required: Any, installed: 3.2.0]
    ├── gto [required: >=1.6.0,<2, installed: 1.7.1]
    │   ├── entrypoints [required: Any, installed: 0.4]
    │   ├── funcy [required: Any, installed: 2.0]
    │   ├── pydantic [required: >=1.9.0,<3,!=2.0.0, installed: 2.9.2]
    │   │   ├── annotated-types [required: >=0.6.0, installed: 0.7.0]
    │   │   ├── pydantic_core [required: ==2.23.4, installed: 2.23.4]
    │   │   │   └── typing_extensions [required: >=4.6.0,!=4.7.0, installed: 4.12.2]
    │   │   └── typing_extensions [required: >=4.6.1, installed: 4.12.2]
    │   ├── rich [required: Any, installed: 13.9.2]
    │   │   ├── markdown-it-py [required: >=2.2.0, installed: 3.0.0]
    │   │   │   └── mdurl [required: ~=0.1, installed: 0.1.2]
    │   │   └── Pygments [required: >=2.13.0,<3.0.0, installed: 2.18.0]
    │   ├── ruamel.yaml [required: Any, installed: 0.18.6]
    │   │   └── ruamel.yaml.clib [required: >=0.2.7, installed: 0.2.12]
    │   ├── scmrepo [required: >=3,<4, installed: 3.3.8]
    │   │   ├── aiohttp-retry [required: >=2.5.0, installed: 2.8.3]
    │   │   │   └── aiohttp [required: Any, installed: 3.10.10]
    │   │   │       ├── aiohappyeyeballs [required: >=2.3.0, installed: 2.4.3]
    │   │   │       ├── aiosignal [required: >=1.1.2, installed: 1.3.1]
    │   │   │       │   └── frozenlist [required: >=1.1.0, installed: 1.4.1]
    │   │   │       ├── attrs [required: >=17.3.0, installed: 24.2.0]
    │   │   │       ├── frozenlist [required: >=1.1.1, installed: 1.4.1]
    │   │   │       ├── multidict [required: >=4.5,<7.0, installed: 6.1.0]
    │   │   │       └── yarl [required: >=1.12.0,<2.0, installed: 1.16.0]
    │   │   │           ├── idna [required: >=2.0, installed: 3.10]
    │   │   │           ├── multidict [required: >=4.0, installed: 6.1.0]
    │   │   │           └── propcache [required: >=0.2.0, installed: 0.2.0]
    │   │   ├── asyncssh [required: >=2.13.1,<3, installed: 2.17.0]
    │   │   │   ├── cryptography [required: >=39.0, installed: 43.0.3]
    │   │   │   │   └── cffi [required: >=1.12, installed: 1.17.1]
    │   │   │   │       └── pycparser [required: Any, installed: 2.22]
    │   │   │   └── typing_extensions [required: >=4.0.0, installed: 4.12.2]
    │   │   ├── dulwich [required: >=0.22.1, installed: 0.22.3]
    │   │   │   └── urllib3 [required: >=1.25, installed: 2.2.3]
    │   │   ├── fsspec [required: >=2024.2.0, installed: 2024.10.0]
    │   │   ├── funcy [required: >=1.14, installed: 2.0]
    │   │   ├── GitPython [required: >3, installed: 3.1.43]
    │   │   │   └── gitdb [required: >=4.0.1,<5, installed: 4.0.11]
    │   │   │       └── smmap [required: >=3.0.1,<6, installed: 5.0.1]
    │   │   ├── pathspec [required: >=0.9.0, installed: 0.12.1]
    │   │   ├── pygit2 [required: >=1.14.0, installed: 1.16.0]
    │   │   │   └── cffi [required: >=1.17.0, installed: 1.17.1]
    │   │   │       └── pycparser [required: Any, installed: 2.22]
    │   │   ├── pygtrie [required: >=2.3.2, installed: 2.5.0]
    │   │   └── tqdm [required: Any, installed: 4.66.5]
    │   ├── semver [required: >=2.13.0, installed: 3.0.2]
    │   ├── tabulate [required: >=0.8.10, installed: 0.9.0]
    │   └── typer [required: >=0.4.1, installed: 0.12.5]
    │       ├── click [required: >=8.0.0, installed: 8.1.7]
    │       ├── rich [required: >=10.11.0, installed: 13.9.2]
    │       │   ├── markdown-it-py [required: >=2.2.0, installed: 3.0.0]
    │       │   │   └── mdurl [required: ~=0.1, installed: 0.1.2]
    │       │   └── Pygments [required: >=2.13.0,<3.0.0, installed: 2.18.0]
    │       ├── shellingham [required: >=1.3.0, installed: 1.5.4]
    │       └── typing_extensions [required: >=3.7.4.3, installed: 4.12.2]
    ├── hydra-core [required: >=1.1, installed: 1.3.2]
    │   ├── antlr4-python3-runtime [required: ==4.9.*, installed: 4.9.3]
    │   ├── omegaconf [required: >=2.2,<2.4, installed: 2.3.0]
    │   │   ├── antlr4-python3-runtime [required: ==4.9.*, installed: 4.9.3]
    │   │   └── PyYAML [required: >=5.1.0, installed: 6.0.2]
    │   └── packaging [required: Any, installed: 24.1]
    ├── iterative-telemetry [required: >=0.0.7, installed: 0.0.9]
    │   ├── appdirs [required: Any, installed: 1.4.4]
    │   ├── distro [required: Any, installed: 1.9.0]
    │   ├── filelock [required: Any, installed: 3.16.1]
    │   └── requests [required: Any, installed: 2.32.3]
    │       ├── certifi [required: >=2017.4.17, installed: 2024.8.30]
    │       ├── charset-normalizer [required: >=2,<4, installed: 3.4.0]
    │       ├── idna [required: >=2.5,<4, installed: 3.10]
    │       └── urllib3 [required: >=1.21.1,<3, installed: 2.2.3]
    ├── kombu [required: Any, installed: 5.4.2]
    │   ├── amqp [required: >=5.1.1,<6.0.0, installed: 5.2.0]
    │   │   └── vine [required: >=5.0.0,<6.0.0, installed: 5.1.0]
    │   ├── tzdata [required: Any, installed: 2024.2]
    │   └── vine [required: ==5.1.0, installed: 5.1.0]
    ├── networkx [required: >=2.5, installed: 3.4.2]
    ├── omegaconf [required: Any, installed: 2.3.0]
    │   ├── antlr4-python3-runtime [required: ==4.9.*, installed: 4.9.3]
    │   └── PyYAML [required: >=5.1.0, installed: 6.0.2]
    ├── packaging [required: >=19, installed: 24.1]
    ├── pathspec [required: >=0.10.3, installed: 0.12.1]
    ├── platformdirs [required: >=3.1.1,<4, installed: 3.11.0]
    ├── psutil [required: >=5.8, installed: 6.1.0]
    ├── pydot [required: >=1.2.4, installed: 3.0.2]
    │   └── pyparsing [required: >=3.0.9, installed: 3.2.0]
    ├── pygtrie [required: >=2.3.2, installed: 2.5.0]
    ├── pyparsing [required: >=2.4.7, installed: 3.2.0]
    ├── requests [required: >=2.22, installed: 2.32.3]
    │   ├── certifi [required: >=2017.4.17, installed: 2024.8.30]
    │   ├── charset-normalizer [required: >=2,<4, installed: 3.4.0]
    │   ├── idna [required: >=2.5,<4, installed: 3.10]
    │   └── urllib3 [required: >=1.21.1,<3, installed: 2.2.3]
    ├── rich [required: >=12, installed: 13.9.2]
    │   ├── markdown-it-py [required: >=2.2.0, installed: 3.0.0]
    │   │   └── mdurl [required: ~=0.1, installed: 0.1.2]
    │   └── Pygments [required: >=2.13.0,<3.0.0, installed: 2.18.0]
    ├── ruamel.yaml [required: >=0.17.11, installed: 0.18.6]
    │   └── ruamel.yaml.clib [required: >=0.2.7, installed: 0.2.12]
    ├── scmrepo [required: >=3.3.8,<4, installed: 3.3.8]
    │   ├── aiohttp-retry [required: >=2.5.0, installed: 2.8.3]
    │   │   └── aiohttp [required: Any, installed: 3.10.10]
    │   │       ├── aiohappyeyeballs [required: >=2.3.0, installed: 2.4.3]
    │   │       ├── aiosignal [required: >=1.1.2, installed: 1.3.1]
    │   │       │   └── frozenlist [required: >=1.1.0, installed: 1.4.1]
    │   │       ├── attrs [required: >=17.3.0, installed: 24.2.0]
    │   │       ├── frozenlist [required: >=1.1.1, installed: 1.4.1]
    │   │       ├── multidict [required: >=4.5,<7.0, installed: 6.1.0]
    │   │       └── yarl [required: >=1.12.0,<2.0, installed: 1.16.0]
    │   │           ├── idna [required: >=2.0, installed: 3.10]
    │   │           ├── multidict [required: >=4.0, installed: 6.1.0]
    │   │           └── propcache [required: >=0.2.0, installed: 0.2.0]
    │   ├── asyncssh [required: >=2.13.1,<3, installed: 2.17.0]
    │   │   ├── cryptography [required: >=39.0, installed: 43.0.3]
    │   │   │   └── cffi [required: >=1.12, installed: 1.17.1]
    │   │   │       └── pycparser [required: Any, installed: 2.22]
    │   │   └── typing_extensions [required: >=4.0.0, installed: 4.12.2]
    │   ├── dulwich [required: >=0.22.1, installed: 0.22.3]
    │   │   └── urllib3 [required: >=1.25, installed: 2.2.3]
    │   ├── fsspec [required: >=2024.2.0, installed: 2024.10.0]
    │   ├── funcy [required: >=1.14, installed: 2.0]
    │   ├── GitPython [required: >3, installed: 3.1.43]
    │   │   └── gitdb [required: >=4.0.1,<5, installed: 4.0.11]
    │   │       └── smmap [required: >=3.0.1,<6, installed: 5.0.1]
    │   ├── pathspec [required: >=0.9.0, installed: 0.12.1]
    │   ├── pygit2 [required: >=1.14.0, installed: 1.16.0]
    │   │   └── cffi [required: >=1.17.0, installed: 1.17.1]
    │   │       └── pycparser [required: Any, installed: 2.22]
    │   ├── pygtrie [required: >=2.3.2, installed: 2.5.0]
    │   └── tqdm [required: Any, installed: 4.66.5]
    ├── shortuuid [required: >=0.5, installed: 1.0.13]
    ├── shtab [required: >=1.3.4,<2, installed: 1.7.1]
    ├── tabulate [required: >=0.8.7, installed: 0.9.0]
    ├── tomlkit [required: >=0.11.1, installed: 0.13.2]
    ├── tqdm [required: >=4.63.1,<5, installed: 4.66.5]
    ├── voluptuous [required: >=0.11.7, installed: 0.15.2]
    └── zc.lockfile [required: >=1.2.1, installed: 3.0.post1]
        └── setuptools [required: Any, installed: 75.1.0]