a kedro hook to protect against breaking changes to data
steel-toes is a kedro hook designed to prevent stepping on your teammates' toes. It will branch your data automatically based on your git branch, or manually by passing the branch name into the hook.
see docs
kedro is a ✨ fantastic project that allows for super-fast prototyping of
data pipelines, while yielding production-ready pipelines. kedro promotes
collaborative projects by giving each team member access to the exact same
data. Team members will often make their own branch of the project and begin
work. Sometimes these changes will break existing functionality. Sometimes we
make mistakes as we develop and fix them before merging. Either case can be
detrimental to a teammate working downstream of your changes if you are not careful.
steel-toes hooks into your catalog to keep you from changing your teammates'
downstream data while you develop in parallel.
When your project creates a catalog, steel-toes checks whether branched
data exists. If it does, it swaps the filepath to the branched path, so you
can load the latest data from the perspective of any branch
simultaneously.
After your node has run, and before saving, steel-toes checks whether your
filepath was swapped. If not, it swaps it to the branched filepath
before saving.
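The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the steel-toes source: `ToyDataset`, `on_catalog_created`, `before_save`, and the `exists` callback are hypothetical names, and the branched path is passed in rather than computed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a catalog entry backed by a file."""
    filepath: str            # path currently used for load and save
    branched_filepath: str   # path reserved for the current branch
    swapped: bool = False


def swap(dataset: ToyDataset) -> None:
    """Point the dataset at its branched path."""
    dataset.filepath = dataset.branched_filepath
    dataset.swapped = True


def on_catalog_created(
    datasets: Iterable[ToyDataset], exists: Callable[[str], bool]
) -> None:
    """Swap only the datasets whose branched file already exists."""
    for dataset in datasets:
        if exists(dataset.branched_filepath):
            swap(dataset)


def before_save(dataset: ToyDataset) -> None:
    """Swap just before saving if it has not happened yet."""
    if not dataset.swapped:
        swap(dataset)


# In this toy run only the shuttles dataset already has branched data.
on_disk = {"data/shuttles_main.pq"}
shuttles = ToyDataset("data/shuttles.pq", "data/shuttles_main.pq")
companies = ToyDataset("data/companies.pq", "data/companies_main.pq")

on_catalog_created([shuttles, companies], exists=on_disk.__contains__)
before_save(companies)  # companies is swapped here, at save time
```

In this sketch a dataset with existing branched data is redirected as soon as the catalog exists, while a dataset without it keeps loading from the shared path until the first save.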
steel-toes is published on PyPI and can be installed with pip.
pip install steel-toes
For a real kedro project, you should add it to your requirements.
To add SteelToes to your kedro>0.18.0 project, add an instance of the
SteelToes hook to your tuple of hooks in src/<project_name>/settings.py.
# settings.py
from steel_toes import SteelToes
HOOKS = (SteelToes(),)
Some datasets have a _filepath attribute that is not used for saving
data and does not need to be "branched". These datasets should be ignored by
steel_toes, for example SQLQueryDataSet.
# settings.py
from kedro.extras.datasets.pandas.sql_dataset import SQLQueryDataSet, SQLTableDataSet
from steel_toes import SteelToes
HOOKS = (SteelToes(ignore_types=[SQLQueryDataSet, SQLTableDataSet]),)
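As a rough illustration of what ignoring by type could mean, the sketch below filters datasets with an `isinstance` check. The classes `FileDataset` and `QueryDataset` and the helper `should_branch` are hypothetical, and this is an assumption about the behavior, not the actual steel-toes implementation.

```python
from typing import Iterable


class FileDataset:
    """Hypothetical dataset whose _filepath is where data is saved."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath


class QueryDataset:
    """Hypothetical dataset whose _filepath is an input, not a save target."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath


def should_branch(dataset: object, ignore_types: Iterable[type] = ()) -> bool:
    """Return True if the dataset has a _filepath and its type is not ignored."""
    if isinstance(dataset, tuple(ignore_types)):
        return False
    return hasattr(dataset, "_filepath")


candidates = [FileDataset("data/X_test.pq"), QueryDataset("queries/shuttles.sql")]
to_branch = [d for d in candidates if should_branch(d, [QueryDataset])]
```

Here only the `FileDataset` instance remains in `to_branch`, even though both objects carry a `_filepath` attribute.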
steel_toes automatically gets the branch name from your git branch.
In certain situations, such as using kedro docker in production, there is no
git branch to pull from. Setting the STEEL_TOES_BRANCH environment variable
before steel-toes initializes will set the branch.
STEEL_TOES_BRANCH='PROD'
import os
os.environ["STEEL_TOES_BRANCH"] = "PROD"
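The lookup order described above (environment variable first, git second) might look something like the sketch below. `resolve_branch` is a hypothetical helper, and the use of `git rev-parse --abbrev-ref HEAD` is an assumption about how the branch could be read, not a description of the steel-toes internals.

```python
import os
import subprocess
from typing import Optional


def resolve_branch(cwd: Optional[str] = None) -> Optional[str]:
    """Return the branch from STEEL_TOES_BRANCH, else from git, else None."""
    override = os.environ.get("STEEL_TOES_BRANCH")
    if override:
        return override
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or cwd is not inside a repository.
        return None
    return result.stdout.strip() or None


# With the override set, git is never consulted.
os.environ["STEEL_TOES_BRANCH"] = "PROD"
branch = resolve_branch()
```

Because the override is checked first, this works in a container that has no `.git` directory at all.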
Here is an example of what filepaths look like when I add parquet catalog
entries to the spaceflights project. steel_toes adds the branch name
automatically just before the file extension.
STEEL-TOES |6 DATASETS PROETECTED
X_test: /home/waylon/git/spaceflights/data/X_test_main.pq
X_train: /home/waylon/git/spaceflights/data/X_train_main.pq
preprocessed_companies: /home/waylon/git/spaceflights/data/02_intermediate/preprocessed_companies_main.pq
preprocessed_shuttles: /home/waylon/git/spaceflights/data/02_intermediate/preprocessed_shuttles_main.pq
model_input_table: /home/waylon/git/spaceflights/data/03_primary/model_input_table_main.pq
regressor: /home/waylon/git/spaceflights/data/06_models/regressor_main.pickle
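The naming pattern in the listing above can be reproduced with a few lines of `pathlib`. `branched_path` is a hypothetical helper written for illustration; it only mirrors the pattern visible in the output and is not taken from the steel-toes source.

```python
from pathlib import PurePosixPath


def branched_path(filepath: str, branch: str) -> str:
    """Insert `_<branch>` just before the final file extension."""
    path = PurePosixPath(filepath)
    return str(path.with_name(f"{path.stem}_{branch}{path.suffix}"))


shuttles = branched_path("data/02_intermediate/preprocessed_shuttles.pq", "main")
regressor = branched_path("data/06_models/regressor.pickle", "main")
```

Note that this sketch would fail for a branch name containing `/` (for example `feature/x`), since `with_name` rejects path separators; such names would need sanitizing first, and how steel-toes handles them is not covered here.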
When you first run your pipeline with steel-toes, it performs the _filepath
swap in after_node_run, since the swapped file does not yet exist.
At this point catalog.load('preprocessed_shuttles') will not load the branched dataset.
❯ kedro run
INFO Kedro project spaceflights session.py:340
...
INFO STEEL_TOES:after_node_run 'preprocessed_shuttles.pq' -> 'preprocessed_shuttles_main.pq' steel_toes.py:102
...
INFO Completed 6 out of 6 tasks sequential_runner.py:85
INFO Pipeline execution completed successfully. runner.py:90
Subsequent runs of kedro will swap the dataset to the branched filepath immediately after the catalog has been created.
Now catalog.load('preprocessed_shuttles') will load the branched dataset.
INFO Kedro project spaceflights session.py:340
...
INFO STEEL_TOES:after_catalog_created 'preprocessed_shuttles.pq' -> 'preprocessed_shuttles_main.pq' steel_toes.py:102
...
INFO Completed 6 out of 6 tasks sequential_runner.py:85
INFO Pipeline execution completed successfully. runner.py:90
### CLI Usage
The CLI provides a handy interface to clean up your branched datasets.
```bash
$ steel-toes --help
Usage: steel-toes [OPTIONS] COMMAND [ARGS]...
help
Options:
-V, --version Prints version and exits
--help Show this message and exit.
Commands:
clean-branch finds branch datasets and removes them
```
steel-toes also registers itself as a kedro global CLI plugin. You can run
kedro clean-branch to clean your branched data.
$ kedro clean-branch --help
Usage: kedro clean-branch [OPTIONS]
finds branch datasets and removes them
Options:
--dryrun Displays the files that would be deleted using
the specified command without actually deleting
them.
-b, --branch TEXT git branch to clean files from
-h, --help Show this message and exit.
To clean up your current branch, run kedro clean-branch, which removes all
the datasets that have been swapped to the current branch. Adding --dryrun
only logs what steel-toes intends to do and does not delete anything.
❯ kedro clean-branch --dryrun
INFO STEEL_TOES:after_catalog_created 'preprocessed_shuttles.pq' -> 'preprocessed_shuttles_main.pq' steel_toes.py:102
...
INFO STEEL_TOES:dryrun-remove | '/home/waylon/git/spaceflights/data/02_intermediate/preprocessed_shuttles_main.pq' steel_toes.py:141
Dropping the --dryrun flag will delete all the branched datasets.
❯ kedro clean-branch
INFO STEEL_TOES:after_catalog_created 'preprocessed_shuttles.pq' -> 'preprocessed_shuttles_main.pq' steel_toes.py:102
...
INFO STEEL_TOES:deleting | '/home/waylon/git/spaceflights/data/02_intermediate/preprocessed_shuttles_main.pq' steel_toes.py:141
You can disable steel-toes by setting the STEEL_TOES_ENABLED environment
variable to False. This might be useful for debugging inside an environment
where you cannot easily make a code change.
Mac/Linux
export STEEL_TOES_ENABLED=False
Windows
set STEEL_TOES_ENABLED=False
You're awesome for considering a contribution! Contributions are welcome; please check out the Contributing Guide for more information. Please be a positive member of the community and embrace feedback.
We use SemVer for versioning. For the versions available, see the tags on this repository.
- Waylon Walker - Original Author
This project is licensed under the MIT License - see the LICENSE file for details.