A template creation tool for Machine Learning and Data Science projects.
🇷🇺 A Russian-language version of this README is also available.
- Install Sphinx for automatic documentation support.
Follow this link for the installation instructions. The preferred way to install it is via pip3: `pip3 install -U sphinx`.
- Execute the following commands in a terminal:
sudo -i
git clone https://github.com/EnlightenedCSF/Ocean.git
cd <cloned repo>
pip install --upgrade .
Creating a new project:
ocean project new -n "<project_name>" \ # ! must be provided !
-a "<author>" \ # default is `Surf`
-v "<version>" \ # default is `0.0.1`
-d "<description>" \ # default is ``
-l "<licence>" \ # default is `MIT`
-p "<path>" # default is `.`
Install the project code as a package:
make -B package
Creating a new experiment in the project:
ocean exp new -n "<exp_name>" \ # ! must be provided !
-a "<author>" # ! must be provided !
The project is based on the cookiecutter-data-science template, but modifies it. Before reading further, I highly recommend following the given link and taking a look, because many of the key points listed there are important.
Let's see how the original cookiecutter is structured:
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
It can be improved right away:
- we added a `make docs` command for automatic generation of Sphinx documentation based on the docstrings of the whole `src` module;
- we added a convenient file logger (and a `logs` folder, respectively);
- we added a coordinator entity for easy navigation throughout the project, removing the need to write `os.path.join`, `os.path.abspath` or `os.path.dirname` every time.
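To illustrate the idea behind a coordinator, here is a minimal sketch. The class name `Coordinator`, its properties and the file name `train.csv` are assumptions made for this example and are not necessarily the API that Ocean generates; the point is only that paths are resolved once, relative to the project root.

```python
from pathlib import Path


class Coordinator:
    """Hypothetical helper that resolves project paths from a single root."""

    def __init__(self, root):
        # Resolve once, so every derived path is absolute.
        self.root = Path(root).resolve()

    def path(self, *parts):
        """Return an absolute path below the project root."""
        return self.root.joinpath(*parts)

    @property
    def data_raw(self):
        return self.path("data", "raw")

    @property
    def data_processed(self):
        return self.path("data", "processed")

    def experiment(self, name):
        """Return the folder of a single experiment."""
        return self.path("experiments", name)


# Usage: no os.path.join / os.path.abspath calls in the calling code.
coordinator = Coordinator(".")
train_csv = coordinator.data_raw / "train.csv"  # illustrative file name
```

With a helper like this, notebooks and scripts can refer to `coordinator.data_raw` regardless of the directory they are started from.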
But what problems are there?
- The `data` folder can grow significantly, but which script or notebook generated each file remains a mystery. The number of different files stored there can be misleading. It is also unclear whether any of them is useful for implementing a new feature, because there is no place for descriptions and explanations.
- The `data` folder lacks a `features` subfolder, which would be useful: one could store calculated statistics, embeddings and other features there. There is a nice write-up about this, which I strongly recommend.
- The `src` folder is another problem. It contains both functionality that is relevant project-wide (like the `src.data` submodule) and functionality relevant to concrete and often small sub-tasks (like `src.models`).
- The `references` folder exists, but it is an open question who has to put records there, when, and how. And there is a lot to explain during the development process: which experiments have been done, what the results were, and what we are doing next.
To solve the listed problems, I introduce the experiment entity.
An experiment is a place that contains all the data relevant to checking some hypothesis.
This includes:
- What data was used
- What data (or artefacts) was produced
- Code version
- Timestamps of the beginning and end of the experiment
- Source file
- Parameters
- Metrics
- Logs
Many of these things can be logged via tracking utilities like mlflow, but that is not enough. We can improve our workflow.
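The items listed above can be pictured as one record per experiment. The sketch below is only an illustration: the `ExperimentRecord` class, its field names and all example values (paths, date, parameter) are hypothetical placeholders and not part of Ocean.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class ExperimentRecord:
    """Hypothetical container for the facts an experiment should capture."""

    name: str
    source_file: str
    code_version: str  # e.g. a git commit hash
    started_at: datetime
    finished_at: Optional[datetime] = None
    data_used: List[str] = field(default_factory=list)
    data_produced: List[str] = field(default_factory=list)
    parameters: Dict[str, object] = field(default_factory=dict)
    metrics: Dict[str, float] = field(default_factory=dict)
    log_file: str = "log.md"

    def to_markdown(self) -> str:
        """Render the record as a short Markdown entry."""
        finished = self.finished_at.isoformat() if self.finished_at else "running"
        lines = [
            f"## {self.name}",
            f"- Source file: {self.source_file}",
            f"- Code version: {self.code_version}",
            f"- Started: {self.started_at.isoformat()}",
            f"- Finished: {finished}",
        ]
        lines += [f"- Data used: {path}" for path in self.data_used]
        lines += [f"- Data produced: {path}" for path in self.data_produced]
        lines += [f"- Parameter {key}: {value}" for key, value in self.parameters.items()]
        lines += [f"- Metric {key}: {value}" for key, value in self.metrics.items()]
        lines.append(f"- Logs: {self.log_file}")
        return "\n".join(lines)


# All values below are placeholders for illustration only.
record = ExperimentRecord(
    name="exp-001-Tree-models",
    source_file="scripts/train.py",
    code_version="<git commit hash>",
    started_at=datetime(2020, 1, 1, 12, 0),
    data_used=["data/raw/train.csv"],
    data_produced=["models/model.pkl"],
    parameters={"max_depth": 3},
)
print(record.to_markdown())
```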
This is what an example experiment looks like:
<project_root>
    └── experiments
        ├── exp-001-Tree-models
        │   ├── config            <- yaml-files with grid search parameters or just model parameters
        │   ├── models            <- dumped models
        │   ├── notebooks         <- notebooks for research
        │   ├── scripts           <- scripts like train.py or predict.py
        │   ├── Makefile          <- for running the experiment with just a few words in the console
        │   ├── requirements.txt  <- dependent libraries
        │   └── log.md            <- logs of how the experiment is going
        │
        └── exp-002-Gradient-boosting
            ...
Let's take a look at the workflow for one experiment.
- Notebooks are created in which the data is prepared for a model and the model's structure is introduced.
- Once the code is ready, it is moved to `train.py`.
  - You might track model parameters from there (for instance, with `mlflow`).
  - Create a relevant `config` file for the training configuration.
  - The code should be callable from the console.
  - It could take paths to the data, the `config` file, and the directory to dump the model to.
- Then the Makefile is modified to start the training process from the console. Provide a command like `make train`.
- Many models are trained, and all the metrics and parameters are sent to `mlflow`. One can use `mlflow ui` to check the results.
- Finally, all results are recorded in `log.md`. It has some impact analysis elements: the developer needs to point out what data was used and what data was generated. This clarification can be used to automatically generate a readme file for the `data` folder and to clarify which file is used where.
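The `train.py` step of the workflow above could look roughly like the sketch below. It is a minimal illustration under several assumptions: the flag names `--data`, `--config` and `--model-dir` are invented for this example, the config is read as JSON to keep the sketch within the standard library (the experiment layout mentions yaml-files, which would need a YAML parser), and `train` is a placeholder for the real training code moved out of the notebooks.

```python
import argparse
import json
import pickle
from pathlib import Path


def parse_args(argv=None):
    """Parse the three paths the script needs (hypothetical flag names)."""
    parser = argparse.ArgumentParser(description="Train a model for one experiment.")
    parser.add_argument("--data", type=Path, required=True, help="path to the training data")
    parser.add_argument("--config", type=Path, required=True, help="path to the config file")
    parser.add_argument("--model-dir", type=Path, required=True, help="directory to dump the model to")
    return parser.parse_args(argv)


def train(data_path, params):
    """Placeholder: the real training code from the notebooks goes here."""
    return {"params": params, "trained_on": str(data_path)}


def main(argv=None):
    args = parse_args(argv)
    # Assumption: JSON config; swap in a YAML loader for yaml-files.
    params = json.loads(args.config.read_text())
    model = train(args.data, params)
    args.model_dir.mkdir(parents=True, exist_ok=True)
    model_path = args.model_dir / "model.pkl"
    with model_path.open("wb") as f:
        pickle.dump(model, f)
    return model_path


# In a real train.py, finish the file with:
#     if __name__ == "__main__":
#         main()
```

A `make train` target in the experiment's Makefile would then only need to call the script with the three paths, so the whole run starts from a single console command.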