`lynference`

flowchart TD
node1["data/2021-clb-oropharynx.csv.dvc"]
node2["data/2021-usz-oropharynx.csv.dvc"]
node3["data/2023-clb-multisite.csv.dvc"]
node4["data/2023-isb-multisite.csv.dvc"]
node5["clean"]
node6["enhance"]
node7["evaluate"]
node8["filter"]
node9["join"]
node10["plot-corner"]
node11["sampling"]
node12["remote"]
node1-->node9
node1-->node12
node2-->node9
node2-->node12
node3-->node9
node3-->node12
node4-->node9
node4-->node12
node5-->node7
node5-->node11
node5-->node12
node6-->node8
node6-->node12
node7-->node12
node8-->node5
node8-->node12
node9-->node6
node9-->node12
node10-->node12
node11-->node7
node11-->node10
node11-->node12

We are researchers in the field of medical physics and want to predict how cancer in the head & neck region spreads through the lymphatic system.

lynference is a repository that allows anyone to obtain and/or reproduce the results of our research. We will use it to expose future publications to more scientific scrutiny and to enable anyone to truly build upon what we have done, instead of just writing a brief methods section and calling it a day.

In the following, we explain how it all works and what you have to do to obtain or reproduce our results.

⚙️ Requirements

⚠️ NOTE:
We highly recommend using a virtual environment for anything that comes below. Feel free to use any tool you are comfortable with. We use venv, and you can use these commands to get your virtual environment started:
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip setuptools wheel
Using conda (and assuming it is installed on your machine), the same process would look like this:
conda create -n yourenv python=3.8 pip setuptools wheel
conda activate yourenv

All requirements can be installed using pip and the requirements.txt file at the root of this repository:

pip install -r requirements.txt

This will install these three packages and their dependencies:

DVC, which is a tool that allows the versioning of arbitrary data and pipelines, while keeping git uncluttered.
lymph-model: The implementation of our mathematical model on lymphatic spread [1]
lyscripts: A command line interface (CLI) to perform the various steps of the pipeline.

Note that all pipelines defined here also depend on raw data, which is stored in a repository called lyDATA. There, we also explain how the data was extracted and what was recorded in it.

⬇️ Download Data

If you just want to download the data our pipelines have produced, you should be able to do that with only DVC installed by executing the dvc get command. Let's say you wanted to download the drawn samples.hdf5 inside the 📁 models directory of this repo at commit 123456. This is how to do it:

dvc get https://github.com/rmnldwg/lynference --rev 123456 models/samples.hdf5

⚠️ NOTE:
This does not work with the revisions bilateral-v1, midline-with-mixing-v1 or midline-without-mixing-v1. They predate the way we now set up our remote DVC storage. However, for those revisions, the respective DVC remote storage is attached to the linked releases as a .zip file.
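If you want to fetch results from a script rather than by hand, the dvc get call above can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names `build_dvc_get_command` and `download` are made up for this illustration, `123456` is the same placeholder revision as above, and it assumes the `dvc` executable is on your PATH.

```python
import shlex
import subprocess

REPO_URL = "https://github.com/rmnldwg/lynference"


def build_dvc_get_command(path, rev, repo_url=REPO_URL, out=None):
    """Assemble the argument list for `dvc get` without executing anything."""
    command = ["dvc", "get", repo_url, "--rev", rev]
    if out is not None:
        # `-o` is dvc get's option for choosing the local destination.
        command += ["-o", out]
    command.append(path)
    return command


def download(path, rev, **kwargs):
    """Run `dvc get`; raises CalledProcessError if DVC reports a failure."""
    command = build_dvc_get_command(path, rev, **kwargs)
    subprocess.run(command, check=True)


# Show the command that would be run for the example above.
# Call download("models/samples.hdf5", rev="123456") to actually fetch the file.
print(shlex.join(build_dvc_get_command("models/samples.hdf5", rev="123456")))
```

Keeping the command construction separate from its execution makes it easy to inspect what would be run before any download starts.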

♻️ Reproduce a Pipeline

DVC makes pipelines persistent using pipeline files (like the dvc.yaml at the root and the one inside the pipeline directory) that detail how DVC should execute various commands and how they depend on each other. After a successful run of a pipeline, DVC stores the MD5 hashes of all produced files in the dvc.lock file. This allows us to store the data - which may be binary and/or very large - elsewhere, while DVC still knows how to find it.
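To get a feeling for what such a hash is, the sketch below computes the MD5 digest of a single file and compares it to a value you would copy by hand from dvc.lock. It is an illustration of the idea only, not DVC's implementation: DVC's own bookkeeping has more to it (directories, for example, are handled differently), and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_md5):
    """Check a local file against a hash copied by hand from dvc.lock."""
    return Path(path).is_file() and md5_of_file(path) == recorded_md5
```

Because the digest depends only on the file's content, two files with the same hash can be treated as the same data, no matter where they are stored.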

To reproduce a pipeline, follow these steps:

1. Clone Repository

Clone this repository, enter it and checkout the revision of the pipeline you're interested in. Usually, this would be the name of a tag:

git clone https://github.com/rmnldwg/lynference.git
cd lynference
git checkout <revision-of-interest>

2. Update requirements

We might change/update the requirements.txt used to install the dependencies, so it is recommended you use

pip install -r requirements.txt

again after checking out the <revision-of-interest>.

3. Get the raw data

Now we download the raw data that is the starting point of the pipeline. Where to get it from is already defined in the .dvc files inside the data folder. We only need to tell DVC to go and get them:

dvc update --recursive ./data

Since they are also stored in the DVC remote on Azure, you could also do this:

dvc get https://github.com/rmnldwg/lynference --rev <revision-of-interest> data/<dataset>.csv

4. Start the pipeline

Finally, launch the pipeline with the command below. Note that it may take quite some time to finish (on the order of hours), but it should keep you updated about what is happening throughout.

dvc repro pipeline

5. Cleaning up

Assuming you have used venv, all you need to do to erase the entire virtual environment, the repository, pipeline and all associated data is to deactivate the environment, leave the repository, and delete it:

deactivate
cd ..
rm -rf lynference

📦 Releases

If you want to see a list of pipelines we have published so far, head over to the releases on GitHub. Every successful run of a pipeline will be published as a release, alongside a ZIP file containing a DVC remote for that exact run. Read here how to use it to fetch the data from it.

The development of these pipelines might happen in dedicated pipeline-xyz branches, which may reflect unfinished stages of a pipeline, where parts crash or where we are still figuring out some parameters.

🧭 Navigating the repo

Here's a little overview of this repository's contents and what they do:

📄 `dvc.yaml` and `pipeline/dvc.yaml`

The dvc.yaml inside the pipeline folder defines the commands that should be run to reproduce the pipeline. It also defines what each command depends on (input files and parameters/settings) and what it outputs. In this way, it can connect the individual stages into a directed acyclic graph (DAG), which is displayed at the top for the current pipeline.
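To illustrate what "connecting stages into a DAG" buys us, the sketch below derives a valid execution order from the stage dependencies. The dependencies are transcribed from the flowchart at the top, reading each arrow as "must finish before"; the four data files (which all feed into join) and the remote node are left out for brevity. This is a plain topological sort written for illustration, not how DVC itself is implemented.

```python
from collections import deque

# Each key can only run after all stages in its list have finished.
DEPENDS_ON = {
    "join": [],
    "enhance": ["join"],
    "filter": ["enhance"],
    "clean": ["filter"],
    "sampling": ["clean"],
    "evaluate": ["clean", "sampling"],
    "plot-corner": ["sampling"],
}


def execution_order(depends_on):
    """Return one valid stage order using Kahn's topological sort."""
    remaining = {stage: set(deps) for stage, deps in depends_on.items()}
    ready = deque(sorted(s for s, deps in remaining.items() if not deps))
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        # Mark `stage` as done for every stage that was waiting on it.
        for other in sorted(remaining):
            deps = remaining[other]
            if stage in deps:
                deps.remove(stage)
                if not deps:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("dependency cycle detected, this is not a DAG")
    return order


print(execution_order(DEPENDS_ON))
```

The "acyclic" part matters: if two stages depended on each other, no valid order would exist, which is why the function raises an error in that case.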

The dvc.yaml at the root of the repository does some additional work, like creating a visual representation of the mentioned DAG and - more importantly - exporting the current Python environment into a frozen.txt file. However, running this requires additional dependencies, and it is really only necessary when creating a pipeline.

Look at the files and the descriptions we have put at each stage to get an idea of what happens there.

⚠️ WARNING:
Leave the dvc.lock file unchanged; it is managed by DVC.

📄 `params.yaml`

This is a configuration file that defines parameters and settings for the individual stages in the pipeline. Almost all the scripts in the lyscripts repository take a --params argument where this file is passed and use some keys and values defined there.

We have put extensive comments in that file that explain what each entry there does.

📄 `requirements.txt` and 📄 `frozen.txt`

These two text files define the Python packages necessary to run the pipeline. Note that for reproduction, you should use frozen.txt, as it is always created at the end of each pipeline run.

The requirements.txt file is only used by us during development.
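If you want to check how closely your own environment matches a frozen.txt, a small script can compare the pinned versions against what is installed. The sketch below assumes that frozen.txt uses the `name==version` format produced by `pip freeze`, which we have not verified for every revision; lines in any other format are simply skipped. The function names are made up for this example.

```python
from importlib import metadata
from pathlib import Path


def parse_pins(text):
    """Extract {name: version} from lines of the form `name==version`."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if "==" not in line:
            continue  # skip anything that is not an exact pin
        name, version = line.split("==", 1)
        version = version.split(";", 1)[0].strip()  # drop environment markers
        pins[name.strip().lower()] = version
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for packages that differ.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Only runs when a frozen.txt is present in the current directory.
if Path("frozen.txt").is_file():
    print(find_mismatches(parse_pins(Path("frozen.txt").read_text())))
```

An empty result means every pinned package is installed in exactly the recorded version.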

📁 data

When you first clone the repository, this folder does not contain any data, only .dvc files. When you issue the dvc update command in step 3, DVC tries to get the actual data from the location defined in these .dvc files. In this case, they are fetched from the lyDATA repository.

📁 models

During the run of the pipeline, a lot of samples and predictions are produced. Most of them are stored in HDF5 files inside this models folder.

Essentially, all computationally intensive results are stored here, and plots and tables can be produced from them.

📁 plots

This stores both data series (e.g. as CSV files) and images of plots which are created during the pipeline run. Some of them serve as checks to ensure everything went smoothly during the computations.

🔜 Roadmap

We are aware that there is still work to do to make this more reproducible.

For instance, we have not yet managed to make the pipeline fully deterministic. It seems that at least one library we use does not respect numpy's random number generator. But we can guarantee that the end results are all within narrow margins, even if they are rerun from scratch.

Also, the way to set up the Python environment isn't super user-friendly yet. The gold standard is of course a Docker container, but we haven't gotten to that yet.

✉️ Anything unclear?

If there are still unanswered questions regarding this work, don't hesitate to ✉️ contact us. We are happy to help and will share whatever we can.

References

[1] R. Ludwig, B. Pouymayou, P. Balermpas, and J. Unkelbach, A hidden Markov model for lymphatic tumor progression in the head and neck, Sci Rep, vol. 11, no. 1, p. 12261, Dec. 2021, doi: https://doi.org/10.1038/s41598-021-91544-1.
