/async-encfs-dvc-demo

Data version control in privacy-preserving HPC workflows using DVC, EncFS, SLURM and Openstack Swift on https://castor.cscs.ch

Primary LanguageJupyter Notebook

Data version control in privacy-preserving HPC workflows using DVC, EncFS, SLURM and Openstack Swift

This project applies infrastructure-as-code principles to DVC and integrates it with EncFS and SLURM to track results of scientific HPC workflows in a privacy-preserving manner, exchanging them via the OpenStack Swift object storage at https://castor.cscs.ch.

The key features extending DVC include

  • SLURM integration for HPC clusters: DVC stages and their dependencies can be executed asynchronously using SLURM (dvc repro submits a SLURM job, dvc commit is run upon job completion)
  • privacy-preservation: DVC stages can utilize a transparently encrypted filesystem with EncFS ensuring no unencrypted data is persisted to storage or exchanged through DVC (see further details)
  • container support: DVC stages can be run with the Docker and Sarus engines such that code dependencies are tracked via Git-SHA-tagged container images, making stages fully re-executable
  • infrastructure-as-code practice: DVC repository and stage structure can be encoded into reusable YAML policies, enabling different users to generate DVC stages that comply to the same workflow organization

These capabilities extend, rather than modify, DVC and can largely be used independently. The repository includes three demo applications as surrogates for workflow stages - app_ml for a machine learning application, app_sim for a simulation and app_prep for a preprocessing step that is performed manually or automated. Each of them is accompanied by a corresponding app policy that references repository and stage policies, all of which can be customized inside a DVC repository to reflect evolving requirements. This design facilitates a clean separation of data-related configuration from application logic in code. A scientific workflow may also include custom application protocols. These could be defined in an additional package (e.g. another folder on the same level als app_...) and imported by the participating applications. It is important to note that DVC does not have a concept for application protocols, but only tracks dependencies between files. Finally, the repository also contains a full example in ex_vit which demonstrates the use of the package in distributed training and inference with a deep learning model.

The next sections cover the installation and usage of the tool. Specifically, the latter introduces a series of tutorial notebooks that exemplify various use cases. For further details on usage and implementation, please consult the dedicated documentation. Finally, an overview on performance on Piz Daint and Castor's object storage is attached in the performance report.

Installation

The package can be installed in your Python environment with

pip install git+https://github.com/eth-cscs/async-encfs-dvc.git

This will install all dependencies except for EncFS. If encryption is required, follow the separate installation instructions.

Usage

A set of notebooks illustrating the usage of individual features is available in the examples directory.

  • a demonstration of DVC stage generation with infrastructure-as-code principles for a machine learning pipeline is available in the ML repository tutorial.
  • an iterative simulation workflow running with Docker containers on encrypted data can be found in the EncFS-simulation tutorial. This workflow is also used as a performance benchmark (see results below)
  • a DVC workflow with asynchronous stages in SLURM and Sarus containers is available in the SLURM tutorial
  • a PyTorch deep learning application that integrates the above concepts to run distributed training and inference with a vision transfomer model on encrypted data using SLURM

For a step-by-step guide on setting up a DVC repository to track workflow results, (optionally) using encryption and Castor's object storage as a remote, please refer to the setup guide. For details on available commands, consult the command reference.

Test suite

The project also includes a test suite that can be run with tox. If EncFS and SLURM are not installed locally, use tox -e py39-default to only run tests without these requirements. In contrast, to run only EncFS-tests, execute tox -e py39-encfs and to run only the SLURM-tests use tox -e py39-slurm.

Performance on Piz Daint and Castor

We use the iterative_sim benchmark with and without encryption as illustrated in the EncFS-simulation tutorial on Piz Daint and Castor. This benchmark creates and runs a pipeline of DVC stages that form a linear dependency graph. In contrast to the corresponding tutorial that focuses on a single node with Docker, every app_sim stage is run with Sarus on 8 GPU nodes and 16 ranks. Each rank writes its payload sampled from /dev/urandom with dd to the filesystem using a single thread. The aggregate output payload per DVC stage is increased in powers of 2, from 16 GB to 1.024 TB in our runs (using decimal units, i.e. 1 GB = 10^9 B). The subsequent dvc commit and dvc push commands are run on a single multi-core node. The software configuration used is available at iterative_sim.config.md and detailed logs can be found in the benchmarks branch.

We use three different configurations for

  • large files: 1 file per rank, i.e. starting with 1 GB
  • medium-sized files: 10^3 files per rank, i.e. starting with 1 MB
  • small files: 10^4 files per rank, i.e. starting with 100 KB and vary the total per-rank payload from 1 GB to 64 GB in powers of 2 (as stated above). On the scratch filesystem on Piz Daint, creating the 7 stages for the DVC pipeline of a single configuration takes around 18 s (irrespective of whether EncFS is used).

When run without encryption/EncFS, we obtain the following results on Piz Daint for large files

individual file size per-rank payload aggregate payload stage time (SLURM step) dvc commit time (SLURM step) dvc push time to Castor (SLURM step)
1 GB 1 GB 16 GB 0m 27.052s 0m 46.797s 2m 33.804s
2 GB 2 GB 32 GB 0m 31.342s 1m 19.278s 4m 2.548s
4 GB 4 GB 64 GB 1m 22.411s 2m 41.182s 7m 43.655s
8 GB 8 GB 128 GB 2m 0.840s 5m 30.745s 14m 25.279s
16 GB 16 GB 256 GB 3m 59.262s 11m 14.833s 26m 37.984s
32 GB 32 GB 512 GB 7m 42.820s 26m 35.141s 55m 34.079s
64 GB 64 GB 1.024 TB 15m 35.199s 41m 4.529s 110m 26.116s

with medium-sized files

individual file size per-rank payload aggregate payload stage time (SLURM step) dvc commit time (SLURM step) dvc push time to Castor (SLURM step)
1 MB 1 GB 16 GB 0m 23.444s 17m 36.499s 8m 3.508s
2 MB 2 GB 32 GB 0m 33.024s 19m 52.038s 9m 2.840s
4 MB 4 GB 64 GB 1m 1.768s 22m 33.451s 13m 32.501s
8 MB 8 GB 128 GB 2m 7.507s 36m 13.483s 20m 17.103s
16 MB 16 GB 256 GB 4m 15.086s 39m 57.099s 35m 32.338s
32 MB 32 GB 512 GB 7m 43.991s 53m 24.634s 65m 55.366s
64 MB 64 GB 1.024 TB 15m 40.195s 76m 0.389s 123m 25.036s

and with small files

individual file size per-rank payload aggregate payload stage time (SLURM step) dvc commit time (SLURM step) dvc push time to Castor (SLURM step)
100 KB 1 GB 16 GB 0m 52.073s 202m 25.826s 50m 17.943s
200 KB 2 GB 32 GB 0m 50.400s 199m 3.391s 55m 47.384s
400 KB 4 GB 64 GB 1m 19.986s 209m 17.751s 59m 49.833s
800 KB 8 GB 128 GB 2m 43.147s 234m 42.978s 68m 18.372s
1.600 MB 16 GB 256 GB 4m 27.615s 237m 40.735s 82m 38.656s
3.200 MB 32 GB 512 GB 8m 5.145s 272m 10.639s 110m 47.234s
6.400 MB 64 GB 1.024 TB 15m 55.370s 407m 14.425s 167m 18.985s

We obtain the following performance numbers for the application stage in the same configurations with encryption/EncFS

per-rank payload aggregate payload stage time (large files) stage time (medium files) stage time (small files)
1 GB 16 GB 0m 39.529s 0m 55.533s 1m 57.455s
2 GB 32 GB 1m 17.725s 1m 15.242s 2m 35.338s
4 GB 64 GB 2m 28.160s 2m 49.697s 3m 27.868s
8 GB 128 GB 4m 57.152s 4m 59.661s 5m 44.948s
16 GB 256 GB 9m 38.297s 9m 46.761s 10m 5.082s
32 GB 512 GB 18m 47.857s 19m 52.775s 18m 47.612s
64 GB 1.024 TB 37m 27.673s 38m 53.677s 44m 10.290s

These results can be summarized in the following bandwith plot as a function of individual file size.

iterative_sim benchmark results on Piz Daint/Castor

When sampling from /dev/zero instead of /dev/urandom, the write-throughput of the application stage is about 4-5x higher without encryption than with EncFS for large files.

Note that only one dvc commit or dvc push SLURM job can run per DVC repo at any time, while stages can run be concurrently (as long as they are not DVC dependencies of one another). If dvc commit or dvc push are throughput-limiting steps, the most effective measure is to avoid small file sizes (>= 10 MB is ideal). Besides that, one can increase the performance by running disjoint pipelines in separate clones of the Git/DVC repo. To avoid redundant transfers of large files over a slow network connection, it can be useful to synchronize over the local filesystem by adding a local remote, e.g. with

dvc remote add daint-local $SCRATCH/path/to/remote

and then

dvc push/pull --remote daint-local <stages>

In this manner, the download overhead of shared dependencies can be avoided (file hashes are recomputed, however). Furthermore, to avoid long file hash recalculations (in dvc commit) upon small, localized changes in a very large file, try to store it as multiple separate files rather than a single very large one if it needs to be changed regularly. When e.g. using HDF5 as an application protocol, consider using external links to split the HDF5 into multiple files each containing a dataset (cf. this discussion).

If these techniques do not alleviate the issue with throughput, a draft of running dvc commit/push operations out-of-repo instead in-repo is available (can be activated by exporting DVC_SLURM_DVC_OP_OUT_OF_REPO=YES). The intention is to run the computationally expensive part in e.g. dvc commit in a separate, temporary DVC repo with all top folders under $(dvc root) except .dvc as symbolic links to the original repo and then have a short-running process that synchronizes with the main repo. The jobs running on DVC repos outside the main one are then parallelizable. Currently, there is no speedup for dvc commit, though, as file hashes are recomputed on every dvc pull (i.e. the cache.db's entries are not synchronized by a local dvc pull).