Large scale correlation network analysis for unraveling the coordination of complex biological systems
Advanced measurement and data storage technologies have enabled high-dimensional profiling of complex biological systems. Modern multiomics studies regularly produce datasets with hundreds of thousands of measurements per sample, enabling a new era of precision medicine. Correlation analysis is an important first step to gain deeper insights into the coordination and underlying processes of such complex systems. However, the construction of large correlation networks in modern high-dimensional datasets remains a major computational challenge due to rapidly growing runtime and memory requirements. We address this challenge by introducing CorALS, an open-source framework for the construction and analysis of large-scale parametric as well as nonparametric correlation networks for high-dimensional biological data. It features off-the-shelf algorithms suitable for both personal and high-performance computers, enabling workflows and downstream analysis approaches. We illustrate the broad scope and potential of CorALS by exploring perspectives on complex biological processes in large-scale multiomics and single-cell studies.
Note: This repository provides the code to reproduce the results and analyses from the manuscript. The corresponding Python package is maintained in a separate repository.
Generally, the provided code is executable on any machine that can run Docker. It has been run and tested on Docker version 20.10.7, build f0df350.
Any regular hardware that can run Docker should be able to run the code. However, several baseline experiments as well as some of the large-scale experiments require dedicated server hardware. We recommend a machine with at least 64 cores and 312 GB of memory to produce results comparable to the benchmarks reported in the manuscript.
Clone this repository:
git clone git@github.com:mgbckr/corals-experiments.git
# git clone https://github.com/mgbckr/corals-experiments.git
Make sure Docker is installed and configured properly. For this, refer to the official Docker installation guide.
Runtime: Building the Docker image can take an hour or longer, since several R libraries have to be compiled from source.
In the following, we assume you are running on Linux to start the Docker container. Please adjust as necessary.
# change into the experiments repository
cd corals-experiments
# pull Docker image from Docker Hub
docker pull mgbckr/corals-experiments:1.0.0
# or build your own Docker images
# > docker build . -f Dockerfile.base -t mgbckr/corals-experiments-base:1.0.0
# > docker build . -t mgbckr/corals-experiments:1.0.0 --build-arg BASE_IMAGE=mgbckr/corals-experiments-base:1.0.0
# or only build the final Docker image and pull the base image from Docker Hub
# > docker build . -t mgbckr/corals-experiments:1.0.0 --build-arg BASE_IMAGE=mgbckr/corals-experiments-base:1.0.0
docker run -v $(pwd):/workspace \
-it mgbckr/corals-experiments:1.0.0
# WITHIN the container run
pip install -e .
Runtime: About 15-20 minutes
After starting the Docker container, please run the following commands from inside the Docker container.
# create data dir
mkdir -p data/raw
mkdir -p data/processed
wget https://zenodo.org/record/7713898/files/preeclampsia_data.zip
unzip preeclampsia_data.zip
rm preeclampsia_data.zip
wget https://zenodo.org/record/7713898/files/multiomics_data.zip
unzip multiomics_data.zip
rm multiomics_data.zip
mkdir -p data/raw/cancer
bash src/coralsarticle/data/process/cancer_download.sh
wget https://zenodo.org/record/7713898/files/singlecell_data.zip
unzip singlecell_data.zip
rm singlecell_data.zip
python src/coralsarticle/data/prepare.py
Runtime: The currently enabled experiments can already take several hours; the full set of experiments can run for a week or more. See the main manuscript for runtimes of individual experiments. Experiments are repeated two to ten times depending on runtime. Generally, running on a larger machine is recommended (at least 64 cores and 312 GB of memory), since the baseline methods in particular require a lot of memory.
Note: Particularly long-running or resource-intensive experiments are commented out or excluded. See the files in the config folder for more details and additional experiments to reproduce all results.
# default
python src/coralsarticle/benchmark/resources.py -c config/full/bench_full_default.yml
# python src/coralsarticle/benchmark/resources.py -c config/full/bench_full_default_cancer_small.yml
# python src/coralsarticle/benchmark/resources.py -c config/full/bench_full_default_cancer_medium.yml
# python src/coralsarticle/benchmark/resources.py -c config/full/bench_full_default_singlecell.yml
# parallel
python src/coralsarticle/benchmark/resources.py -c config/full/bench_full_parallel.yml
# python src/coralsarticle/benchmark/resources.py -c config/full/bench_full_parallel_cancer_small.yml
# python src/coralsarticle/benchmark/resources.py -c config/full/bench_full_parallel_cancer_medium.yml
# python src/coralsarticle/benchmark/resources.py -c config/full/bench_full_parallel_singlecell.yml
# default
python src/coralsarticle/benchmark/resources.py -c config/topk/bench_topk_default.yml
# python src/coralsarticle/benchmark/resources.py -c config/topk/bench_topk_default_cancer_small.yml
# python src/coralsarticle/benchmark/resources.py -c config/topk/bench_topk_default_cancer_medium-large.yml
# python src/coralsarticle/benchmark/resources.py -c config/topk/bench_topk_default_singlecell.yml
# parallel
# python src/coralsarticle/benchmark/resources.py -c config/topk/bench_topk_parallel_memory.yml
# python src/coralsarticle/benchmark/resources.py -c config/topk/bench_topk_parallel_runtime.yml
# full - r comparison
python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_full_r-comparison.yml
### top-k tree variants
python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_topk_tree-variants.yml
### full - synthetic
python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_full_synthetic.yml
### topk - synthetic
python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_topk_synthetic_features.yml
python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_topk_synthetic_samples.yml
### top-k parallel
python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_topk_parallel_memory.yml
python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_topk_parallel_runtime.yml
# top-k parallel - slow
# python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_topk_parallel_memory_slow.yml
# python src/coralsarticle/benchmark/resources.py -c config/supplement/bench_topk_parallel_runtime_slow.yml
Runtime: Several hours
# topk
# python src/coralsarticle/benchmark/accuracy_topk.py --data preeclampsia_postprocessed_nonegatives_dropduplicates --k_ratio 0.001 --max_approx 10 --n_threads 64
# python src/coralsarticle/benchmark/accuracy_topk.py --data pregnancy_postprocessed_nonegatives_dropduplicates --k_ratio 0.001 --max_approx 10 --n_threads 64
# python src/coralsarticle/benchmark/accuracy_topk.py --data cancer_postprocessed_nonegatives_dropduplicates_sample-0.25 --k_ratio 0.001 --max_approx 10 --n_threads 64
# topk diff
# python src/coralsarticle/benchmark/accuracy_topkdiff.py --data pregnancy --k_ratio 0.001
Runtime:
- Setup: ~1 hour
- Experiments: 10 minutes
The folder scripts contains additional code snippets for comparing a variety of other software packages to CorALS. See scripts/benchmark_cor.sh and scripts_topkdiff.py.
Start a Jupyter server within the Docker container:
docker run -v $(pwd):/workspace \
--publish 17299:8888 \
-it mgbckr/corals-experiments:1.0.0
# WITHIN the Docker container:
pip install -e .
jupyter lab --allow-root --ip 0.0.0.0 --NotebookApp.token=''
Open http://localhost:17299 in your browser and run the notebooks in the notebooks folder.
Runtime: The notebooks (xx_application_*) can run for a long time, from several hours up to a day. Running on a larger machine is recommended (at least 64 cores and 312 GB of memory).
The notebooks reproduce all results, tables, and figures from the manuscript.
In order to skip some of the long-running tasks like benchmarking, dimensionality reduction, and sampling, we provide intermediate data. This allows you to:
- skip the benchmarking experiments (see above)
- skip 02_application_multiomics_prepare-embeddings.ipynb
- run 03.02_application_multiomics.ipynb instead of 03.01_application_multiomics.ipynb
- skip 04_application_singlecell_prepare-data.ipynb
For this, run the following:
wget https://zenodo.org/record/7713898/files/benchmark.zip
wget https://zenodo.org/record/7713898/files/multiomics_figure.zip
wget https://zenodo.org/record/7713898/files/singlecell_data.zip # skip if already done above for downloading data
wget https://zenodo.org/record/7713898/files/singlecell_figure.zip
unzip '*.zip'  # quote the wildcard so unzip itself matches and extracts each archive
rm *.zip
Note: In order to produce the data statistics table (5.1 Data Table) in 01_benchmarks.ipynb, you still need to run the data preprocessing scripts mentioned above.
The Jupyter notebook notebooks/07_examples.ipynb contains examples of how to use the Python implementation of CorALS. Additionally, the Python package and the corresponding GitHub repository provide further examples, documentation, and tests (notebook, tests).
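For orientation, here is a minimal usage sketch. The import paths and function names below (set_threads_for_external_libraries, cor_full, cor_topk) are assumptions based on the separately maintained package and may differ in your installed version; treat notebooks/07_examples.ipynb and the package documentation as authoritative.

# minimal usage sketch (assumed API; see notebooks/07_examples.ipynb for the authoritative examples)

# limit threads used by external libraries; per the package README this is done before importing numpy
from corals.threads import set_threads_for_external_libraries
set_threads_for_external_libraries(n_threads=1)

import numpy as np

# toy data: 50 samples x 2,000 features
X = np.random.random((50, 2000))

# full correlation matrix (assumed entry point)
from corals.correlation.full.default import cor_full
cor_matrix = cor_full(X)

# top-k correlation search returning the strongest 0.1% of correlations (assumed entry point)
from corals.correlation.topk.default import cor_topk
cor_topk_values, cor_topk_coo_index = cor_topk(X, k=0.001, n_jobs=8)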
CorALS generally outperforms comparable full correlation matrix methods such as numpy.corrcoef. Thus, we recommend using CorALS for full correlation matrix estimation as long as the final matrix fits into memory. Otherwise, top-k estimation may be a better choice.
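To put this recommendation in perspective, the following illustrative snippet shows the standard numpy baseline and why memory is the limiting factor: the resulting feature-by-feature matrix grows quadratically with the number of features.

import numpy as np

# illustrative example: 100 samples x 10,000 features
X = np.random.random((100, 10_000))

# columns are features, so rowvar=False yields a feature x feature correlation matrix
cor = np.corrcoef(X, rowvar=False)

# the full matrix grows quadratically with the number of features:
# 10,000 features -> ~0.8 GB as float64; 100,000 features -> ~80 GB
print(cor.shape, cor.nbytes / 1e9, "GB")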
For top-k correlation search, we recommend the basic CorALS implementation (referred to as matrix in Table 3) as long as the full correlation matrix fits into memory, independent of the number of samples. However, as the number of features increases, memory constraints eventually make this approach infeasible; in that case, switching to the index-based CorALS implementation is the best option.
Note 1: With increasing sample numbers, CorALS becomes slower, which may warrant other heuristics such as dimensionality reduction, locality-sensitive hashing, or random projections. However, this exploration is left for future work.
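As a rough, purely illustrative sketch of the random projection idea (this is not part of CorALS): Pearson correlation equals the inner product of standardized, L2-normalized feature vectors, and a Gaussian random projection of the sample dimension approximately preserves such inner products.

import numpy as np

rng = np.random.default_rng(0)

# many samples: 20,000 samples x 500 features (toy data)
X = rng.random((20_000, 500))

# Pearson correlation = inner product of standardized, L2-normalized feature vectors
Z = (X - X.mean(axis=0)) / X.std(axis=0)
Z /= np.linalg.norm(Z, axis=0)

# project the sample dimension down to 1,000 components (Johnson-Lindenstrauss);
# inner products, and hence correlations, are approximately preserved
P = rng.normal(size=(1_000, 20_000)) / np.sqrt(1_000)
Z_proj = P @ Z

cor_approx = Z_proj.T @ Z_proj   # approximate feature x feature correlations
cor_exact = Z.T @ Z
print(np.abs(cor_approx - cor_exact).max())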
Note 2: By default, the top-k approximation approach does not guarantee symmetric results, i.e., even if cor(x, y) is returned, cor(y, x) may be missing. This can be addressed by various post-processing steps, e.g., by adding the missing values; CorALS provides an option to enable this.
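To illustrate the post-processing idea (the variable names and top-k output format below are illustrative only, not the exact CorALS return format), missing mirrored entries can be filled in from the returned orientation:

import numpy as np
from scipy import sparse

# illustrative top-k result: correlation values plus (row, col) feature index pairs
n_features = 5
values = np.array([0.9, -0.8, 0.7])
rows = np.array([0, 2, 1])
cols = np.array([3, 4, 0])

# add mirrored entries where only one orientation was returned
pairs = {}
for v, i, j in zip(values, rows, cols):
    pairs[(i, j)] = v
    pairs.setdefault((j, i), v)  # only fills (j, i) if it is not already present

sym_rows, sym_cols = map(np.array, zip(*pairs.keys()))
cor_sym = sparse.coo_matrix(
    (np.array(list(pairs.values())), (sym_rows, sym_cols)),
    shape=(n_features, n_features),
)
print(cor_sym.toarray())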
If you want to modify the libraries while running things in Docker:
- Clone the different implementation repositories:

  git clone git@github.com:mgbckr/corals-lib-python.git
  git clone git@github.com:mgbckr/corals-lib-julia.git
  git clone git@github.com:mgbckr/corals-lib-r.git

- Run Docker with the following options (add further arguments as needed, e.g., --publish 17299:8888):

  docker run -v $(pwd):/workspace \
      -v $(pwd)/../corals-lib-python:/opt/libraries/python \
      -v $(pwd)/../corals-lib-julia:/opt/libraries/julia \
      -v $(pwd)/../corals-lib-r:/opt/libraries/r \
      -it mgbckr/corals-experiments:1.0.0

- Within the Docker container, it may be necessary to run pip install -e . once for the Python library:

  cd /opt/libraries/python
  pip install -e .