- Install Nextflow (requires Java version 11 or higher) with the following command (the resulting `nextflow` executable can be moved to any directory you want):
```bash
curl -fsSL get.nextflow.io | bash
```
- Install Singularity (https://docs.sylabs.io/guides/3.0/user-guide/installation.html).
  Note: Singularity images are created when executing the pipeline. They are automatically cached both in the Nextflow `work` directory and in your home directory. The Singularity cache directory can be changed from the home directory to any other directory via the environment variable `SINGULARITY_CACHEDIR` in your `.bashrc`.
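  For example, a hypothetical line in your `.bashrc` (the path is an assumption; pick any directory with enough disk space):
```bash
# Hypothetical cache location for Singularity images
export SINGULARITY_CACHEDIR=/data/singularity_cache
```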
- Clone the grn-benchmark repository and navigate to it:
```bash
git clone git@github.com:bionetslab/grn-nextflow.git && cd grn-nextflow
```
- Open `nextflow.config` and change the parameter `singularity.runOptions` to the folder that contains your Nextflow `work` directory and to the folder that will contain your results folder (user-defined at the start of the pipeline). If these are separate folders, use a comma-separated list.
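  A minimal sketch of what this could look like, assuming the parameter takes standard Singularity `--bind` options and that the two hypothetical paths below are the parent folders of your `work` directory and your results folder (check the existing value in `nextflow.config` for the exact expected format):
```groovy
// Hypothetical bind paths; replace with the parent folders of your work and results directories
singularity {
    runOptions = "--bind /path/to/work_parent,/path/to/results_parent"
}
```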
Now you are set to run the benchmark!
If you want to run the resulting Shiny app saved in `example_pipeline_output`, use the following commands (replace `$RESULT_DIRECTORIES` with all result output directories of the pipeline inside `output`). Because the pipeline currently stores results as symbolic links, the first command creates a hard copy of the files:
```bash
cp -rL example_pipeline_output output && cd output
singularity pull docker://nicolaimeyerhoefer/shiny_app
singularity exec --bind=./app:/app/,./data:/data/,./$RESULT_DIRECTORIES/:/$RESULT_DIRECTORIES/ ./shiny_app_latest.sif 'app/run_shiny.sh'
```
TODO: Adjust the pipeline so the copying step is not necessary (currently needed because the pipeline only creates symlinks).
See this PDF: Network Explanation
This section describes piece by piece how to run this pipeline. Examples will be provided along the way and at the end of this section.
To run the Nextflow pipeline, use the following command and swap out the parameters to fit your needs. The next sections go over the parameters in detail:
```bash
${path_to_nextflow}/nextflow run main.nf --tools=${tools_to_run} --mode=${data_mode} --input=${data_input} -params-file ${config_file} --publish_dir=${output_data_path}
```
The `--tools` parameter needs to be set to identify the tools that are used in the pipeline. Currently available tools are:
- DGRN inference tools:
  - `boostdiff` (https://github.com/gihannagalindez/boostdiff_inference)
  - `zscores` (https://doi.org/10.1371/journal.pone.0009202)
  - `diffcoex` (https://doi.org/10.1186/1471-2105-11-497)
- GRN inference tools:
  - `grnboost2`
The `--tools` parameter needs to be set as a comma-separated list. For example, if you want to use boostdiff and grnboost2, you need to set `--tools=boostdiff,grnboost2`.
The `--mode` parameter needs to be set to identify the type of input data you are using. Currently available modes are `seurat`, `tsv`, and `anndata`.
The full path has to be set for all input files! None of the values in the columns that are used for selection in the configuration file may contain ",", "-", or ":".
For `--mode=seurat`: use the `--input` parameter to set the path to the Seurat file. The file type must be `.RDS`. If you are using this mode, you need to provide a configuration file with the `-params-file` parameter that contains information about the grouping/filtering that should be done in the Seurat object for your specific needs. See `example_config.yaml` for instructions and an example of how to write a config file for your dataset.
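As an illustration, a `seurat`-mode run could look like the sketch below; all paths and the tool selection are hypothetical and need to be adapted to your setup.
```bash
# Hypothetical example of a seurat-mode run; adjust paths and tools to your setup
${path_to_nextflow}/nextflow run main.nf \
  --tools=boostdiff,grnboost2 \
  --mode=seurat \
  --input=/absolute/path/to/dataset.RDS \
  -params-file /absolute/path/to/example_config.yaml \
  --publish_dir=/absolute/path/to/results
```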
For `--mode=tsv`: do not set the `--input` parameter! Use the `--input_file1` parameter to set the path to the first tsv file and the `--input_file2` parameter to set the path to the second tsv file. The first column of the tsv files has to be named `Gene` and contain all gene names; the following columns represent the samples. If you are using this mode, you need to set the `--comparison_id` parameter. It needs to be an identifiable string because the folder with the results will be named after it. You do not need to set the `-params-file` parameter, as no grouping/filtering can be done on tsv files.
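As an illustration, a `tsv`-mode run could look like the sketch below; the file paths, the comparison id, and the tool selection are hypothetical.
```bash
# Hypothetical example of a tsv-mode run; adjust paths and the comparison id to your setup
${path_to_nextflow}/nextflow run main.nf \
  --tools=boostdiff,grnboost2 \
  --mode=tsv \
  --input_file1=/absolute/path/to/condition_A.tsv \
  --input_file2=/absolute/path/to/condition_B.tsv \
  --comparison_id=conditionA_vs_conditionB \
  --publish_dir=/absolute/path/to/results
```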
For `--mode=anndata`: set the `--input` parameter to the path to the AnnData object. The file type must be `.h5ad`. The AnnData object will be converted to a Seurat object in the pipeline. If you are using this mode, you need to provide a configuration file with the `-params-file` parameter that contains information about the grouping/filtering that should be done in the AnnData object for your specific needs. See `example_config.yaml` for instructions and an example of how to write a config file for your dataset.
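As an illustration, an `anndata`-mode run could look like the sketch below; again, all paths and the tool selection are hypothetical.
```bash
# Hypothetical example of an anndata-mode run; adjust paths and tools to your setup
${path_to_nextflow}/nextflow run main.nf \
  --tools=boostdiff \
  --mode=anndata \
  --input=/absolute/path/to/dataset.h5ad \
  -params-file /absolute/path/to/example_config.yaml \
  --publish_dir=/absolute/path/to/results
```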
The `--publish_dir` parameter sets the path to the results folder where the results/outputs will be written. This folder must already exist!
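For example, you can create it beforehand (the path is hypothetical):
```bash
# Create the results folder before starting the pipeline (hypothetical path)
mkdir -p /absolute/path/to/results
```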
Optional parameters:
- `--create_metacells`: Default value: `TRUE`. Determines whether metacells should be created or not. If you do not use metacells, the computation runtime of the implemented tools is very long.
- `--work`: Default value: the folder from which you start the Nextflow pipeline. Change this if you want the internal Nextflow files to be stored somewhere else. The internal Nextflow files can be quite big, so be careful if you have limited disk space.
- `--n_runs`: Default value: `10`. Determines how often the tools that rely on randomization (boostdiff, grnboost2) are run. This is done to improve the robustness of these tools.
- `--use_tf_list`: Default value: `false`. Determines whether boostdiff should use a transcription factor (TF) list so that edges are only inferred from a gene in this list to any other gene. This reduces the computation time. It currently only works if the underlying organism is human. If this parameter is set to `false`, all genes are compared against all genes. WIP: bugfix/extend to other organisms.
- See `nextflow.config` for all tool-specific and Nextflow-specific parameters.
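For instance (all values and paths are hypothetical), optional parameters are simply appended to the run command:
```bash
# Hypothetical example; optional parameters are added on top of the required ones
${path_to_nextflow}/nextflow run main.nf \
  --tools=boostdiff,grnboost2 \
  --mode=seurat \
  --input=/absolute/path/to/dataset.RDS \
  -params-file /absolute/path/to/example_config.yaml \
  --publish_dir=/absolute/path/to/results \
  --n_runs=5 \
  --work=/absolute/path/to/nextflow_work
```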
README is WIP: Information to come:
- Structure of the pipeline
- Instructions/example on how to extend the pipeline with a new tool or analysis