version = '0.0.15'
Workflow to download and prepare TCGA data.
The workflow divides the process into three steps:
- Downloading the raw data from GDC and saving the rds/tables needed later
- Preparing the data. This step includes filtering and normalizing the data, among other operations
- Analysis of gene regulatory networks
The idea is that data should be downloaded once, and then prepared for the task at hand.
Where can I find more details about this workflow?
For more details about the scope and use of this workflow, for instance to decide whether it is useful for your research, we recommend the paper: "Reproducible processing of TCGA regulatory networks".
Are there examples of how to configure the workflow or sample datasets?
Of course! In the QuackenbushLab/tcga-data-supplement repository you will find the companion data and configuration files for the paper. You can read about a full analysis we did on colon cancer and find all links/instructions for the precomputed GRNs of common cancer types.
- First you'll need to install Nextflow on your machine. Follow the hello world example to check that Nextflow is up and running.
- Pull the workflow:
  nextflow pull QuackenbushLab/tcga-data-nf
- Install and pull the docker/singularity container, or use conda, to run the whole pipeline.
  3a. DOCKER:
- Run some test workflows:
  4a. test the download: nextflow run QuackenbushLab/tcga-data-nf -profile <docker/conda>,testDownload
  4b. test the prepare: nextflow run QuackenbushLab/tcga-data-nf -profile <docker/conda>,testPrepare
  4c. test the analyze: nextflow run QuackenbushLab/tcga-data-nf -profile <docker/conda>,testAnalyze
  4d. test the full workflow: nextflow run QuackenbushLab/tcga-data-nf -profile <docker/conda>,test
If you can run all these steps, you can proceed to define your own configuration files and run your own analysis.
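The four test runs listed above can also be scripted. The following is a minimal sketch, not part of the workflow itself: the `run_tests` function and the `ENGINE` and `DRY_RUN` variables are hypothetical names introduced here. By default it only prints the commands, so it can be checked without Nextflow installed.

```shell
#!/usr/bin/env bash
# Sketch: run the four test profiles in order, stopping at the first failure.
set -euo pipefail

ENGINE="${ENGINE:-docker}"   # hypothetical variable: "docker" or "conda"
DRY_RUN="${DRY_RUN:-1}"      # hypothetical variable: 1 = only print the commands

# run_tests ENGINE DRY_RUN  (hypothetical helper)
run_tests() {
  local engine="$1" dry_run="$2" profile
  for profile in testDownload testPrepare testAnalyze test; do
    # No space after the comma, otherwise the shell splits the profile
    # list into two separate arguments.
    local cmd=(nextflow run QuackenbushLab/tcga-data-nf -profile "${engine},${profile}")
    if [ "$dry_run" = "1" ]; then
      echo "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}

run_tests "$ENGINE" "$DRY_RUN"
```

To actually launch the tests with conda, one could save the script under a name of your choice and run it with `ENGINE=conda DRY_RUN=0` set in the environment.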
Check the docs for AWS for the steps on how to run the workflow on a simple EC2 instance. These steps could also help as a quickstart to check that you have everything up and running.
The docker container is hosted on docker.io.
docker pull violafanfani/tcga-data-nf:0.0.14
More details on the container can be found in the docs
Alternatively, one can run the workflow with conda environments.
To create and use the conda environments, pass conda as a profile (-profile conda), for example:
nextflow run QuackenbushLab/tcga-data-nf -profile conda,test ...
For the moment we are using a single environment for all the R scripts. This lets the pipeline generate the environment only once (which can be time consuming) and then reuse it.
The three environments are inside the containers/conda_envs folder:
- merge_tables
- r_all
- analysis
However, to improve portability, we use process selector labels to assign the different environments, which also lets users specify their own environments.
More details in the docs.
Before running the workflow, we recommend pulling the latest version with the following command:
nextflow pull QuackenbushLab/tcga-data-nf
Run the workflow with the nextflow run command and a custom configuration file:
nextflow run QuackenbushLab/tcga-data-nf -c my-config.conf
Below we give more details on the configuration steps.
First, there are three main parameters that need to be passed by the user:
- resultsDir = "results": general folder under which you want to find the results. This can directly reference an AWS S3 bucket.
- batchName = "my-batch": name of the run. This creates a subfolder where the results are stored.
- pipeline = 'download': name of the pipeline, one of download, prepare, analyze, full.
This way all data generated by the pipeline will be found inside the resultsDir/batchName/
folder.
If nothing is passed, all results will be in the results/my-batch
folder.
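As an illustration, a custom configuration file with these three parameters could be generated as sketched below. This is a sketch under assumptions: `write_config` is a hypothetical helper, and placing the parameters inside a `params { }` block is inferred from the `--resultsDir` and `--pipeline` command-line options used elsewhere in this README, not copied from the workflow's own configuration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# write_config FILE RESULTS_DIR BATCH_NAME PIPELINE  (hypothetical helper)
# Writes a minimal Nextflow config and prints the expected output folder.
write_config() {
  local file="$1" results_dir="$2" batch_name="$3" pipeline="$4"
  # Reject pipeline names other than the four listed above.
  case "$pipeline" in
    download|prepare|analyze|full) ;;
    *) echo "unknown pipeline: $pipeline" >&2; return 1 ;;
  esac
  # Assumption: the three values live in the params scope.
  cat > "$file" <<EOF
params {
  resultsDir = "$results_dir"
  batchName  = "$batch_name"
  pipeline   = "$pipeline"
}
EOF
  # Results are expected under resultsDir/batchName/
  echo "${results_dir%/}/${batch_name}/"
}

write_config my-config.conf results my-batch download
```

The generated file can then be passed to the workflow with `nextflow run QuackenbushLab/tcga-data-nf -c my-config.conf`.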
Secondly, you'll need a
For a full list of the configuration parameters check here.
Below is the structure you can expect in the output folder when you run each pipeline.
For each case we report the output of the testing profile:
- download pipeline: -profile testDownload
- prepare pipeline: -profile testPrepare
- analyze pipeline: -profile testAnalyze
- full pipeline: -profile test
Detailed output folder structure can be found at the docs
If you want to modify the workflow and/or run it locally:
- Fork the repo into your own GitHub account.
- Clone the forked Nextflow repo:
  git clone git@github.com:myaccount/tcga-data-nf.git
- Build the docker image locally:
  docker build . -f containers/Dockerfile --build-arg CONDA_FILE=containers/env.base.python.yml --no-cache -t my-tcga-data-nf:latest
- Alternatively, use the conda profile.
- Run your workflow:
  nextflow run . -profile testDownload --resultsDir myresults/ --pipeline download -with-docker my-tcga-data-nf:latest
Maintainer of the workflow:
- Viola Fanfani (vfanfani@hsph.harvard.edu)
Maintainer of NetworkDataCompanion:
- Kate Hoff Shutta
Other contributors:
- Panagiotis Mandros
- Jonas Fischer
- Soel Micheletti
- Enakshi Saha
- Chen Chen
The companion preprint is now on bioRxiv:
Reproducible processing of TCGA regulatory networks. Viola Fanfani, Katherine H. Shutta, Panagiotis Mandros, Jonas Fischer, Enakshi Saha, Soel Micheletti, Chen Chen, Marouen Ben Guebila, Camila M. Lopes-Ramos, John Quackenbush. bioRxiv 2024.11.05.622163; doi: https://doi.org/10.1101/2024.11.05.622163
@article {Fanfani2024.11.05.622163,
author = {Fanfani, Viola and Shutta, Katherine H. and Mandros, Panagiotis and Fischer, Jonas and Saha, Enakshi and Micheletti, Soel and Chen, Chen and Ben Guebila, Marouen and Lopes-Ramos, Camila M. and Quackenbush, John},
title = {Reproducible processing of TCGA regulatory networks},
elocation-id = {2024.11.05.622163},
year = {2024},
doi = {10.1101/2024.11.05.622163},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/11/07/2024.11.05.622163},
eprint = {https://www.biorxiv.org/content/early/2024/11/07/2024.11.05.622163.full.pdf},
journal = {bioRxiv}
}