Project 1: needlestack variant calling
tdelhomme opened this issue · 4 comments
Run needlestack on TCGA data.
Given a cohort or a center, run first on a single gene, then on a whole BED file with parallelization.
Project source code and documentation are hosted here.
Todo list:
- Create a Dockerfile
- Run needlestack without Nextflow: bash script needlestack.sh
- Run needlestack on tumor-normal pairs: create a txt file listing the tumor-normal pairs (use the BAM file metadata to retrieve the TCGA barcodes)
- Parallelization: create a BED file and a script to merge the VCF files
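For the tumor-normal pairs item, a minimal Python sketch of how the pairing could be derived from TCGA barcodes. It relies on the barcode layout, where the first three fields identify the participant and the first two digits of the fourth field are the sample type code (01-09 tumor, 10-19 normal). How the barcodes are read from the BAM metadata is not covered here, and the function names, the input mapping and the two-column tab-separated output are assumptions for illustration, not the format needlestack necessarily expects.

```python
from collections import defaultdict


def sample_type(barcode):
    """Classify a TCGA barcode as 'tumor', 'normal' or None (other/unparsable)."""
    fields = barcode.split("-")
    if len(fields) < 4 or not fields[3][:2].isdigit():
        return None
    code = int(fields[3][:2])  # sample type code, e.g. 01 or 10
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return None


def pair_tumor_normal(bam_to_barcode):
    """Return (tumor_bam, normal_bam) tuples for BAMs sharing a participant.

    bam_to_barcode is a hypothetical dict {bam file name: TCGA barcode}.
    """
    by_patient = defaultdict(lambda: {"tumor": [], "normal": []})
    for bam, barcode in sorted(bam_to_barcode.items()):
        kind = sample_type(barcode)
        if kind is None:
            continue  # skip controls and malformed barcodes
        patient = "-".join(barcode.split("-")[:3])  # e.g. TCGA-02-0001
        by_patient[patient][kind].append(bam)

    pairs = []
    for patient in sorted(by_patient):
        group = by_patient[patient]
        # Every tumor BAM is paired with every normal BAM of the same patient.
        for tumor in group["tumor"]:
            for normal in group["normal"]:
                pairs.append((tumor, normal))
    return pairs


def format_pairs(pairs):
    """Render pairs as tab-separated lines (assumed layout: tumor, then normal)."""
    return "".join(f"{tumor}\t{normal}\n" for tumor, normal in pairs)
```

Patients with only a tumor or only a normal BAM are dropped silently; whether they should be reported instead is a choice to make for the real script.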
Maybe the needlestack Dockerfile on Docker Hub is OK for the bash version; this needs to be checked.
We created a new Dockerfile in needlestack/dev/bin, based on the needlestack Dockerfile, that adds a wget of the R script dependencies and of the hg19/hg38 chromosomeNames2UCSC.txt files.
To parallelize needlestack within a single task, we could perhaps use the scatter option, which is different from batch mode: https://docs.cancergenomicscloud.org/docs/about-parallelizing-tool-executions
https://docs.cancergenomicscloud.org/v1.0/blog/making-efficient-use-of-compute-resources#section-when-being-scattered-is-a-very-good-thing-optimising-a-whole-genome-analysis
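Whichever mechanism ends up being used (scatter or batch), the parallelization item comes down to splitting the BED file into chunks and merging the per-chunk VCF files afterwards. A rough Python sketch of both steps follows. The function names are hypothetical, and the merge is a plain concatenation that assumes every chunk was called on the same samples, so that all VCF headers are interchangeable, and that the chunks are passed in genomic order. A dedicated tool such as bcftools concat may be a better fit for the real script.

```python
def split_bed(bed_lines, n_chunks):
    """Split BED intervals into at most n_chunks lists, preserving file order."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Keep only interval records: drop blank lines, comments and track headers.
    records = [
        line for line in bed_lines
        if line.strip() and not line.startswith(("#", "track", "browser"))
    ]
    if not records:
        return []
    size = -(-len(records) // n_chunks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def merge_vcf(vcf_chunks):
    """Concatenate per-chunk VCF lines: header from the first chunk only.

    vcf_chunks is a list of lists of lines, one inner list per chunk VCF.
    """
    merged = []
    for index, lines in enumerate(vcf_chunks):
        for line in lines:
            if line.startswith("#"):
                if index == 0:
                    merged.append(line)  # keep a single copy of the header
            else:
                merged.append(line)  # variant records from every chunk
    return merged
```

Each chunk returned by split_bed would be written to its own BED file and given to one needlestack run; merge_vcf then takes the lines of the resulting VCF files in the same order.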