nextflow-io/training

gghist.R does not exist in cbcrg/callings-with-gatk:latest

Xophmeister opened this issue · 2 comments

Step 6B in the RNA-seq variant calling pipeline work-through uses the DockerHub container cbcrg/callings-with-gatk:latest. This image exists, but it does not contain the executable script gghist.R, which is used in the process script. Thus the workflow fails.

Code

```groovy
process prepare_vcf_for_ase {
    container 'cbcrg/callings-with-gatk:latest'
    tag "${sampleId}"
    publishDir "${params.results}/${sampleId}"

    input:
    tuple val(sampleId),
          path('final.vcf'),
          path('commonSNPs.diff.sites_in_files')

    output:
    tuple val(sampleId), path('known_snps.vcf'), emit: vcf_for_ASE
    path 'AF.histogram.pdf', emit: gghist_pdfs

    script:
    '''
    awk 'BEGIN{OFS="\t"} $4~/B/{print $1,$2,$3}' commonSNPs.diff.sites_in_files > test.bed
    vcftools --vcf final.vcf --bed test.bed --recode --keep-INFO-all \
        --stdout > known_snps.vcf
    grep -v '#' known_snps.vcf | awk -F '\\t' '{print $10}' \
        | awk -F ':' '{print $2}' | perl -ne 'chomp($_); \
        @v=split(/\\,/,$_); if($v[0]!=0 ||$v[1] !=0)\
        {print $v[1]/($v[1]+$v[0])."\\n"; }' | awk '$1!=1' \
        > AF.4R
    gghist.R -i AF.4R -o AF.histogram.pdf
    '''
}
```

Documentation (first mention; repeated elsewhere in this file)

```groovy
process prepare_vcf_for_ase {
    container 'cbcrg/callings-with-gatk:latest'
    tag "${sampleId}"
    publishDir "${params.results}/${sampleId}"

    input:
    tuple val(sampleId),
          path('final.vcf'),
          path('commonSNPs.diff.sites_in_files')

    output:
    tuple val(sampleId), path('known_snps.vcf'), emit: vcf_for_ASE
    path 'AF.histogram.pdf', emit: gghist_pdfs

    script:
    '''
    awk 'BEGIN{OFS="\t"} $4~/B/{print $1,$2,$3}' \
        commonSNPs.diff.sites_in_files > test.bed
    vcftools --vcf final.vcf --bed test.bed --recode --keep-INFO-all \
        --stdout > known_snps.vcf
    grep -v '#' known_snps.vcf | awk -F '\\t' '{print $10}' \
        | awk -F ':' '{print $2}' | perl -ne 'chomp($_); \
        @v=split(/\\,/,$_); if($v[0]!=0 ||$v[1] !=0)\
        {print $v[1]/($v[1]+$v[0])."\\n"; }' | awk '$1!=1' \
        >AF.4R
    gghist.R -i AF.4R -o AF.histogram.pdf
    '''
}
```

...Ah, my mistake, this script should be in the workflow's bin directory. It's briefly mentioned in the setup chapter:

```bash
/workspace/gitpod/hands-on
hands-on
├── README.md
├── bin
│   └── gghist.R
├── data
│   ├── blacklist.bed
│   ├── genome.fa
│   ├── known_variants.vcf.gz
│   └── reads
│       ├── ENCSR000COQ1_1.fastq.gz
│       ├── ENCSR000COQ1_2.fastq.gz
│       ├── ENCSR000COQ2_1.fastq.gz
│       ├── ENCSR000COQ2_2.fastq.gz
│       ├── ENCSR000COR1_1.fastq.gz
│       ├── ENCSR000COR1_2.fastq.gz
│       ├── ENCSR000COR2_1.fastq.gz
│       ├── ENCSR000COR2_2.fastq.gz
│       ├── ENCSR000CPO1_1.fastq.gz
│       ├── ENCSR000CPO1_2.fastq.gz
│       ├── ENCSR000CPO2_1.fastq.gz
│       └── ENCSR000CPO2_2.fastq.gz
├── final_main.nf
└── nextflow.config
```

Maybe this could be made clearer?
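
For anyone else who hits this: Nextflow adds the `bin` directory that sits next to the pipeline script to the `PATH` of every task, so `gghist.R` resolves without being installed in the container, as long as the file is there and executable. Below is a rough sketch of a check you could run before launching the workflow. It assumes the layout shown in the tree above; the function name `check_bin_script` is made up for illustration and is not part of the training material.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the training repo): confirm that a helper
# script exists and is executable in <project_dir>/bin before running Nextflow.
check_bin_script() {
    local project_dir="$1"   # e.g. /workspace/gitpod/hands-on
    local script_name="$2"   # e.g. gghist.R
    local script_path="${project_dir}/bin/${script_name}"

    if [ ! -f "$script_path" ]; then
        echo "missing: ${script_path}"
        return 1
    fi
    if [ ! -x "$script_path" ]; then
        echo "not executable: ${script_path} (try: chmod +x)"
        return 2
    fi
    echo "ok: ${script_path}"
}
```

Calling `check_bin_script /workspace/gitpod/hands-on gghist.R` should report `ok` when the setup matches the tree; a `missing` or `not executable` result would produce the same "command not found" style failure described in this issue.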

You're correct. It should be made more explicit. Thanks, @Xophmeister! 😉

Would you like to try opening a PR for it?