gghist.R does not exist in cbcrg/callings-with-gatk:latest

Step 6B of the RNA-seq variant calling pipeline walk-through uses the DockerHub container cbcrg/callings-with-gatk:latest. The image exists, but it does not contain the executable script gghist.R, which the process script calls on its last line, so the workflow fails.

Code

training/hands-on/final_main.nf

Lines 309 to 338 in d46bd35

    
    process prepare_vcf_for_ase {
        container 'cbcrg/callings-with-gatk:latest'
        tag "${sampleId}"
        publishDir "${params.results}/${sampleId}"
        input:
        tuple val(sampleId),
              path('final.vcf'),
              path('commonSNPs.diff.sites_in_files')
        output:
        tuple val(sampleId), path('known_snps.vcf'), emit: vcf_for_ASE
        path 'AF.histogram.pdf'                    , emit: gghist_pdfs
        script:
        '''
        awk 'BEGIN{OFS="\t"} $4~/B/{print $1,$2,$3}' commonSNPs.diff.sites_in_files  > test.bed
        vcftools --vcf final.vcf --bed test.bed --recode --keep-INFO-all \
                 --stdout > known_snps.vcf
        grep -v '#' known_snps.vcf | awk -F '\\t' '{print $10}' \
                    | awk -F ':' '{print $2}' | perl -ne 'chomp($_); \
                    @v=split(/\\,/,$_); if($v[0]!=0 ||$v[1] !=0)\
                    {print  $v[1]/($v[1]+$v[0])."\\n"; }' | awk '$1!=1' \
                    > AF.4R
        gghist.R -i AF.4R -o AF.histogram.pdf
        '''
    }

Documentation (first mention; repeated elsewhere in this file)

training/docs/hands_on/04_implementation.md

Lines 1346 to 1376 in d46bd35

    
    process prepare_vcf_for_ase {
        container 'cbcrg/callings-with-gatk:latest'
        tag "${sampleId}"
        publishDir "${params.results}/${sampleId}"
        input:
        tuple val(sampleId),
              path('final.vcf'),
              path('commonSNPs.diff.sites_in_files')
        output:
        tuple val(sampleId), path('known_snps.vcf'), emit: vcf_for_ASE
        path 'AF.histogram.pdf'                    , emit: gghist_pdfs
        script:
        '''
        awk 'BEGIN{OFS="\t"} $4~/B/{print $1,$2,$3}' \
            commonSNPs.diff.sites_in_files  > test.bed
        vcftools --vcf final.vcf --bed test.bed --recode --keep-INFO-all \
                 --stdout > known_snps.vcf
        grep -v '#'  known_snps.vcf | awk -F '\\t' '{print $10}' \
                    | awk -F ':' '{print $2}' | perl -ne 'chomp($_); \
                    @v=split(/\\,/,$_); if($v[0]!=0 ||$v[1] !=0)\
                    {print  $v[1]/($v[1]+$v[0])."\\n"; }' | awk '$1!=1' \
                    >AF.4R
        gghist.R -i AF.4R -o AF.histogram.pdf
        '''
    }

...Ah, my mistake: this script lives in the workflow's bin directory rather than in the container. It's only briefly mentioned in the setup chapter:

training/docs/hands_on/03_setup.md

Lines 23 to 48 in d46bd35

    
    ```bash
    /workspace/gitpod/hands-on
    hands-on
    ├── README.md
    ├── bin
    │   └── gghist.R
    ├── data
    │   ├── blacklist.bed
    │   ├── genome.fa
    │   ├── known_variants.vcf.gz
    │   └── reads
    │       ├── ENCSR000COQ1_1.fastq.gz
    │       ├── ENCSR000COQ1_2.fastq.gz
    │       ├── ENCSR000COQ2_1.fastq.gz
    │       ├── ENCSR000COQ2_2.fastq.gz
    │       ├── ENCSR000COR1_1.fastq.gz
    │       ├── ENCSR000COR1_2.fastq.gz
    │       ├── ENCSR000COR2_1.fastq.gz
    │       ├── ENCSR000COR2_2.fastq.gz
    │       ├── ENCSR000CPO1_1.fastq.gz
    │       ├── ENCSR000CPO1_2.fastq.gz
    │       ├── ENCSR000CPO2_1.fastq.gz
    │       └── ENCSR000CPO2_2.fastq.gz
    ├── final_main.nf
    └── nextflow.config
    ```

Maybe this could be made clearer?
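For anyone else who trips over this: as far as I understand, Nextflow puts the project's bin directory on the PATH of each task, and the script has to be executable for that to work. Below is a rough sketch of a pre-flight check that could be run before launching the workflow. The helper name check_bin_script is made up for illustration and is not part of the training material; it only inspects the filesystem and does not reproduce what Nextflow does internally.

```shell
# Hypothetical helper (not part of the training material): report whether
# a script is present and executable in <project_dir>/bin.
# Usage: check_bin_script <project_dir> <script_name>
check_bin_script() {
    candidate="$1/bin/$2"
    if [ ! -f "$candidate" ]; then
        # The file is absent from the bin directory
        echo "missing: $candidate" >&2
        return 1
    fi
    if [ ! -x "$candidate" ]; then
        # The file exists but lacks the executable bit
        echo "not executable: $candidate (try chmod +x)" >&2
        return 2
    fi
    echo "ok: $candidate"
}
```

With the layout from the setup chapter, this would be called as check_bin_script /workspace/gitpod/hands-on gghist.R. A return code of 1 means the file is missing from bin, and 2 means it is there but not executable.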

You're correct. It should be made more explicit. Thanks, @Xophmeister! 😉

Would you like to try opening a PR?
