Insights required in pipeline arguments and a feature request
Rohit-Satyam opened this issue · 5 comments
What happened?
Hi developers. We have COVID amplicon sequencing data obtained with ARTIC V4.1 primers and basecalled with Guppy version 5.1.13. My queries are:

- Which arguments in the `wf-artic` pipeline do I need to alter to get a GISAID-ready assembly?
- I am unable to understand the utility of the `--normalise` argument. Should I change it or use the default? In some workflows, such as McGill's documentation, this value is set to 800.
- I wish to understand what filtering criteria `vcf_filter.py` applies to the variant calls. Would it be valid to use `gatk VariantFiltration` on these filtered files (using the following code)? I am asking because we have both Illumina and ONT runs for a few samples and we wish to check the overlap of variants.
```shell
gatk SelectVariants -V ${vcf.toRealPath()} --select-type-to-include SNP -O ${sid}_snps.vcf
gatk SelectVariants -V ${vcf.toRealPath()} --select-type-to-include INDEL -O ${sid}_indels.vcf
gatk VariantFiltration -R ${params.ref} -V ${sid}_snps.vcf --filter-expression \"QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0\" -filter-name \"bad_SNP\" -O ${sid}_filtered.snps.vcf
gatk VariantFiltration -R ${params.ref} -V ${sid}_indels.vcf --filter-expression \"QD < 2.0 || FS > 200.0 || SOR > 10.0 || ReadPosRankSum < -20.0\" -filter-name \"bad_INDEL\" -O ${sid}_filtered.indels.vcf
gatk SelectVariants --exclude-filtered true -V ${sid}_filtered.indels.vcf -O ${sid}_good_indels.vcf
gatk SelectVariants --exclude-filtered true -V ${sid}_filtered.snps.vcf -O ${sid}_good_snps.vcf
gatk MergeVcfs -I ${sid}_good_snps.vcf -I ${sid}_good_indels.vcf -O ${sid}.mergeSNPsIndels.vcf
```
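Since the stated goal is checking the overlap of variants between the Illumina and ONT runs, a simple key-based intersection may be enough once both call sets are filtered. `bcftools isec` is the usual tool for this; the pure-shell sketch below only illustrates the idea, and the VCF records in it are toy stand-ins, not real calls:

```shell
# Toy stand-ins for the two platforms' filtered call sets (records are hypothetical)
printf '#CHROM\tPOS\tID\tREF\tALT\nMN908947.3\t241\t.\tC\tT\nMN908947.3\t3037\t.\tC\tT\n' > ont.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\nMN908947.3\t241\t.\tC\tT\nMN908947.3\t14408\t.\tC\tT\n' > illumina.vcf

# Key each record by CHROM:POS:REF:ALT, then intersect the sorted key lists
grep -v '^#' ont.vcf      | awk '{print $1":"$2":"$4":"$5}' | sort > ont.keys
grep -v '^#' illumina.vcf | awk '{print $1":"$2":"$4":"$5}' | sort > illumina.keys
comm -12 ont.keys illumina.keys   # variants called on both platforms
```

Keying on `CHROM:POS:REF:ALT` sidesteps differences in INFO/FORMAT annotations between the two callers, though indel representation may still need normalising (e.g. `bcftools norm`) before comparison.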
Operating System
Ubuntu 20.04
Workflow Execution
Command line
Workflow Execution - EPI2ME Labs Versions
No response
Workflow Execution - Execution Profile
Docker
Workflow Version
0.3.18
- When I try `--medaka_model` with the values `r941_min_fast_g507` or `r941_min_fast_g303`, I get the following error:

```
ERROR: Validation of pipeline parameters failed!
* --medaka_model: r941_min_fast_g507 is not a valid enum value (r941_min_fast_g507)
```

However, it runs fine with `r941_min_fast_variant_g507`. Why so?
5. `Feature Request:` It would be really helpful to have an option to get all files generated by nextclade (especially the `tsv` file rather than the `.json`) for sharing with non-informatics members of the lab.
Hi @mattdmem, can you please answer the aforementioned queries?
- This is not something I have ever done, so I cannot provide guidance. You should follow their requirements.
- The default should be left alone. It is incorrect to set a value of 800; the variant calling models are untested at such high depth.
- `vcf_filter.py` is inherited from the original field-bioinformatics codebase. It provides some light filtering to remove the most egregious calls.
- The models that you have illustrated lead to an error because they are for consensus calling of reads with a draft assembly scaffold; they are not appropriate for variant calling.
- The additional files can be found in the workspace directory; we do not publish them to the output directory as they are not commonly useful.
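For anyone hunting for those intermediates: Nextflow keeps each task's outputs under hashed subdirectories of its work directory, so a `find` will locate the nextclade files. The directory names below are mocked up purely to illustrate the pattern (real hash names will differ):

```shell
# Mimic Nextflow's work/<hash-prefix>/<hash>/ task layout (illustrative names only)
mkdir -p work/4f/0a1b2c3d9e8f
touch work/4f/0a1b2c3d9e8f/nextclade.tsv
# Locate any nextclade outputs left in the workspace
find work -type f -name 'nextclade*'
```

Note that `nextflow clean` removes these intermediates, so copy anything you need out of `work/` first.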
So should I go with the `r941_min_fast_variant_g507` model or the default? We are using V4.1 primers and Guppy version 5.1.13.
Also, can we use this pipeline for Dengue virus?
If you used the fast variant of the basecaller then yes, that is a suitable model. Otherwise you may want the model with `fast` replaced by `high` in its name.
The workflow is specialised to SARS-CoV-2 by virtue of its handling of primer schemes.
Thanks for all the insights. One last query: I was running the pipeline today and I don't know why, but the docker profile isn't working. The singularity profile does run, but the report produced shows all samples as failed, with nothing in the consensus FASTA.
```shell
nextflow run epi2me-labs/wf-artic \
    --fastq /home/subudhak/Documents/COVID_Project/KFSHRC_ALL_ONT_SAMPLES/KFSHRC_Batch1_PLATE_1_Samples_49-96/fastq_pass/ \
    --out_dir /home/subudhak/Documents/COVID_Project/KFSHRC_ALL_ONT_SAMPLES/KFSHRC_Batch1_PLATE_1_Samples_49-96/wf-articresults \
    --scheme_version ARTIC/V4.1 \
    --medaka_model r941_min_fast_variant_g507 \
    --pangolin_version 4.1.2 --update_data true --report_detailed true -profile docker
```

```
N E X T F L O W ~ version 22.04.5
Unknown configuration profile: 'docker'
```
Update
I think this behaviour might be due to insufficient memory for running the analysis on all samples in parallel, because the same samples pass and generate a full report when run on the cluster. But is there a parameter like `maxjobs` that can be defined to limit the number of parallel jobs running?
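A note on capping parallelism: this is exposed by Nextflow itself rather than by the workflow. A minimal sketch of the relevant `nextflow.config` settings (the values here are arbitrary examples, not recommendations):

```groovy
// nextflow.config — limit how many tasks run concurrently
executor {
    queueSize = 4      // at most 4 tasks handled by the executor at once
}
process {
    maxForks = 4       // cap parallel instances of any single process
    memory   = '8 GB'  // optional: bound per-task memory
}
```

The queue size can also be set per run on the command line with `nextflow run ... -qs 4`.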