Insights required in pipeline arguments and a feature request
Rohit-Satyam opened this issue · 5 comments
What happened?
Hi developers. We have COVID amplicon sequencing data obtained with ARTIC V4.1 primers and basecalled with Guppy version 5.1.13. My queries are:

- Which arguments in the `wf-artic` pipeline do I need to alter to get a GISAID-ready assembly?
- I am unable to understand the utility of the `--normalise` argument. Should I change it or use the default? In some workflows, such as McGill's documentation, this value is set to 800.
- I wish to understand what filtering criteria `vcf_filter.py` applies to the variant calls. Would it be valid to use `gatk VariantFiltration` on these filtered files (using the following code)? I am asking because we have both Illumina and ONT runs for a few samples and we wish to check the overlap of variants.
```shell
gatk SelectVariants -V ${vcf.toRealPath()} --select-type-to-include SNP -O ${sid}_snps.vcf
gatk SelectVariants -V ${vcf.toRealPath()} --select-type-to-include INDEL -O ${sid}_indels.vcf
gatk VariantFiltration -R ${params.ref} -V ${sid}_snps.vcf --filter-expression \"QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0\" -filter-name \"bad_SNP\" -O ${sid}_filtered.snps.vcf
gatk VariantFiltration -R ${params.ref} -V ${sid}_indels.vcf --filter-expression \"QD < 2.0 || FS > 200.0 || SOR > 10.0 || ReadPosRankSum < -20.0\" -filter-name \"bad_INDEL\" -O ${sid}_filtered.indels.vcf
gatk SelectVariants --exclude-filtered true -V ${sid}_filtered.indels.vcf -O ${sid}_good_indels.vcf
gatk SelectVariants --exclude-filtered true -V ${sid}_filtered.snps.vcf -O ${sid}_good_snps.vcf
gatk MergeVcfs -I ${sid}_good_snps.vcf -I ${sid}_good_indels.vcf -O ${sid}.mergeSNPsIndels.vcf
```
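Since the stated goal is checking the overlap of variants between the Illumina and ONT runs, a simple key-based intersection may be enough once both call sets are filtered. `bcftools isec` is the usual tool for this; the pure-shell sketch below only illustrates the idea, and the VCF records in it are toy stand-ins, not real calls:

```shell
# Toy stand-ins for the two platforms' filtered call sets (records are hypothetical)
printf '#CHROM\tPOS\tID\tREF\tALT\nMN908947.3\t241\t.\tC\tT\nMN908947.3\t3037\t.\tC\tT\n' > ont.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\nMN908947.3\t241\t.\tC\tT\nMN908947.3\t14408\t.\tC\tT\n' > illumina.vcf

# Key each record by CHROM:POS:REF:ALT, then intersect the sorted key lists
grep -v '^#' ont.vcf      | awk '{print $1":"$2":"$4":"$5}' | sort > ont.keys
grep -v '^#' illumina.vcf | awk '{print $1":"$2":"$4":"$5}' | sort > illumina.keys
comm -12 ont.keys illumina.keys   # variants called on both platforms
```

Keying on `CHROM:POS:REF:ALT` sidesteps differences in INFO/FORMAT annotations between the two callers, though indel representation may still need normalising (e.g. `bcftools norm`) before comparison.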
Operating System
Ubuntu 20.04
Workflow Execution
Command line
Workflow Execution - EPI2ME Labs Versions
No response
Workflow Execution - Execution Profile
Docker
Workflow Version
0.3.18
- When I try `--medaka_model` with the values `r941_min_fast_g507` or `r941_min_fast_g303`, I get the following error:

```
ERROR: Validation of pipeline parameters failed!
* --medaka_model: r941_min_fast_g507 is not a valid enum value (r941_min_fast_g507)
```

However, it runs fine with `r941_min_fast_variant_g507`. Why so?
5. `Feature Request:` It would be really helpful to have an option to get all files generated by nextclade (especially the `tsv` file rather than the `.json`) for sharing with non-informatics members of the lab.
Hi @mattdmem, can you please answer the aforementioned queries?
- This is not something I have ever done, so I cannot provide guidance. You should follow their requirements.
- The default should be left alone. It is incorrect to set a value of 800; the variant calling models are untested at such high depth.
- `vcf_filter.py` is inherited from the original field-bioinformatics codebase. It provides some light filtering to remove the most egregious calls.
- The models that you have illustrated lead to an error because they are for consensus calling of reads with a draft assembly scaffold; they are not appropriate for variant calling.
- The additional files can be found in the workspace directory; we do not publish them to the output directory as they are not commonly useful.
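For anyone hunting for those intermediates: Nextflow keeps each task's outputs under hashed subdirectories of its work directory, so a `find` will locate the nextclade files. The directory names below are mocked up purely to illustrate the pattern (real hash names will differ):

```shell
# Mimic Nextflow's work/<hash-prefix>/<hash>/ task layout (illustrative names only)
mkdir -p work/4f/0a1b2c3d9e8f
touch work/4f/0a1b2c3d9e8f/nextclade.tsv
# Locate any nextclade outputs left in the workspace
find work -type f -name 'nextclade*'
```

Note that `nextflow clean` removes these intermediates, so copy anything you need out of `work/` first.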
So should I go with the `r941_min_fast_variant_g507` model or the default? We are using V4.1 primers and Guppy version 5.1.13.
Also, can we use this pipeline for Dengue virus?
If you used the fast variant of the basecaller then yes, that is a suitable model. Otherwise you may want the model with `fast` replaced by `high` in its name.
The workflow is specialised to SARS-CoV-2 by virtue of its handling of primer schemes.
Thanks for all the insights. One last query: I was running the pipeline today and I don't know why, but the docker profile isn't working. The singularity profile does run, but the report produced shows all samples as failed, with nothing in the consensus FASTA.
```shell
nextflow run epi2me-labs/wf-artic \
    --fastq /home/subudhak/Documents/COVID_Project/KFSHRC_ALL_ONT_SAMPLES/KFSHRC_Batch1_PLATE_1_Samples_49-96/fastq_pass/ \
    --out_dir /home/subudhak/Documents/COVID_Project/KFSHRC_ALL_ONT_SAMPLES/KFSHRC_Batch1_PLATE_1_Samples_49-96/wf-articresults \
    --scheme_version ARTIC/V4.1 \
    --medaka_model r941_min_fast_variant_g507 \
    --pangolin_version 4.1.2 --update_data true --report_detailed true -profile docker
```

```
N E X T F L O W ~ version 22.04.5
Unknown configuration profile: 'docker'
```
Update
I think this behaviour might be due to insufficient memory for running the analysis on all samples in parallel, because the same samples pass and generate a full report when run on the cluster. But is there a parameter like `maxjobs` that can be defined to limit the number of parallel jobs running?
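A note on capping parallelism: this is exposed by Nextflow itself rather than by the workflow. A minimal sketch of the relevant `nextflow.config` settings (the values here are arbitrary examples, not recommendations):

```groovy
// nextflow.config — limit how many tasks run concurrently
executor {
    queueSize = 4      // at most 4 tasks handled by the executor at once
}
process {
    maxForks = 4       // cap parallel instances of any single process
    memory   = '8 GB'  // optional: bound per-task memory
}
```

The queue size can also be set per run on the command line with `nextflow run ... -qs 4`.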