VaQuERo v2

SARS-CoV-2 (Va)riant (Qu)antification in s(E)wage, designed for (Ro)bustness

Version History

The newer version of VaQuERo, denoted VaQuERo_v2.r allows to deconvolute also hierarchically nested variants. This allows to have a more fine grained lineages resolution despite the increasing diversity of simultaneously circulating lineages. We encourage all users to switch to the newest version of the tool and the marker mutation. The old version is kept for backward compability but no more actively maintained.

Synopsis

Rscript scripts/VaQuERo_v2.r > vaquero.log

Description

Takes allele frequencies in table formate and associated meta data and quantifies different virus variants as defined in provided marker mutation file. The workflow first calls 'detected' variants and subsequently quantifies the abundance of the detected variants based on the observed non-zero frequencies using a SIMPLEX regression.

Testing

For testing if all required packages are available, please run

time Rscript VaQuERo/scripts/VaQuERo_v2.r --debug TRUE

Usage

scripts/VaQuERo_v2.r [options]

Options

--dir=CHARACTER
	Directory to write results [default: ExampleOutput]

--country=CHARACTER
	Name of country used to produce map [default: Austria]

--bbsouth=CHARACTER
	Bounding box most south point [default: 46.38]

--bbnorth=CHARACTER
	Bounding box most norther point [default: 49.01]

--bbwest=CHARACTER
	Bounding box most western point [default: 9.53]

--bbeast=CHARACTER
	Bounding box most easter point [default: 17.15]

--metadata=CHARACTER
	Path to meta data input file [default: data/metaDataSub.tsv]

--marker=CHARACTER
	Path to marker mutation input file [default: resources/mutations_list.csv]

--smarker=CHARACTER
	Path to special mutation input file [default: resources/mutations_special.csv]

--pmarker=CHARACTER
            Path to problematic mutations input file, which will be omitted throughout [default: resources/mutations_problematic_all.csv]

--data=CHARACTER
	Path to data input file [default: data/mutationDataSub.tsv]
	
--data2=CHARACTER
	Deprecated, for backwards compatibility only. Path to data input file in deprected old sparse table file format. Only effective together with --inputformat=sparse. [default: data/mutationDataSub.tsv]

   --inputformat=CHARACTER
            For backwards compatibility, set to 'sparse' to use input from --data2. Description of deprecated sparse table file format see file deprecated_input_specification.md. [default: tidy]

--plotwidth=CHARACTER
	Base size of plot width [default: 8]

--plotheight=CHARACTER
	Base size of plot height [default: 4.5]

--ninconsens=CHARACTER
	Minimal fraction of genome covered by reads to be considered (0-1) [default: 0.4]

--zero=DOUBLE
	Minimal allele frequency to be considered [default: 0.02]

--depth=CHARACTER
	Minimal depth at mutation locus to be considered [default: 75]

--recent=CHARACTER
	How old (in days) most recent sample might be to be still considered in overview maps [default: 99]

--plottp=CHARACTER
	Produce timecourse plots only if more than this timepoints are available [default: 3]

--minuniqmark=CHARACTER
	Minimal absolute number of uniq markers that variant is considered detected [default: 3]

--minuniqmarkfrac=CHARACTER
	Minimal fraction of uniq markers that variant is considered detected [default: 0]

--minqmark=CHARACTER
	Minimal absolute number of markers that variant is considered detected [default: 3]

--minmarkfrac=CHARACTER
	Minimal fraction of markers that variant is considered detected [default: 0]

--smoothingsamples=CHARACTER
	Number of previous timepoints use for smoothing [default: 1]

--smoothingtime=CHARACTER
	Previous timepoints for smoothing are ignored if more days than this days apart [default: 8]

--voi=CHARACTER
	List of variants which should be plotted in more detail. List separated by semicolon [default: B.1.1.7;B.1.617.2;P.1;B.1.351]

--highlight=CHARACTER
	List of variants which should be plotted at the bottom axis. List separated by semicolon [default: B.1.1.7;B.1.617.2]
--colorBase=CHARACTER
	List of base variants. All offsprings of these variants will be assigned with similar colors in the respective plots [default: BA.1;BA.2;BA.5]

-h, --help
	Show help message and exit.
	
--debug=LOGIC
	Set to TRUE to use provided input example file [default: FALSE].

Input

allele frequency file

A TAB separated tidy table, specifying for each mutation and each sample its allele frequency and sequencing depth. The helper script vcf2tsv_long.py assists to create this file directly from a list of vcf files.

Col	HEADER	EXAMPLE VALUE	DESCRIPTION
$1	SAMPLEID	SAMPLE1	sample name
$2	CHROM	NC_045512.2	chromosome name
$3	POS	3002	position
$4	REF	G	reference base
$5	ALT	T	alternative base
$6	GENE	ORF1ab	annotation gene
$7	AA	E913stop	AA substitution
$8	AF	0.997152	Allele frequency
$9	DP	16855	Sequencing depth
$10	PQ	49314	positional quality

A previously used sparse table file format is still supported for backwards compatibility. See --data2 and --inputformat for details how to invoke it. See the file deprecated_input_specification.md for a specification of this file format.

vcf2tsv_long.py

vcf2tsv_long.py needs the package pysam to be installed (eg. via pip or conda).

It takes a vcf file, a directory with vcfs, or a list with paths to vcf files as an input and creates a tidy table fit as input for VaQuERo.

You can test it with the vcf files in data/vcf2tsv in the following ways on files with and without SNPEFF annotations, merged samples, all vcf files in a directory or a list of files:

python3 scripts/vcf2tsv_long.py -i data/vcf2tsv/CoV_29633_S117400.vcf.gz | less
python3 scripts/vcf2tsv_long.py -i data/vcf2tsv/CoV_29634_S117319_ann.vcf | less
python3 scripts/vcf2tsv_long.py -i data/vcf2tsv/merged_samples.vcf.gz | less
python3 scripts/vcf2tsv_long.py -i data/vcf2tsv/ | less
python3 scripts/vcf2tsv_long.py -i data/vcf2tsv/vcf_files.txt | less

To redirect output to a new file, append to an existing file and filter with a certain minimal allele frequency (default 0.01):

python3 scripts/vcf2tsv_long.py -i data/vcf2tsv/merged_samples.vcf.gz -m 0.1 -o test_input.tsv.gz
python3 scripts/vcf2tsv_long.py -i data/vcf2tsv/CoV_29634_S117319_ann.vcf --append -m 0.1 -o test_input.tsv.gz

meta data file

The meta data file connects each sample as defined in the mutation file with the sampling location and time. Each samples included in allele frequency file must be specified here.

Col	HEADER	EXAMPLE VALUE	DESCRIPTION
$1	BSF_run	BSF_0895	Sequencing batch
$2	BSF_sample_name	SAMPLE1	Seq. sample name
$3	BSF_start_date	2021-01-18	Seq. date
$4	LocationID	ATTP_10-Krezlin	Sample loction ID
$5	LocationName	Krezlin	Sample location name
$6	N_in_Consensus	125	Nr. Of N in consensus seq
$7	RNA_ID_int	SAMPLE1	Sample name
$8	additional_information	Ct = xx.2	any add. info
$9	adress_town	Krezlin	Sampling Town
$10	connected_people	1234567	Nr. of conncected people
$11	dcpLatitude	48.50452605	latitude
$12	dcpLongitude	11.70669547	lingitude
$13	include_in_report	TRUE	should be reported (F
$14	report_category	fictional	included in which report
$15	sample_date	2020-12-20	sample date
$16	status	passed_qc	status (passed

marker mutation definition

The marker mutation file links the mutation expected to be seen in a variant (sensitivity) for all variants. Table is comma separated. If one mutation is sensitive for more than one variants, a list of variants (semicolon separated) can be provided in columns 1. Same format applies to special mutations which are used for plotting and problematic sites which are ignored during analysis.

Col	HEADER	EXAMPLE VALUE	DESCRIPTION
$1	Variants	AV.1;B.1.1.318	List of Variants (seperated by ";")
$2	Chromosom	NC_045512.2	Chromosome name
$3	Postion	21990	Chromosome position
$4	REF	TTTA	Reference base(s)
$5	ALT	T	Alternative base(s)
$6	Gene	S	Gene
$7	Sensitivities	1;0.87;0.96	Sensitivity in each of the variants
$8	AA	S:Y144del	Mutation in AA nomenclature
$9	NUC	TTTA21990T	Mutation in nucc nomenclature

Output

globalFittedData.csv

Table holding the deduced variant frequencies.

globalFullData.csv

Table holding the observed allele frequencies and the deduced variant frequencies.

globalSpecialmutData.csv

Table holding the observed allele frequencies and the allels specified in special mutation file.

summary.csv

List of all plots produced.

Figure directory

Timecourse plots and map represention of results, to be included in a report.

Install

The following R packages must be pre-installed:

tidyr
ggplot2
reshape2
dplyr
data.table
gamlss
ggmap
tmaptools
ggrepel
scales
betareg
ggspatial
sf
rnaturalearth
rnaturalearthdata
optparse
stringr

For convenience a Rscupt for package installation is provided (Rscript R_package_dependency_install.r)

fabou-uobaf/VaQuERo