BayVarC is a novel variant caller designed specifically for liquid biopsy. It applies Bayesian inference to quantify the noise level in a locus-specific manner, which enables it to discriminate between technical noise and low-frequency cancer variants. Wet-lab conditions vary between laboratories, however, and our evaluation cannot be exhaustive. To fully exploit BayVarC’s strengths, we encourage users to train the BayVarC error model on data from their own experimental environment. During error modeling, BayVarC applies Bayesian inference, which allows it to train on a limited number of samples while still ensuring accurate estimation of the locus-specific error. BayVarC then compares the observed signal, indicative of a candidate variant, against the posterior error rate defined by the model, and applies a binomial test at a predefined significance level (alpha) to determine the nature of the observed signal.
- Python 3.7
- Numpy >=1.16
- Pandas >=0.23
- Samtools >=1.7
- Scipy
- Bedtools
To see the help for the program, run:
<path_to_bayvarc_folder>/bin/BayVarC-Model -h
<path_to_bayvarc_folder>/bin/BayVarC -h
To train the BayVarC error model on data from your own experimental environment, run a command like the following:
<path_to_bayvarc_folder>/bin/BayVarC-Model -c <training.sample.list> -p <model_prefix> -cut <frequency_cutoff> -m <mappability.feature.file> -t <trinucleotide.feature.file> -r <repeats.feature.file> -s <segmentDup.feature.file> -o <output_to_path>
The model generation test script is stored in the file Create_model_test.sh. The input data for the tests is stored in the Input data/ directory. The output model of the tests can be viewed in the output data/ directory; the position-specific errors for SNVs and InDels are stored in SNV_Model/<model_prefix>_snvs.posterior.xls and InDel_Model/<model_prefix>_indels.posterior.xls, respectively.
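To make the idea of a posterior, locus-specific error rate concrete, here is a minimal sketch. It assumes a Beta prior updated with pooled counts from the training samples; the prior family, the function name `posterior_error_rate`, and the prior parameters are illustrative assumptions, not BayVarC's actual model.

```python
# Sketch only: the Beta prior and its parameters are assumptions made for
# illustration; BayVarC's real error model may be parameterised differently.

def posterior_error_rate(alt_counts, depths, prior_alpha=1.0, prior_beta=1000.0):
    """Posterior mean error rate at one locus under a Beta-Binomial model.

    alt_counts: non-reference read counts at the locus, one per training sample
    depths:     total read depths at the locus, one per training sample
    """
    errors = sum(alt_counts)
    total = sum(depths)
    post_alpha = prior_alpha + errors
    post_beta = prior_beta + (total - errors)
    return post_alpha / (post_alpha + post_beta)


# Three hypothetical training samples observed at the same locus
rate = posterior_error_rate(alt_counts=[7, 5, 9], depths=[2089, 2114, 1950])
print(f"posterior mean error rate: {rate:.5f}")
```

With few training samples the prior dominates, which pulls the estimate towards the background rate; as more reads accumulate, the locus-specific counts take over.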
BayVarC runs the variant calling pipeline from an mpileup file, applying a binomial test at a predefined significance level (alpha) to classify the observed signal. The following is an example command:
<path_to_bayvarc_folder>/bin/BayVarC -i <mpileupfile> -s <sample> -m <error_model_dir> -p <error_model_prefix> -a <significance_level> -ins <insertion_length_threshold> -del <deletion_length_threshold> -d <minimum_total_depth> -c <minimum alt reads> -f <minimum mismatch frequency> -r <reference_file> -n <ncpu> -o <output_to_path>
The variant calling test script is stored in the file Variant_calling_test.sh. The output of the tests can be viewed in the output data/ directory; in the *filter.vcf files, mutations marked ‘PASS’ are accepted as confident somatic mutations.
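The binomial test described above can be sketched as follows. This is an illustration under stated assumptions rather than BayVarC's code: the function name `binomial_call` is hypothetical, the test is taken to be one-sided, and the error rate is assumed to come from the trained model for that locus.

```python
from scipy.stats import binom


def binomial_call(alt_reads, depth, error_rate, alpha=0.05):
    """One-sided binomial test of the observed alt reads against the error rate.

    Returns (p_value, is_significant). The p-value is P(X >= alt_reads) for
    X ~ Binomial(depth, error_rate), i.e. the chance that noise alone would
    produce at least this many non-reference reads.
    """
    # binom.sf(k, n, p) is P(X > k), so shift by one to include alt_reads itself
    p_value = float(binom.sf(alt_reads - 1, depth, error_rate))
    return p_value, p_value < alpha


# Hypothetical locus: 50 alt reads at depth 1000 with a modelled error rate of 0.1%
p_value, significant = binomial_call(50, 1000, 0.001, alpha=0.01)
print(p_value, significant)
```

A small p-value means the observed signal is unlikely to be explained by the modelled noise at that locus, so the candidate is kept for further filtering.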
For each sample, use samtools mpileup to generate the mpileup file, then convert it to the position-specific statistics file that BayVarC-Model takes as input. The file format is as follows:
CHR POSITION DEPTH REF A C T G
5 1295018 2089 T 7 0 0 10
5 1295019 2114 G 13 0 4 0
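The conversion from an mpileup line to this format could look roughly like the sketch below. It is not the converter shipped with BayVarC. It assumes tab-separated output, takes DEPTH directly from the mpileup depth column, and counts only non-reference bases, which is inferred from the example rows above where the reference base's own column is 0. The function names are hypothetical.

```python
BASES = "ACTG"  # column order used in the example above


def count_mismatches(pileup_bases):
    """Count non-reference bases in the read-bases column of one mpileup line."""
    counts = dict.fromkeys(BASES, 0)
    i, n = 0, len(pileup_bases)
    while i < n:
        ch = pileup_bases[i]
        if ch == "^":
            # Read start marker: the next character is a mapping quality, skip both
            i += 2
            continue
        if ch in "+-":
            # Indel: skip the length digits and the inserted/deleted sequence
            j = i + 1
            while j < n and pileup_bases[j].isdigit():
                j += 1
            i = j + int(pileup_bases[i + 1:j])
            continue
        base = ch.upper()
        if base in counts:  # '.', ',', '$', '*' and N are ignored
            counts[base] += 1
        i += 1
    return counts


def mpileup_to_stat_line(line):
    """Turn one mpileup line into 'CHR POSITION DEPTH REF A C T G' (tab-separated)."""
    chrom, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    counts = count_mismatches(bases)
    fields = [chrom, pos, depth, ref.upper()] + [str(counts[b]) for b in BASES]
    return "\t".join(fields)


# Toy mpileup line: 7 reads, one each of A, g and c as mismatches, one insertion
example = "5\t1295018\tT\t7\t..,A^]g$.+2AGc\tIIIIIII"
print(mpileup_to_stat_line(example))
```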
<trinucleotide.feature.file>
Each line includes four columns: chromosome, position, reference base, and trinucleotide context. The trinucleotide context can be extracted from the genome file. The file format is as follows:
CHR POSITION REF TNT
5 1295018 C GCC
5 1295019 C CCG
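As an illustration of how the TNT column relates to the genome sequence, the sketch below returns the reference base together with one flanking base on each side. The function name and the toy sequence are assumptions for the example; in practice the sequence would be read from the reference FASTA.

```python
def trinucleotide_context(sequence, position):
    """Return the base at 1-based `position` plus one flanking base on each side."""
    if position < 2 or position > len(sequence) - 1:
        raise ValueError("position has no flanking base on both sides")
    return sequence[position - 2:position + 1].upper()


# Toy sequence whose positions 3 and 4 mimic the two example rows above
toy = "AGCCGT"
print(trinucleotide_context(toy, 3))  # GCC
print(trinucleotide_context(toy, 4))  # CCG
```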
<mappability.feature.file>
Each line includes four columns: chromosome, position, reference base, and the mappability track value from UCSC. The file format is as follows:
CHR POSITION REF ABI
5 1295018 C 1
5 1295019 C 1
<repeats.feature.file>
Each line includes four columns: chromosome, position, reference base, and the repeat element annotation from UCSC. The file format is as follows:
CHR POSITION REF RepeatsMask
5 1295018 C NonRepeats
5 1295019 C NonRepeats
<segmentDup.feature.file>
Each line includes four columns: chromosome, position, reference base, and the segmental duplication annotation from UCSC. The file format is as follows:
CHR POSITION REF SegmentDup
5 1295018 C NonSegDup
5 1295019 C NonSegDup
The whole-genome feature files can be downloaded from UCSC; the ‘bedtools intersect’ command can then be used to extract position-specific features according to the BED files.
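For readers who want to see what the intersection step produces, here is a small pure-Python sketch that assigns a feature label to a position from BED intervals. It is not a replacement for bedtools. It assumes non-overlapping intervals within a chromosome and the usual BED convention of 0-based, half-open coordinates; the function names and the "SINE" record are made up for the example.

```python
from bisect import bisect_right


def build_index(bed_records):
    """Group (chrom, start, end, label) records by chromosome, sorted by start."""
    index = {}
    for chrom, start, end, label in bed_records:
        index.setdefault(chrom, []).append((start, end, label))
    for intervals in index.values():
        intervals.sort()
    return index


def feature_at(index, chrom, position, default):
    """Label of the interval covering the 1-based `position`, else `default`."""
    intervals = index.get(chrom, [])
    zero_based = position - 1  # BED coordinates are 0-based, half-open
    # Find the last interval whose start is <= zero_based
    i = bisect_right(intervals, (zero_based, float("inf"), "")) - 1
    if i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]:
        return intervals[i][2]
    return default


# Hypothetical repeat annotation covering 1-based positions 1295001..1295010
index = build_index([("5", 1295000, 1295010, "SINE")])
print(feature_at(index, "5", 1295005, "NonRepeats"))  # SINE
print(feature_at(index, "5", 1295018, "NonRepeats"))  # NonRepeats
```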
The code is freely available under the GNU license.
Dongxue Che
Cheng Yan
Yufei Yang