Examine a collection of parallel VCF files and produce consensus variant calls

Samla

Collect variants from parallel VCF files, apply heuristics, and produce a consensus set of variants.

Samla's --method gwa and --method gwa-ksp were developed to produce consensus variants from calls based on different types of sequencing runs of the same individual. Specifically, the input is three GATK all-sites VCF files based on three types of sequencing data: (1) reads from whole-genome-shotgun sequencing; (2) reads from sequencing following whole-genome amplification; and (3) both sets of reads combined.

Coverage of data type (1) may be insufficient for high-confidence genotyping, hence the creation of (2). However, coverage can be highly variable in (2). Because the error models for variant calling differ between (1) and (2), it is not correct to use genotypes from (3) without further investigation.

Samla produces a consensus all-sites VCF file after examining the combined strength of support for both reference-matching and variant sites using all three VCF files. The consensus VCF may contain modest gains in variants, on the order of a few percentage points, but both reference and variant sites will be of better quality: those with consistent support will be strengthened, while those with inconsistent support will be weakened.
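The idea of comparing support across the three call sets can be illustrated with a small C++ sketch. This is not samla's implementation: the `Support` enum and the `classify_support` function are hypothetical names introduced here, and the sketch looks only at whether each VCF calls a variant. The one rule borrowed from samla's own help text is that a site where one of G/W has a variant while the other does not is a distinct case (case 8 there). Samla's real heuristics also weigh qualities and filters, which are left out.

```cpp
// Hypothetical summary of how three call sets support a single site.
enum class Support {
    ConsistentVariant,    // G, W and A all call a variant
    ConsistentReference,  // G, W and A all match the reference
    GenomicWgaDisagree,   // one of G/W has a variant, the other does not
    PooledDisagrees       // G and W agree, but A differs from them
};

// g, w, a: whether the genomic (G), whole-genome-amplified (W) and pooled (A)
// VCFs call a variant at the site. Illustrative only; qualities and FILTER
// values, which samla also examines, are ignored in this sketch.
inline Support classify_support(bool g, bool w, bool a) {
    if (g != w) return Support::GenomicWgaDisagree;
    if (a != g) return Support::PooledDisagrees;
    return g ? Support::ConsistentVariant : Support::ConsistentReference;
}
```

Under these assumptions, `classify_support(true, true, true)` yields `Support::ConsistentVariant`, while `classify_support(true, false, true)` yields `Support::GenomicWgaDisagree`.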

This repository links to my local fork of the vcflib repository. If you are cloning this repository, get your local copy of the vcflib repository at the same time with

git clone --recursive https://github.com/douglasgscofield/samla.git

Once downloaded, build with

cd samla ; make

More information on usage is available via samla --help:

Usage:   samla [options] -r refnames.txt <in1.vcf> [ in2.vcf ... ]

Collect VCF results and produce a consensus list of variants.

NOTE: samla 0.1.1-1-g9f77aea-52 is under active development.

     --references FILE            file containing reference names in order [REQUIRED]
                                  should be in the same order as the VCF files
     --output FILE                output file name [default is stdout]
     --debug INT                  debug info level INT [0]
     --progress INT               print variants processed mod INT [0]

     --filter-annotate            annotate FILTER field with additional method-specific information
     --full-filter-annotate       annotate FILTER field with even more method-specific information
     --no-filter-annotate         do not annotate FILTER field, use only PASS and filters indicating failure described in the VCF header [set]

     --method METHOD              use combining method 'METHOD', only 'gwa' and 'gwa-ksp' are implemented

'gwa' method options:

     --gwa-window INT                 Lookback window size for mean quality, max 50 [5]
     --gwa-quality FLOAT              Minimum quality when combining both VQSR and LowQual variants [30]
                                      Specifying this option will set all the quality values below to the given value.
     --gwa-quality-ref FLOAT          Minimum quality when combining variants and call matches reference [20]
                                      Specifying this option will also set --gwa-lowqual-quality-ref and --gwa-mixed-quality-ref.
     --gwa-vqsr-quality FLOAT         Minimum quality when combining VQSR variants [30]
     --gwa-lowqual-quality FLOAT      Minimum quality when combining LowQual variants and call does not match reference [30]
     --gwa-lowqual-quality-ref FLOAT  Minimum quality to meet when combining LowQual variants and call matches reference [20]
     --gwa-mixed-quality FLOAT        Minimum quality when combining VQSR with LowQual variants [30]
     --gwa-mixed-quality-ref FLOAT    Minimum quality to meet when combining VQSR with LowQual variants and call matches reference [20]

     --gwa-force-consistency          Require G, W and A to agree on variant/no-variant for potentially ambiguous cases 4 and 5 [set]
     --gwa-no-force-consistency       Do not require G, W and A to agree on variant/no-variant for potentially ambiguous cases 4 and 5

     --gwa-disable-context-quality    Disables usage of context quality, qualities instead compared directly [set]
     --gwa-enable-context-quality     Enables usage of context quality
                                      NOTE: context quality is very experimental and may be incorrect, use at your own risk

     --gwa-vqsr-vqsr-fail             Mark as FAIL all cases having both genomic and WGA VQSR-filtered variants [set]
     --gwa-vqsr-vqsr-normal           PASS/FAIL cases having both genomic and WGA VQSR-filtered variants depending on culprits
     --gwa-lowqual-lowqual-fail       Mark as FAIL all cases having both genomic and WGA LowQual-filtered variants
     --gwa-lowqual-lowqual-normal     PASS/FAIL cases having both genomic and WGA LowQual-filtered variants depending on culprits [set]
     --gwa-mixed-fail                 Mark as FAIL all cases having both a VQSR-filtered and a LowQual-filtered variant [set]
     --gwa-mixed-normal               PASS/FAIL cases having both a VQSR-filtered and a LowQual-filtered variant
                                      --gwa-vqsr-lowqual-fail and --gwa-vqsr-lowqual-normal are synonyms for the --gwa-mixed-* options
     --gwa-case8-emit-all             For case 8 (one of G/W has a variant while the other does not), emit the A call for the site [set]
                                      When the quality of one library (often W) is poor, this option may help reduce bias at the
                                      probable cost of a slight increase in failing case 8 sites for samples with better libraries.
     --gwa-case8-no-emit-all          For case 8 (one of G/W has a variant while the other does not), emit the variant


The 'gwa-ksp' method sets the following options. Options appearing after this may make further changes to option values.

     --method gwa
     --gwa-quality 30
     --gwa-quality-ref 20
     --gwa-force-consistency
     --gwa-disable-context-quality
     --gwa-vqsr-vqsr-fail
     --gwa-mixed-fail
     --gwa-case8-emit-all


For methods 'gwa' and 'gwa-ksp', all VCF files must be specified using these options:

     --vcf-genomic FILE           VCF file containing genomic calls
     --vcf-wga FILE               VCF file containing whole-genome-amplified calls
     --vcf-all FILE               VCF file containing all (pooled) calls


     -h | -? | --help             help

Version:     samla 0.1.1-1-g9f77aea-52

Compiler:    g++ (MacPorts gcc48 4.8.3_4) 4.8.3
Build flags: -Ivcflib -Wall -D_FILE_OFFSET_BITS=64 -O0 -D_WITH_DEBUG -ggdb -g3 -fvar-tracking-assignments -fno-inline -fno-inline-small-functions -fno-eliminate-unused-debug-types
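
The help text above says that some 'gwa' quality options set several thresholds at once, and that options given after a preset such as 'gwa-ksp' may change its values. A minimal C++ sketch of that left-to-right behaviour follows. It is not taken from samla's source: `GwaThresholds`, `apply_option` and `apply_all` are hypothetical names, and the sketch assumes that "all the quality values below" in the description of --gwa-quality includes the -ref thresholds.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the 'gwa' quality thresholds,
// initialised with the defaults listed in the help text.
struct GwaThresholds {
    double quality = 30, quality_ref = 20;
    double vqsr_quality = 30;
    double lowqual_quality = 30, lowqual_quality_ref = 20;
    double mixed_quality = 30, mixed_quality_ref = 20;
};

// Apply one option; returns false for names this sketch does not know.
inline bool apply_option(GwaThresholds& t, const std::string& name, double v) {
    if (name == "--gwa-quality") {
        // Assumption: this sets every threshold, including the -ref ones.
        t.quality = t.quality_ref = t.vqsr_quality = v;
        t.lowqual_quality = t.lowqual_quality_ref = v;
        t.mixed_quality = t.mixed_quality_ref = v;
    } else if (name == "--gwa-quality-ref") {
        // Per the help text, also sets the lowqual and mixed -ref thresholds.
        t.quality_ref = t.lowqual_quality_ref = t.mixed_quality_ref = v;
    } else if (name == "--gwa-vqsr-quality") {
        t.vqsr_quality = v;
    } else if (name == "--gwa-lowqual-quality") {
        t.lowqual_quality = v;
    } else if (name == "--gwa-lowqual-quality-ref") {
        t.lowqual_quality_ref = v;
    } else if (name == "--gwa-mixed-quality") {
        t.mixed_quality = v;
    } else if (name == "--gwa-mixed-quality-ref") {
        t.mixed_quality_ref = v;
    } else {
        return false;
    }
    return true;
}

// Apply options in command-line order, so later options override earlier ones.
inline GwaThresholds apply_all(
        const std::vector<std::pair<std::string, double>>& opts) {
    GwaThresholds t;
    for (const auto& o : opts) apply_option(t, o.first, o.second);
    return t;
}
```

With this sketch, applying --gwa-quality 30, then --gwa-quality-ref 20, then --gwa-mixed-quality 35 leaves the VQSR threshold at 30, the mixed threshold at 35 and the mixed -ref threshold at 20. Under the stated assumption, giving --gwa-quality after --gwa-quality-ref would overwrite the -ref thresholds, which is why order matters.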