
SNVoter

Improving SNV detection from low-coverage nanopore sequencing data (<30x, and especially <20x).
Akbari, V., Garant, JM., O’Neill, K. et al. Megabase-scale methylation phasing using nanopore long reads and NanoMethPhase. Genome Biol 22, 68 (2021).

Note: SNVoter was tested on Clair and older Guppy versions (<v4.3) with the R9.4.1 high-accuracy (hac) 450 bp model. If you used Clair3 and/or newer Guppy basecaller versions and models (e.g. the super-accuracy model), SNVoter might not offer an improvement.


Installation

NOTE: The dependencies listed in environment.yaml must be installed before SNVoter can be used. SNVoter pins several of its dependencies to fixed versions in the environment.yaml file. Users are encouraged to use a conda or similar environment to isolate these packages from their default Python installation, then activate the environment and run SNVoter.
You can create the conda environment and install all dependencies by running these lines of code:

git clone https://github.com/vahidAK/SNVoter.git
cd SNVoter
conda env create -f envs/environment.yaml
conda activate snvoter

Now you can run the snvoter Python script from the SNVoter folder of the cloned repository:

python /path/to/cloned/SNVoterGithubRepo/snvoter.py -h

SNVoter Modules

prediction:

To predict if detected SNVs are true calls or false positives.

usage: snvoter prediction [-h] --input INPUT --bam BAM --reference REFERENCE
                          --output OUTPUT [--model_file MODEL_FILE]
                          [--mappingQuality MAPPINGQUALITY] [--depth DEPTH]
                          [--threads THREADS] [--chunk_size CHUNK_SIZE]

Predict based on a model.

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT, -i INPUT
                        The path to the input vcf file.
  --bam BAM, -b BAM     The path to the alignment bam file
  --reference REFERENCE, -r REFERENCE
                        The path to the reference file. File must be indexed
                        by samtools faidx.
  --output OUTPUT, -o OUTPUT
                        The path to the output directory and prefix for output
                        file.

optional arguments:
  --model_file MODEL_FILE, -mf MODEL_FILE
                        Path to the trained model. Default is
                        NA12878_20FC_model.h5
  --mappingQuality MAPPINGQUALITY, -mq MAPPINGQUALITY
                        Cutoff for filtering out low-quality mapped reads
                        from the bam file. Default is 0.
  --depth DEPTH, -d DEPTH
                        Cutoff for filtering out regions whose depth is too
                        low to compute frequencies. Default is >= 1.
  --threads THREADS, -t THREADS
                        Number of threads. Default is 4.
  --chunk_size CHUNK_SIZE, -cs CHUNK_SIZE
                        Chunk size. Default is 100.
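
For example, a minimal invocation might look like the following; all file names and the output prefix are placeholders, and the optional flags are those listed above:

# A placeholder run: filter out reads with mapping quality below 20 and use 8 threads.
snvoter prediction -i SNVs.vcf -b sample_sorted_indexed.bam -r reference_genome.fa -o sample_prediction -mq 20 -t 8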

extraction:

Extract features to train a new model.

usage: snvoter extraction [-h] --input INPUT --mode_status MODE_STATUS --bam
                          BAM --reference REFERENCE
                          [--mappingQuality MAPPINGQUALITY] [--depth DEPTH]
                          [--threads THREADS] [--chunk_size CHUNK_SIZE]

Extract mutation frequencies in 5-mer windows.

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT, -i INPUT
                        The path to the input vcf file.
  --mode_status MODE_STATUS, -ms MODE_STATUS
                        0 or 1. The status of the SNVs in the input vcf:
                        select 1 if the SNVs are true positives and 0 if
                        they are false positives.
  --bam BAM, -b BAM     The path to the alignment bam file
  --reference REFERENCE, -r REFERENCE
                        The path to the reference file. File must be indexed
                        by samtools faidx

optional arguments:
  --mappingQuality MAPPINGQUALITY, -mq MAPPINGQUALITY
                        Cutoff for filtering out low-quality mapped reads
                        from the bam file. Default is 0.
  --depth DEPTH, -d DEPTH
                        Cutoff for filtering out regions whose depth is too
                        low to compute frequencies. Default is >= 1.
  --threads THREADS, -t THREADS
                        Number of threads
  --chunk_size CHUNK_SIZE, -cs CHUNK_SIZE
                        Number of sites sent to each process for parallel
                        processing. Default is 50.

train:

To train a new model using extracted features.

usage: snvoter train [-h] --train TRAIN --test TEST --out_dir OUT_DIR
                     [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--plot]

train a new model

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --train TRAIN, -tr TRAIN
                        The path to the shuffled and ready file for training
  --test TEST, -te TEST
                        The path to the shuffled and ready file for testing.
  --out_dir OUT_DIR, -o OUT_DIR
                        Output directory and prefix for saving the model and
                        figures

optional arguments:
  --epochs EPOCHS, -e EPOCHS
                        Number of epochs. Default is 100
  --batch_size BATCH_SIZE, -batch BATCH_SIZE
                        batch size for model training. Default is 400.
  --plot, -plt          Select this option if you wish to output training
                        plots.

Tutorial

Variant Calling:

You first need to call variants using Clair.

You can call variants for each chromosome using the following command and then concatenate the per-chromosome files:

for i in chr{1..22} chrX chrY; do callVarBam --chkpnt_fn <path to model file> --ref_fn <reference_genome.fa> --bam_fn <sorted_indexed.bam> --ctgName $i --sampleName <your sample name> --call_fn $i".vcf" --threshold 0.2 --samtools <path to executable samtools software> --pypy <path to executable pypy > --threads <number of threads>
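
Once all per-chromosome runs have finished, the results can be combined into a single vcf. A minimal sketch, assuming the per-chromosome files are named chr1.vcf, chr2.vcf, ... as produced by the loop above:

# Keep the vcf header from the first file, then append the record (non-header) lines from every chromosome's file.
grep '^#' chr1.vcf > Clair.vcf
for i in chr{1..22} chrX chrY; do grep -v '^#' "$i.vcf" >> Clair.vcf; done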

For the full tutorial, please refer to the Clair page on GitHub.

NOTE: SNVoter needs an SNV-only vcf file as input, so keep only the SNVs from the Clair vcf file and use that as input (do not filter on variant call quality), for example:

awk '$4 != "." && $5 != "." && length($4) == 1 && length($5) == 1' Clair.vcf > SNVs_Clair.vcf

Improving SNV calling using SNVoter:

snvoter prediction -i <SNVs_Clair.vcf> -b <sorted_indexed.bam> -r <reference_genome.fa> -t number_of_threads -o output_prefix

It will produce two files.

1- The prediction file, which includes a prediction for each 5-mer. The first 10 columns come from the vcf file and the last eight columns indicate:

  • chrom: the chromosome name
  • pos_start: 0-based position of the 5-mer start
  • pos_end: 0-based position of the 5-mer end
  • pos: 0-based position of the SNV
  • window sequence from ref: sequence of window from reference
  • 5-mer sequence for window: 5-mer sequence for the reference window sequence
  • Coverage: this might differ from Clair's coverage, as SNVoter uses a different mapping-quality threshold
  • Prediction

2- The second file is the ready-to-use vcf file with weighted qualities in the 6th column. You can plot the distribution of the weighted quality to obtain an optimal threshold for filtering. The distribution usually looks like the following plots:

[Figures: quality distributions of 10x, 18x, and 22x coverage data]

The optimal threshold is the end of the first peak and start of the valley (highlighted regions).
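
Once you have picked a threshold from the distribution, you can filter the output vcf on the weighted quality in the 6th column. A minimal sketch (the output file name and the threshold value of 8 are placeholders, not recommendations):

# Keep header lines plus calls whose weighted quality (column 6) is at or above the chosen threshold.
awk -v t=8 '$1 ~ /^#/ || ($6 + 0) >= (t + 0)' snvoter_output.vcf > filtered_SNVs.vcf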

By default, SNVoter uses the model we trained on 20 NA12878 flow cells; you do not need to specify a path to the model if you want to use it.

Train a New Model:

In order to train a new model you need two vcf files: one with true SNVs and one with false positive SNV calls. With these data you can extract features using the snvoter extraction module and then train a new model on the extracted features using the snvoter train module.

Extracting Features:

snvoter extraction -i True_SNVs.vcf -b alignment.bam -r reference.fa -ms 1 -t 24 > Extracted_Features.csv
snvoter extraction -i False_SNVs.vcf -b alignment.bam -r reference.fa -ms 0 -t 24 >> Extracted_Features.csv

After extracting the features you need to shuffle the file.

cat Extracted_Features.csv | shuf | shuf | shuf > Shuffled_Extracted_Features.csv

Subsequently, separate the shuffled file into training and test sets as you wish. We recommend using at least 10% of the shuffled file as the test set and the rest as the training set.
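
For example, a simple 90/10 split could be done like this (file names match the training command below; the exact split is up to you):

# Hold out roughly 10% of the shuffled lines as the test set and use the rest for training.
total=$(wc -l < Shuffled_Extracted_Features.csv)
test_n=$(( total / 10 ))
head -n "$test_n" Shuffled_Extracted_Features.csv > test_set.csv
tail -n +"$(( test_n + 1 ))" Shuffled_Extracted_Features.csv > training_set.csv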

Training the Model

snvoter train -tr training_set.csv -te test_set.csv -o ./Trained_model --plot

Training will produce a .h5 file and a .h5.pkl file. If the --plot option is selected, it will also output accuracy, precision, recall, loss, and ROC plots. In order to use this model with the snvoter prediction module, the .h5 and .h5.pkl files must be in the same directory; give the path to the .h5 file using the --model_file flag.
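
For example, a sketch of a prediction run with the new model, assuming the training run above wrote ./Trained_model.h5 and ./Trained_model.h5.pkl (the exact file names depend on the prefix you supplied):

# File names other than the model path are placeholders carried over from the steps above.
snvoter prediction -i SNVs_Clair.vcf -b sorted_indexed.bam -r reference_genome.fa -t 4 -o new_output_prefix -mf ./Trained_model.h5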

Example

We have included example data in the Example_data folder which you can use for a quick prediction.