Deep learning-based structural variant filtering method

sv-channels

sv-channels is a deep learning workflow for filtering structural variants (SVs) in short-read alignment data using a one-dimensional Convolutional Neural Network (CNN). Currently, only deletions (DEL) called with Manta are supported. The workflow includes the following key steps:

Transform read alignments into channels

For each pair of SV breakpoints, a 2D NumPy array called a window-pair is constructed. Each window has shape [window_size, number_of_channels] and covers a genomic interval centered on the breakpoint position, with a context of [-window_size/2, +window_size/2]; window_size is 124 bp by default. From all the reads overlapping this interval, and from the corresponding segment of the reference sequence, number_of_channels channels are constructed, where each channel encodes a signal that can be used for SV calling. The list of channels can be found here. The two windows are joined into a window-pair with a buffer in between, a 2D array of zeros with shape [8, number_of_channels], to avoid artifacts caused by the CNN kernel passing over the interface between the two windows. The resulting window-pair has shape [window_size*2+buffer_size, number_of_channels]. Window-pairs are labelled DEL when the breakpoint positions overlap the DEL callset used as ground truth, and noDEL otherwise.
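The joining step can be sketched with NumPy as follows. This is a minimal illustration, not the sv-channels implementation: the channel count of 5 and the random window contents are placeholders, and `make_window_pair` is a hypothetical helper name.

```python
import numpy as np

WINDOW_SIZE = 124   # default window size from the text (bp)
BUFFER_SIZE = 8     # rows of zeros placed between the two windows
N_CHANNELS = 5      # hypothetical; the real number comes from the channel list


def make_window_pair(window_a: np.ndarray, window_b: np.ndarray,
                     buffer_size: int = BUFFER_SIZE) -> np.ndarray:
    """Join two breakpoint windows with a zero buffer in between."""
    if window_a.shape != window_b.shape:
        raise ValueError("both windows must have the same shape")
    n_channels = window_a.shape[1]
    buffer = np.zeros((buffer_size, n_channels), dtype=window_a.dtype)
    # Stack along the position axis: window A, buffer, window B.
    return np.concatenate([window_a, buffer, window_b], axis=0)


rng = np.random.default_rng(0)
window_a = rng.random((WINDOW_SIZE, N_CHANNELS), dtype=np.float32)
window_b = rng.random((WINDOW_SIZE, N_CHANNELS), dtype=np.float32)
window_pair = make_window_pair(window_a, window_b)
print(window_pair.shape)  # (256, 5), i.e. 124*2 + 8 positions
```

Rows 124 to 131 of the result are the zero buffer, so a kernel sliding across the join never mixes signal from the two breakpoints without passing over zeros first.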

Labelling

Window-pairs are labelled as DEL (a true deletion) or noDEL (a false positive call) based on the overlap of the DEL breakpoints of the window-pair with the truth set.
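One way to picture the labelling rule is the sketch below. It assumes that "overlap" means both breakpoints fall within a fixed distance of a truth-set deletion on the same chromosome; the `tolerance` parameter, its value of 50 bp, and the function name are illustrative assumptions, not the criteria used by `svchannels label`.

```python
from typing import Iterable, Tuple

Deletion = Tuple[str, int, int]  # (chromosome, start breakpoint, end breakpoint)


def label_deletion(candidate: Deletion, truth_set: Iterable[Deletion],
                   tolerance: int = 50) -> str:
    """Return 'DEL' if both breakpoints match a truth deletion, else 'noDEL'.

    `tolerance` (bp) is a hypothetical parameter, not taken from sv-channels.
    """
    chrom, start, end = candidate
    for t_chrom, t_start, t_end in truth_set:
        if (chrom == t_chrom
                and abs(start - t_start) <= tolerance
                and abs(end - t_end) <= tolerance):
            return "DEL"
    return "noDEL"


truth = [("chr1", 10_000, 12_000), ("chr2", 5_000, 5_800)]
print(label_deletion(("chr1", 10_020, 11_990), truth))  # DEL
print(label_deletion(("chr1", 30_000, 31_000), truth))  # noDEL
```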

Model training

The labelled window-pairs are used to train a 1D CNN to classify Manta SVs as either DEL (true deletions) or noDEL (false positives).
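To make "1D" concrete: the kernels slide along the position axis only and span all channels at each step. The NumPy sketch below shows a single convolution layer applied to one window-pair. It does not reproduce the sv-channels architecture; the kernel size, filter count, channel count, and random weights are all made up for illustration.

```python
import numpy as np


def conv1d_valid(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """1D convolution (cross-correlation, as in most CNN libraries), no padding.

    x:       [length, in_channels]
    kernels: [kernel_size, in_channels, n_filters]
    returns: [length - kernel_size + 1, n_filters]
    """
    kernel_size = kernels.shape[0]
    out_len = x.shape[0] - kernel_size + 1
    out = np.empty((out_len, kernels.shape[2]), dtype=x.dtype)
    for i in range(out_len):
        # Each output row sums over kernel_size positions and all channels.
        out[i] = np.tensordot(x[i:i + kernel_size], kernels,
                              axes=([0, 1], [0, 1]))
    return out


rng = np.random.default_rng(0)
window_pair = rng.random((256, 5))        # 124*2 + 8 positions, 5 hypothetical channels
kernels = rng.standard_normal((7, 5, 4))  # hypothetical: kernel size 7, 4 filters
features = np.maximum(conv1d_valid(window_pair, kernels), 0.0)  # ReLU activation
print(features.shape)  # (250, 4)
```

A real model would stack several such layers and end in a classifier that outputs the probability of DEL versus noDEL.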

Scoring Manta DELs

The model is run on the window-pairs of a test sample. The SV qualities for the Manta DELs (QUAL) of the test sample are substituted with the posterior probabilities obtained by the model.
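The QUAL substitution can be sketched on raw VCF lines as below. This is an illustration only: matching records by their ID column, the four-decimal formatting, and the function name are assumptions, and `svchannels score` may handle these details differently.

```python
from typing import Dict, Iterable, List


def rescore_vcf_lines(lines: Iterable[str],
                      probabilities: Dict[str, float]) -> List[str]:
    """Replace QUAL (6th VCF column) with a model probability, keyed by record ID."""
    rescored = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            rescored.append(line)  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        record_id = fields[2]      # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, ...
        if record_id in probabilities:
            fields[5] = f"{probabilities[record_id]:.4f}"
        rescored.append("\t".join(fields))
    return rescored


# Toy input; the record and its probability are invented for the example.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10000\tMantaDEL:1:0:0:0:0:0\tA\t<DEL>\t120\tPASS\tEND=12000;SVTYPE=DEL",
]
probs = {"MantaDEL:1:0:0:0:0:0": 0.9731}
for out_line in rescore_vcf_lines(vcf, probs):
    print(out_line)
```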

Dependencies

1. Clone this repo.

git clone https://github.com/GooglingTheCancerGenome/sv-channels.git
cd sv-channels

2. Install dependencies.

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
# create a new environment with dependencies & activate it
mamba env create -n sv-channels -f environment.yaml
conda activate sv-channels
# install svchannels CLI
python setup.py install

3. Execution.

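  • input:
    • read alignment in BAM format
    • reference genome used to map the reads in FASTA format
    • SV callset generated by Manta in VCF format
  • output:
    • the Manta SV callset in VCF format, with the QUAL values replaced by the model's posterior probabilities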

Run on test data

  1. Extract signals.
svchannels extract-signals reference.fasta sample.bam -o signals
  2. Convert VCF file (Manta callset) to BEDPE format.
Rscript svchannels/utils/R/vcf2bedpe.R -i manta.vcf -o manta.bedpe
  3. Generate channels.
svchannels generate-channels --reference reference.fasta signals channels manta.bedpe
  4. Use the pretrained model.

  5. Score SVs.

svchannels score channels model.keras manta.vcf sv-channels.vcf

Train a new model

  1. Extract signals.
svchannels extract-signals reference.fasta training_sample.bam -o signals
  2. Convert VCF files (Manta callset and truth set) to BEDPE format.
Rscript svchannels/utils/R/vcf2bedpe.R -i training_sample_ground_truth.vcf \
                                       -o training_sample_ground_truth.bedpe
Rscript svchannels/utils/R/vcf2bedpe.R -i training_sample_manta.vcf \
                                       -o training_sample_manta.bedpe
  3. Generate channels.
svchannels generate-channels --reference reference.fasta signals channels training_sample_manta.bedpe
  4. Label SVs.
svchannels label -f reference.fasta.fai -o labels channels/sv_positions.bedpe training_sample_ground_truth.bedpe
  5. Train the model.
svchannels train channels/channels.zarr.zip labels/labels.json.gz -m model.keras

If there are multiple training samples, steps 1-4 are repeated for each sample to generate its channels and labels. The channel and label files of all training samples are then passed as comma-separated arguments in step 5. See an example below:

svchannels train \
    channels_sample1/channels.zarr.zip,channels_sample2/channels.zarr.zip \
    labels_sample1/labels.json.gz,labels_sample2/labels.json.gz \
    -m model.keras
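When the number of samples grows, the comma-separated arguments can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of sv-channels; it only uses the `svchannels train` arguments and file names shown above and assumes one channels directory and one labels directory per sample.

```python
import shlex
from typing import List, Sequence, Tuple


def build_train_command(samples: Sequence[Tuple[str, str]],
                        model_path: str = "model.keras") -> List[str]:
    """Build the `svchannels train` argument list for one or more samples.

    Each sample is a (channels_dir, labels_dir) pair.
    """
    if not samples:
        raise ValueError("at least one training sample is required")
    channels = ",".join(f"{c}/channels.zarr.zip" for c, _ in samples)
    labels = ",".join(f"{l}/labels.json.gz" for _, l in samples)
    return ["svchannels", "train", channels, labels, "-m", model_path]


command = build_train_command([
    ("channels_sample1", "labels_sample1"),
    ("channels_sample2", "labels_sample2"),
])
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` to launch training.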

Note: For the purpose of CI testing, the same BAM file is used for both model training and testing.

Contributing

If you want to contribute to the development of sv-channels, see CONTRIBUTING.md.

License

Copyright (c) 2023, Netherlands eScience Center

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.