Context

Mappability is a challenge in repetitive regions when analyzing NGS-derived data. Local duplication of genomic regions hampers read alignment in those parts of the genome because reads cannot be placed unambiguously. Such reads are usually assigned a random location among the possible ones, together with a low mapping quality. As a result, mutations are either missed (variant-supporting reads are dispersed among all possible alignment regions) or, even worse, called multiple times, producing false positives (enough reads support the variant in several alignment regions). For this reason, low mapping quality regions are often ignored in variant calling pipelines.

Armadillo is a somatic variant calling pipeline that targets repetitive regions in tumor-control paired samples in order to recover mutations that are systematically missed. To avoid both false positives and false negatives, it first finds all candidate regions for a repetitive region of interest (RROI). Then, it extracts the reads coming from all of those regions and aligns them to the RROI sequence only, instead of the whole genome, forcing all reads to be aligned together. This prevents the dilution of variant-supporting reads, as well as the false positives caused by mistakenly calling a mutation at multiple copies. Finally, the variant calling step is carried out. After the usual heuristic filters, we apply context-specific filters to prevent false positives related to the stacking step and, lastly, run a CNN model to classify the mutations.
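
The dilution effect described above can be illustrated with a toy simulation. This is only a sketch, not Armadillo's implementation: the function name `scatter_reads`, the read counts, the number of copies and the calling threshold are all hypothetical values chosen for illustration.

```python
import random
from collections import Counter


def scatter_reads(n_variant_reads, n_copies, seed=0):
    """Assign each variant-supporting read to a random copy of the repeat,
    mimicking how an aligner places multi-mapping reads."""
    rng = random.Random(seed)
    placements = Counter(rng.randrange(n_copies) for _ in range(n_variant_reads))
    # Return one count per copy, including copies that received no reads.
    return [placements.get(copy, 0) for copy in range(n_copies)]


# Hypothetical numbers: 6 variant-supporting reads, a repeat with 4 copies,
# and a caller that requires at least 5 supporting reads.
MIN_SUPPORT = 5
per_copy = scatter_reads(n_variant_reads=6, n_copies=4)
pooled = sum(per_copy)

print("supporting reads per copy:", per_copy)
print("supporting reads after pooling onto one copy:", pooled)
print("callable after pooling:", pooled >= MIN_SUPPORT)
```

Pooling always keeps the six reads together, whereas the per-copy counts depend on how the reads happened to be scattered.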

Armadillo data-prep

Armadillo data-prep is a small tool that simplifies building the reference file needed by Armadillo, our main tool. It is designed to be chained with getFasta and BLAT, so that the reference file can be built from just a BED file and a FASTA file.
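
A minimal sketch of that pipeline, assuming `getFasta` refers to `bedtools getfasta` and that BLAT is run against the same genome with `-out=blast8`. The helper name `build_commands` and the file names are hypothetical; the commands are only assembled here, not executed.

```python
import shlex


def build_commands(bed, fasta, prefix):
    """Return the two command lines that turn a BED + FASTA pair into a blast8 file."""
    roi_fasta = f"{prefix}.fa"
    blast8 = f"{prefix}.blast8"
    # Step 1: extract the ROI sequences from the genome (assumes bedtools getfasta).
    getfasta = ["bedtools", "getfasta", "-fi", fasta, "-bed", bed, "-fo", roi_fasta]
    # Step 2: search the genome for copies of each ROI, writing blast8 output.
    blat = ["blat", fasta, roi_fasta, "-out=blast8", blast8]
    return getfasta, blat


for command in build_commands("rois.bed", "hg19.fa", "rois"):
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once both tools are installed; the resulting blast8 file is then used as the `blast8` input of this tool.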

More information on the pipeline can be found in our GitHub repository or in our paper.

Inputs

Required inputs

blast8: Input file. BLAT output in blast8 format for all the regions needed.
reference: FASTA file of the species genome.

Optional inputs

referenceIdx: Index of the reference (fai file). Providing it is recommended to shorten the run time.
identity: Minimum identity for a hit to be considered a copy of a region of interest (ROI).
lendiff: Maximum length difference (in %) allowed between each hit and the input sequence.
mlen: Minimum length of a ROI.
outputName: Basename of the output file.
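
The identity, lendiff and mlen filters could be applied to a blast8 file roughly as follows. This is a sketch, not the tool's actual code: the function names, the default thresholds, and the assumption that ROI names follow the `chrom:start-end` pattern (used here to recover the length of the input sequence) are all illustrative.

```python
import csv
import io


def roi_length(query_name):
    """Length of a ROI named 'chrom:start-end' (assumed getFasta-style naming)."""
    start, end = query_name.rsplit(":", 1)[1].split("-")
    return int(end) - int(start)


def filter_hits(handle, identity=90.0, lendiff=15.0, mlen=100):
    """Yield (query, subject, s_start, s_end) for blast8 hits passing the three filters."""
    for row in csv.reader(handle, delimiter="\t"):
        # blast8 columns: query, subject, %identity, ..., s.start (9th), s.end (10th)
        query, subject, pct_identity = row[0], row[1], float(row[2])
        s_start, s_end = sorted((int(row[8]), int(row[9])))
        query_len = roi_length(query)
        hit_len = s_end - s_start + 1  # blast8 coordinates are 1-based, inclusive
        diff_pct = abs(hit_len - query_len) / query_len * 100
        if query_len >= mlen and pct_identity >= identity and diff_pct <= lendiff:
            yield query, subject, s_start, s_end


# Two made-up hits for one 200 bp ROI: a near-identical copy and a short, divergent one.
example = io.StringIO(
    "chr1:1000-1200\tchr1\t99.5\t200\t1\t0\t1\t200\t5001\t5200\t1e-100\t380\n"
    "chr1:1000-1200\tchr2\t82.0\t120\t20\t2\t1\t120\t300\t419\t1e-20\t90\n"
)
kept = list(filter_hits(example))
print(kept)
```

Only the first hit is kept here; the second one fails both the identity and the length-difference thresholds.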

Outputs

armadillo_data: Armadillo reference file in gz format.

Example

Copy the reference and ROIs to BatchX

bx cp rois.bed hg19.fa rois bx://test/

A list of precomputed ROIs is available in our GitHub repo for hg19 or here for hg38.

Submit job

bx submit -v=1 -m=4000 uniovi@labxa/armadillo-dataprep:1.0.0 \
'{
  "blast8": "test",
  "reference": "/test/hg19.fa"
}'

Tools version

Samtools (built with v.1.9)
Python3 (built with v.3.7)

Authors

Pablo Bousquets Muñoz - bousquetspablo@uniovi.es
Ander Díaz Navarro
Xose Antón Suárez Puente

pbousquets/armadillo-dataprep-batchx