Aperture: Alignment-free detection of structural variations and viral integrations in circulating tumor DNA
Aperture is a new alignment-free SV caller designed for cfDNA datasets. Aperture combines k-mer based searching, fast breakpoint detection using binary labels, and candidate clustering to detect SVs and viral integrations with high sensitivity, especially when junctions span repetitive regions. A barcode-based filter is then applied to ensure specificity. Aperture takes paired-end reads in FASTQ format as input and reports all SVs and viral integrations in VCF 4.2 format.
If you have any trouble running Aperture, please raise an issue using the Issues tab above.
To run Aperture, Java 1.8 or later must be installed on your system.
- CPU: Aperture does not require or benefit from any specific modern CPU feature, but more physical cores and a faster clock will significantly improve performance.
- Memory: Typically, Aperture needs 40 GB for index building and 30 GB for SV calling with the human genome (hg19 or hg38). The exact requirement depends on many factors, including the reference genome, sequencing depth, cfDNA insert size and sample quality.
Aperture takes an Aperture index and a set of cfDNA read files and outputs SV results in VCF format.
Pre-compiled binaries are available at https://github.com/liuhc8/Aperture/releases.
Aperture needs an indexed sequence file (in FASTA and FAI format) and a corresponding common SNP database (in VCF format) to build an Aperture index. If the FAI file is missing, you can use the `faidx` command in `samtools` to create one. Aperture outputs a set of 5 files with the suffixes `.ci`, `.tt`, `.km`, `.long.km` and `.spaced.km`. These files together constitute the index, and the original FASTA files are no longer used by Aperture once the index is built.
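Before running a long SV-calling job, it can help to confirm that all five index files are in place. The following Python sketch is not part of Aperture; it assumes each file is named by appending the suffix directly to the output prefix passed to `-O` (for example `aperture_hg19.ci`), which should be checked against your own index directory. The function name `missing_index_files` is hypothetical.

```python
from pathlib import Path

# The five index suffixes listed above.
INDEX_SUFFIXES = (".ci", ".tt", ".km", ".long.km", ".spaced.km")


def missing_index_files(prefix):
    """Return the expected index files that do not exist for `prefix`.

    Assumption: each index file is named <prefix><suffix>,
    e.g. aperture_hg19.ci for the prefix "aperture_hg19".
    """
    prefix = str(prefix)
    return [prefix + s for s in INDEX_SUFFIXES if not Path(prefix + s).is_file()]


if __name__ == "__main__":
    missing = missing_index_files("aperture_hg19")
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("index complete")
```

An empty list means every expected file was found under the assumed naming scheme.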
Human reference genome and the corresponding common SNP database can be downloaded here: hg19 hg38
Pre-built Aperture indexes for hg19 and hg38 are available here: hg19 hg38
A pre-built toy index including chr21 is available here: toy index
Usage: java -jar aperture.jar index -R <genome.fa> -V <snp.vcf> -O <out> -T <threads>
Argument | Description |
---|---|
-h,--help | Show help message |
-O,--out | Output path |
-R,--reference | Genome FASTA file with fai index |
-T,--threads | Number of threads |
-V,--vcf | Common SNPs database for the corresponding genome |
java -Xmx40g -jar aperture.jar index -R hg19.fa -V dbsnp_common_hg19.vcf -O aperture_hg19 -T 30
Aperture needs a pair of FASTQ files and an Aperture index as input. The output is in compressed VCF format (.vcf.gz). Aperture supports a barcode-based filter to ensure specificity. If your dataset was produced by high-depth sequencing and contains barcodes as unique molecular identifiers, use the parameters `-1BS`, `-2BS`, `-1BL`, `-2BL`, `-1S` and `-2S` to specify the location of the barcodes in each read.
The following diagram gives a brief introduction to barcode-related parameters:
Usage: java -jar aperture.jar call -1 <arg> -1BL <arg> -1BS <arg> -1S <arg> -2 <arg> -2BL <arg> -2BS <arg> -2S <arg> -D <arg> [-H] -I <arg> -P <arg> -T <arg>
Argument | Description |
---|---|
-1,--r1 | Path of R1.fq.gz |
-1BL,--r1BarLen | Length of barcode in R1 |
-1BS,--r1BarStart | Barcode start index in R1 (0-based) |
-1S,--r1InsStart | ctDNA fragment start index in R1 (0-based) |
-2,--r2 | Path of R2.fq.gz |
-2BL,--r2BarLen | Length of barcode in R2 |
-2BS,--r2BarStart | Barcode start index in R2 (0-based) |
-2S,--r2InsStart | ctDNA fragment start index in R2 (0-based) |
-D,--dir | Output path |
-H,--help | Show help message |
-I,--index | Path of Aperture index |
-P,--project | Project name |
-T,--threads | Number of threads |
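To make the 0-based barcode coordinates concrete, the sketch below slices a read the way the parameter descriptions suggest. It is an illustration only, not Aperture's code: the function `split_read` is hypothetical, the read sequences are made up, and it assumes the ctDNA fragment runs from the insert start index to the end of the read.

```python
def split_read(seq, bar_start, bar_len, ins_start):
    """Split a read into (barcode, insert) using 0-based coordinates.

    bar_start and bar_len mirror -1BS / -1BL (or -2BS / -2BL);
    ins_start mirrors -1S (or -2S).
    Assumption: the insert extends from ins_start to the end of the read.
    """
    if min(bar_start, bar_len, ins_start) < 0:
        raise ValueError("coordinates must be non-negative")
    barcode = seq[bar_start:bar_start + bar_len]
    insert = seq[ins_start:]
    return barcode, insert


# R1 with an 8 bp barcode at the start: -1BS 0 -1BL 8 -1S 8
print(split_read("ACGTACGT" + "TTGACCA", 0, 8, 8))  # ('ACGTACGT', 'TTGACCA')

# R2 without a barcode: -2BS 0 -2BL 0 -2S 0 (the whole read is insert)
print(split_read("GGCATTA", 0, 0, 0))  # ('', 'GGCATTA')
```

With a barcode length of 0 the barcode is empty and the whole read is treated as the fragment, which matches how the R2 parameters are set when only R1 carries a barcode.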
curl -L https://ndownloader.figshare.com/files/26914970 --output test_bar_R1.fq.gz
curl -L https://ndownloader.figshare.com/files/26914973 --output test_bar_R2.fq.gz
curl -L https://ndownloader.figshare.com/files/26914805 --output chr21.tar.gz
tar -vxf chr21.tar.gz
java -Xmx30g -jar aperture.jar call -1 test_bar_R1.fq.gz -2 test_bar_R2.fq.gz -I hg38_small -D ./ -P test -1BS 0 -2BS 0 -1BL 8 -2BL 0 -1S 8 -2S 0 -T 4
The expected output `test_toyindex_ap12.sv.vcf.gz` is available in the `example` folder of this repository.
The expected runtime of this test sample is about 15 seconds using 4 threads.
In Aperture, all SVs are described as breakends, so every record in an Aperture VCF carries the tag “SVTYPE=BND” in the INFO field.
Aperture VCF output follows the VCF 4.2 spec. All custom fields are described in the VCF header.
FILTER values:

ID | Description |
---|---|
LOW_QUAL | Low quality call |
FAKE_BP | False positive variant caused by imprecise k-mer based mapping |
SMALL_EVENT | Event size is smaller than the minimum reportable size |

INFO fields:

ID | Description |
---|---|
SVTYPE | Type of structural variant |
STRANDS | Strand orientation of the adjacency |
REFQUA | K-mer mapping quality of reference junction |
VARQUA | K-mer mapping quality of variant junction |
REFKMER | Average number of k-mers supporting the reference junction |
VARKMER | Average number of k-mers supporting the variant junction |
BPSEQQUA | Quality of sequence spanning breakpoint junction |
PARID | ID of partner breakend |
HOMLEN | Length of base pair identical micro-homology at event breakpoints |
HOMSEQ | Sequence of base pair identical micro-homology at event breakpoints |

FORMAT fields:

ID | Description |
---|---|
GT | Genotype (Not applicable) |
SR | Count of split reads supporting the breakpoint |
PE | Count of paired-end reads supporting the breakpoint |
REFSR | Count of split reads supporting the reference junction |
VARSR | Count of split reads supporting the variant junction |
BAR | Count of cfDNA molecules supporting the breakpoint |
UBAR | Count of cfDNA molecules with only one read support |
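The fields above can be read with a few lines of Python. The sketch below is a minimal, hedged example rather than an official parser: it assumes a single sample column, assumes that passing calls carry the standard VCF filter value `PASS`, and pairs breakends through the `PARID` INFO field. The function names `read_breakends` and `passing_pairs` are hypothetical.

```python
import gzip


def read_breakends(path):
    """Yield one dict per VCF data line (assumes a single sample column)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            cols = line.rstrip("\n").split("\t")
            info = {}
            for item in cols[7].split(";"):
                key, _, value = item.partition("=")
                info[key] = value
            # FORMAT keys (column 9) map onto the sample values (column 10).
            fmt = dict(zip(cols[8].split(":"), cols[9].split(":")))
            yield {"chrom": cols[0], "pos": int(cols[1]), "id": cols[2],
                   "filter": cols[6], "info": info, "format": fmt}


def passing_pairs(records):
    """Pair PASS breakends with their partners via INFO/PARID.

    Assumption: a passing record has FILTER == "PASS"; a pair is kept
    only if both partner records pass.
    """
    by_id = {r["id"]: r for r in records if r["filter"] == "PASS"}
    pairs = []
    for rec in by_id.values():
        mate = by_id.get(rec["info"].get("PARID"))
        # Report each pair once, ordered by record ID.
        if mate is not None and rec["id"] < mate["id"]:
            pairs.append((rec, mate))
    return pairs


# Usage, with the expected-output file name from the test example:
# for a, b in passing_pairs(read_breakends("test_toyindex_ap12.sv.vcf.gz")):
#     print(a["chrom"], a["pos"], b["chrom"], b["pos"], a["format"].get("BAR"))
```

Values are kept as strings, so counts such as `BAR` or `SR` need an explicit `int()` conversion before any numeric filtering.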
For citing Aperture and for an overview of the Aperture algorithms, refer to the following article:
Aperture: alignment-free detection of structural variations and viral integrations in circulating tumor DNA. Hongchao Liu, Huihui Yin, Guangyu Li, Junling Li, Xiaoyue Wang. Brief Bioinform. 2021;bbab290. doi:10.1093/bib/bbab290
See the publication page for links to the simulation datasets.