SVJedi is a structural variation (SV) genotyper for long read data. Using a representation of the different alleles, it estimates the genotype of each variant in a given individual sample from allele-specific alignment counts. SVJedi takes as input a variant file (VCF), a reference genome (FASTA) and a long read file (FASTA/FASTQ), and outputs the initial variant file with an additional column containing genotyping information (VCF).
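To illustrate how a genotype can be derived from allele-specific alignment counts, here is a minimal sketch. It assumes a simple binomial likelihood model with a fixed error rate; the function name, the `err` value and the `min_support` default are hypothetical and are not taken from the SVJedi source code.

```python
import math

def genotype_from_counts(ref_count, alt_count, min_support=3, err=0.05):
    """Pick the most likely genotype from allele-specific alignment counts.

    Illustrative assumptions (not SVJedi's actual model or defaults):
    - each informative alignment supports either the reference or the
      alternative allele;
    - a read supports the alternative allele with probability err, 0.5 or
      1 - err for genotypes 0/0, 0/1 and 1/1 respectively.
    """
    if ref_count + alt_count < min_support:
        return "./."  # not enough informative alignments to call a genotype

    p_alt = {"0/0": err, "0/1": 0.5, "1/1": 1.0 - err}

    def log_likelihood(p):
        # Binomial log-likelihood without the constant combinatorial term
        return alt_count * math.log(p) + ref_count * math.log(1.0 - p)

    return max(p_alt, key=lambda g: log_likelihood(p_alt[g]))


if __name__ == "__main__":
    print(genotype_from_counts(20, 0))
    print(genotype_from_counts(10, 10))
    print(genotype_from_counts(0, 15))
```

With too few informative alignments the sketch returns a missing genotype (`./.`), which mirrors the role of a minimum-support threshold.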
SVJedi processes deletions, insertions, inversions and translocations.
SVJedi is organized in three main steps:
- Generate representative allele sequences of a set of SVs given in a vcf file
- Map reads on previously generated allele sequences using Minimap2
- Genotype SVs and output a vcf file
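As an illustration of the first step, the sketch below builds a reference and an alternative allele sequence for a deletion by keeping flanking sequence on each side of the breakpoints. It is a simplified picture, not SVJedi's actual code: the function name, the `flank` size and the single-sequence representation of the reference allele are assumptions.

```python
def deletion_alleles(chrom_seq, pos, end, flank=5000):
    """Return (ref_allele, alt_allele) for a deletion.

    Assumptions for this sketch:
    - pos is the 1-based VCF POS (the base just before the deleted segment);
    - end is the 1-based position of the last deleted base;
    - flank is a hypothetical number of bases kept on each side.
    """
    left = chrom_seq[max(0, pos - flank):pos]  # up to and including POS
    deleted = chrom_seq[pos:end]               # bases POS+1 .. END
    right = chrom_seq[end:end + flank]         # bases after END

    ref_allele = left + deleted + right        # deleted segment present
    alt_allele = left + right                  # deleted segment removed
    return ref_allele, alt_allele


if __name__ == "__main__":
    ref, alt = deletion_alleles("AAAACCCCGGGGTTTT", pos=4, end=8, flank=4)
    print(ref)
    print(alt)
```

Reads are then mapped to both sequences, and a read that aligns across the junction of one allele supports that allele.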
Jedi comes from the Breton verb jediñ ['ʒeːdɪ], which means "to calculate".
- Python3
- Minimap2
- NumPy
- Biopython
python3 svjedi.py -v <set_of_sv.vcf> -r <refgenome.fasta> -i <long_reads.fastq>
Note: Chromosome names in reference.fasta and in set_of_sv.vcf must be the same. Also, the SVTYPE tag must be present in the VCF (SVTYPE=DEL or SVTYPE=INS or SVTYPE=INV or SVTYPE=BND).
More details are given in SV representation in VCF.
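Both conditions in the note can be checked before running SVJedi. The following is a minimal sketch of such a pre-flight check; the helper names are hypothetical and the script is not part of SVJedi. It works on lists of text lines so that it can be tried without any files.

```python
ALLOWED_SVTYPES = {"DEL", "INS", "INV", "BND"}

def fasta_names(fasta_lines):
    """Collect sequence names (first word after '>') from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }

def check_vcf(vcf_lines, ref_names):
    """Return a list of (line_number, message) for records that break the note."""
    problems = []
    for number, line in enumerate(vcf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        chrom, info = fields[0], fields[7]
        tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        if chrom not in ref_names:
            problems.append((number, f"chromosome {chrom} not in reference"))
        if tags.get("SVTYPE") not in ALLOWED_SVTYPES:
            problems.append((number, "missing or unsupported SVTYPE tag"))
    return problems


if __name__ == "__main__":
    fasta = [">chr1 example", "ACGT", ">chr2", "GGCC"]
    vcf = [
        "##fileformat=VCFv4.2",
        "chr1\t100\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=200",
        "chrX\t5\t.\tN\t<DEL>\t.\tPASS\tEND=50",
    ]
    for number, message in check_vcf(vcf, fasta_names(fasta)):
        print(f"line {number}: {message}")
```

For real files, the lists can be replaced by open file handles, since the functions only iterate over lines.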
git clone https://github.com/llecompte/SVJedi.git
SVJedi is also distributed as a Bioconda package:
conda install -c bioconda svjedi
The folder Data/HG002_son includes an example of 20 SVs (10 insertions and 10 deletions) to genotype on a subsample of a real human dataset of the Ashkenazim son HG002.
Example command line:
python3 svjedi.py -v Data/HG002_son/HG002_20SVs_Tier1_v0.6_PASS.vcf -a Data/HG002_son/reference_at_breakpoints.fasta -i Data/HG002_son/PacBio_reads_set.fastq.gz -o Data/HG002_son/genotype_results.vcf
The folder Data/C_elegans includes an example of 12 SVs (del, ins, inv, bnd) to genotype with a small synthetic read dataset on a subset of the Caenorhabditis elegans genome.
Example command line:
python3 svjedi.py -v Data/C_elegans/test.vcf -r Data/C_elegans/genome.fasta -i Data/C_elegans/simulated-reads.fastq.gz
SVJedi has two different usages: from non-aligned reads or from aligned reads (PAF format).
python3 svjedi.py -v <set_of_sv.vcf> -r <refgenome.fasta> -i <long_reads.fastq>
python3 svjedi.py -v <set_of_sv.vcf> -a <refallele.fasta> -i <long_reads.fastq>
python3 svjedi.py -v <set_of_sv.vcf> -p <alignments.paf>
Option | Description |
---|---|
-v/--vcf | Set of SVs in VCF |
-r/--ref | Reference genome in FASTA |
-i/--input | Sequenced long reads in FASTQ or FASTQ.GZ (1 file or multiple files) |
-a/--allele | Reference sequences of alleles |
-p/--paf | Alignments in PAF |
-o/--output | Output file with genotypes in VCF |
-ms/--minsupport | Minimum number of informative alignments to assign a genotype |
-dover | Breakpoint distance overlap required (default 100 bp) |
-dend | Soft-clipping length allowed to consider a semi-global alignment (default 100 bp) |
-d/--data | Type of sequencing data, either ont or pb (default pb) |
-t/--threads | Number of threads for mapping |
-h/--help | Show help |
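To make the -dover and -dend options more concrete, here is a minimal sketch of how alignments in PAF could be filtered with such thresholds. It is one plausible reading of the option descriptions, not SVJedi's actual implementation; the function names and the exact conditions are assumptions. Only the standard PAF column order is relied upon.

```python
def parse_paf_line(line):
    """Parse the first nine mandatory PAF columns into a dict."""
    f = line.rstrip("\n").split("\t")
    return {
        "qname": f[0], "qlen": int(f[1]), "qstart": int(f[2]), "qend": int(f[3]),
        "strand": f[4],
        "tname": f[5], "tlen": int(f[6]), "tstart": int(f[7]), "tend": int(f[8]),
    }

def overlaps_breakpoint(aln, breakpoint, d_over=100):
    """Assumed rule: the alignment covers d_over bases on each side of the breakpoint."""
    return aln["tstart"] <= breakpoint - d_over and aln["tend"] >= breakpoint + d_over

def is_semi_global(aln, d_end=100):
    """Assumed rule: on each side, the alignment stops within d_end bases
    of the end of the read or the end of the allele sequence."""
    if aln["strand"] == "+":
        read_left, read_right = aln["qstart"], aln["qlen"] - aln["qend"]
    else:
        # On the reverse strand the end of the read faces the target start
        read_left, read_right = aln["qlen"] - aln["qend"], aln["qstart"]
    target_left, target_right = aln["tstart"], aln["tlen"] - aln["tend"]
    return min(read_left, target_left) <= d_end and min(read_right, target_right) <= d_end


if __name__ == "__main__":
    line = "read1\t10000\t50\t9950\t+\tallele_ref\t10000\t2000\t9900\t7000\t7900\t60"
    aln = parse_paf_line(line)
    print(overlaps_breakpoint(aln, breakpoint=5000), is_semi_global(aln))
```

An alignment passing both tests would count as informative for the allele it maps to; the number of such alignments is what a minimum-support threshold like -ms would be compared against.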
Here is the information SVJedi needs to genotype each of the following SV types. All variants must have the CHROM and POS fields defined, and the chromosome names in reference.fasta and in set_of_sv.vcf must be the same. Additional information is then required according to the SV type:
- Deletion
  - Either the ALT field is <DEL> or the INFO field must contain SVTYPE=DEL
  - The INFO field must contain either the END=pos tag (with pos being the end position of the deleted segment) or the SVLEN=len tag (with len being the size of the deletion)
- Insertion
  - The INFO field must contain SVTYPE=INS
  - The ALT field must contain the sequence of the insertion
- Inversion
  - Either the ALT field is <INV> or the INFO field must contain SVTYPE=INV
  - The INFO field must contain the END=pos tag, with pos being the second breakpoint position
- Translocation
  - The INFO field must contain the SVTYPE=BND, CHR2= and END= tags
    - The CHR2 name and sequence must be in the reference genome fasta file
  - The ALT field must be formatted as t[chr:pos[, t]chr:pos], ]chr:pos]t or [chr:pos[t, with chr and pos indicating the second breakpoint position and the bracket directions indicating which parts of the two chromosomes should be joined together
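The per-type requirements above can be expressed as a small validation routine. This is a minimal sketch written for illustration; the function names and error messages are hypothetical, and SVJedi's own parsing may differ. The breakend patterns cover only the four ALT forms listed for translocations.

```python
import re

# The four breakend forms: t[chr:pos[  t]chr:pos]  ]chr:pos]t  [chr:pos[t
BND_PATTERNS = [
    re.compile(r"^[ACGTNacgtn]+([\[\]])([^\[\]:]+):(\d+)\1$"),
    re.compile(r"^([\[\]])([^\[\]:]+):(\d+)\1[ACGTNacgtn]+$"),
]

def parse_bnd_alt(alt):
    """Return (chr, pos) of the second breakpoint, or None if ALT is malformed."""
    for pattern in BND_PATTERNS:
        match = pattern.match(alt)
        if match:
            return match.group(2), int(match.group(3))
    return None

def missing_requirements(alt, info):
    """List the requirements that a record (ALT and INFO fields) does not meet."""
    tags = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        tags[key] = value
    svtype = tags.get("SVTYPE")
    problems = []
    if alt == "<DEL>" or svtype == "DEL":
        if "END" not in tags and "SVLEN" not in tags:
            problems.append("deletion needs END or SVLEN in INFO")
    elif alt == "<INV>" or svtype == "INV":
        if "END" not in tags:
            problems.append("inversion needs END in INFO")
    elif svtype == "INS":
        if not re.fullmatch(r"[ACGTNacgtn]+", alt):
            problems.append("insertion needs its sequence in ALT")
    elif svtype == "BND":
        for tag in ("CHR2", "END"):
            if tag not in tags:
                problems.append(f"translocation needs {tag} in INFO")
        if parse_bnd_alt(alt) is None:
            problems.append("translocation ALT is not a valid breakend")
    else:
        problems.append("unsupported or missing SV type")
    return problems


if __name__ == "__main__":
    print(parse_bnd_alt("G]chr2:321681]"))
    print(missing_requirements("<DEL>", "SVTYPE=DEL"))
```

The check that the CHR2 sequence exists in the reference FASTA is left out here, since it needs the reference file itself.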
SVJedi: Genotyping structural variations with long reads. Lecompte L, Peterlongo P, Lavenier D, Lemaitre C. Bioinformatics 2020 doi:10.1093/bioinformatics/btaa527 (bioRxiv preprint)
SVJedi is a Genscale tool developed by Lolita Lecompte lolita.lecompte@inria.fr