Deterministic and standard extractor of indels, SNVs and structural variants (SVs) from VCF files built under the frame of EUCANCan's second work package. VariantExtractor is a Python package (requires Python version 3.6 or higher) and provides a set of data structures and functions to extract variants from VCF files in a deterministic and standard way while adding information to facilitate afterwards processing. It homogenizes multiallelic variants, MNPs and SVs (including imprecise paired breakends and single breakends). The package is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis. Check the available documentation for more information.
While there is somewhat of an agreement on how to label the SNVs and indels variants, this is not the case for the structural variants. In the current scenario, different labeling between variant callers makes comparisons between structural variants difficult. This package provides an unified interface to extract variants (included structural variants) from VCFs generated by different variant callers. Apart from reading the VCF file, VariantExtractor adds a preprocessing layer to homogenize the variants extracted from the file. This way, the variants can be used in downstream analysis in a consistent way. For more information about the homogenization process, check the homogenization rules section.
VariantExtractor is available on PyPI and can be installed using pip
:
pip install variant-extractor
# Import the package
from variant_extractor import VariantExtractor
# Create a new instance of the class
extractor = VariantExtractor('/path/to/file.vcf')
# Iterate through the variants
for variant_record in extractor:
print(f'Found variant of type {variant_record.variant_type.name}: {variant_record.contig}:{variant_record.pos}')
# Import the package
from variant_extractor import VariantExtractor
# Create a new instance of the class
extractor = VariantExtractor('/path/to/file.vcf')
# Save variants to a CSV file
extractor.to_dataframe().drop(['variant_record_obj'], axis=1).to_csv('/path/to/output.csv', index=False)
For a more complete list of examples, check the examples directory. This folder also includes an example of a script for normalizing VCF files following the homogenization rules.
The VariantExtractor
constructor returns a generator of VariantRecord
instances. The VariantRecord
class is a container for the information contained in a VCF record plus some extra useful information.
Property | Type | Description |
---|---|---|
contig |
str |
Contig name |
pos |
int |
Position on the contig |
end |
int |
End position of the variant in the contig (same as pos for TRA and SNV) |
length |
int |
Length of the variant |
id |
Optional[str] |
Record identifier |
ref |
str |
Reference sequence |
alt |
str |
Alternative sequence |
qual |
Optional[float] |
Quality score for the assertion made in ALT |
filter |
List[str] |
Filter status. PASS if this position has passed all filters. Otherwise, it contains the filters that failed |
info |
Dict[str, Any] |
Additional information |
format |
List[str] |
Specifies data types and order of the genotype information |
samples |
Dict[str, Dict[str, Any]] |
Genotype information for each sample |
variant_type |
VariantType |
Variant type inferred |
alt_sv_breakend |
Optional[ BreakendSVRecord ] |
Breakend SV info, present only for SVs with breakend notation. For example, G]17:198982] |
alt_sv_shorthand |
Optional[ ShorthandSVRecord ] |
Shorthand SV info, present only for SVs with shorthand notation. For example, <DUP:TANDEM> |
The VariantType
enum describes the type of the variant. For structural variants, it is inferred only from the breakend notation (or shorthand notation). It does not take into account any INFO
field (SVTYPE
nor EVENTYPE
) that might be added by the variant caller afterwards.
REF | ALT | Variant name | Description |
---|---|---|---|
A | G | SNV | Single nucleotide variant |
AGTG | A | DEL | Deletion |
A | A[1:20[ or <DEL> | DEL | Deletion |
A | ACCT or <INS> | INS | Insertion |
A | ]1:20]A or <DUP> | DUP | Duplication |
A | A]1:20] or [1:20[A | INV | Inversion. <INV> is a special case |
A | <CNV> | CNV | Copy number variation |
A | A]X:20] or A[X:20[ or ]X:20]A or [X:20[A | TRA | Translocation |
A | A. or .A | SGL | Single breakend |
The BreakendSVRecord
class is a container for the information contained in a VCF record for SVs with breakend notation.
Property | Type | Description |
---|---|---|
prefix |
Optional[str] |
Prefix of the SV record with breakend notation. For example, for G]17:198982] the prefix will be G |
bracket |
str |
Bracket of the SV record with breakend notation. For example, for G]17:198982] the bracket will be ] |
contig |
str |
Contig of the SV record with breakend notation. For example, for G]17:198982] the contig will be 17 |
pos |
int |
Position of the SV record with breakend notation. For example, for G]17:198982] the position will be 198982 |
suffix |
Optional[str] |
Suffix of the SV record with breakend notation. For example, for G]17:198982] the suffix will be None |
The ShorthandSVRecord
class is a container for the information contained in a VCF record for SVs with shorthand notation.
Property | Type | Description |
---|---|---|
type |
str |
Type of the SV record with shorthand notation. One of the following, 'DEL' , 'INS' , 'DUP' , 'INV' or 'CNV' . For example, for <DUP:TANDEM> the type will be DUP |
extra |
List[str] |
Extra information of the SV. For example, for <DUP:TANDEM:AA> the extra will be ['TANDEM', 'AA'] |
VariantExtractor provides a unified interface to extract variants (included structural variants) from VCF files generated by different variant callers. The variants are homogenized and returned applying the following rules:
An entry with multiple ALT
sequences (multiallelic) is divided into multiple entries with a single ALT
field. This entries with a single ALT
field are then processed with the rest of the homogeneization rules. For example:
CHROM | POS | ID | REF | ALT | FILTER |
---|---|---|---|---|---|
2 | 1 | multiallelic_1 | G | C,T | PASS |
is returned as:
CHROM | POS | ID | REF | ALT | FILTER | VariantType |
---|---|---|---|---|---|---|
2 | 1 | multiallelic_1_0 | G | C | PASS | SNV |
2 | 1 | multiallelic_1_1 | G | T | PASS | SNV |
Entries with REF/ALT
of the same lenghts are treated like SNVs. If the REF/ALT
sequences are more than one nucleotide (MNPs), they are divided into multiple atomic SNVs. For example:
CHROM | POS | ID | REF | ALT | FILTER |
---|---|---|---|---|---|
2 | 1 | snv_1 | C | G | PASS |
2 | 3 | mnp_1 | TAG | AGT | PASS |
are returned as:
CHROM | POS | ID | REF | ALT | FILTER | VariantType |
---|---|---|---|---|---|---|
2 | 3 | snv_1 | C | G | PASS | SNV |
2 | 3 | mnp_1_0 | T | A | PASS | SNV |
2 | 4 | mnp_1_1 | A | G | PASS | SNV |
2 | 5 | mnp_1_2 | G | T | PASS | SNV |
VariantExtractor returns one entry per structural variant (one entry per breakend pair). This helps to avoid the ambiguity of the notation and keeps the process deterministic. For this reason, in case of paired breakends, the breakend with the lowest chromosome and/or position is returned. If a breakend is not the lowest chromosome and/or position and is missing its pair, its pair is inferred and returned.
Entries with the same information, either described with shorthand or breakend notation, will be returned the same way. Here is an example for a DEL entry:
CHROM | POS | ID | REF | ALT | FILTER | INFO |
---|---|---|---|---|---|---|
1 | 3000 | event_1_o | A | A[1:5000[ | PASS | SVTYPE=BND |
1 | 5000 | event_1_h | A | ]1:3000]A | PASS | SVTYPE=BND |
1 | 3000 | event_1 | A | A[1:5000[ | PASS | SVTYPE=DEL |
1 | 3000 | event_1 | A | <DEL> | PASS | SVTYPE=DEL; END=5000 |
1 | 3000 | event_1 | AGTCACAA... | A | PASS |
are returned as one entry (each one of them with their own ALT
field), but with the same VariantRecord.end
and VariantType
:
CHROM | POS | ID | REF | ALT | FILTER | INFO | VariantType |
VariantRecord.end |
---|---|---|---|---|---|---|---|---|
1 | 3000 | event_1 | A | ... | PASS | ... | DEL | 5000 |
<INV> is a special case of shorthand notation because it represents two paired breakends. For example, the following shorthand notation:
CHROM | POS | ID | REF | ALT | FILTER | INFO |
---|---|---|---|---|---|---|
2 | 321682 | event_1 | T | <INV> | PASS | SVTYPE=INV;END=421681 |
is equivalent to the following breakends:
CHROM | POS | ID | REF | ALT | FILTER | INFO | VariantType |
---|---|---|---|---|---|---|---|
2 | 321681 | event_1_0 | N | N]2:421681] | PASS | SVTYPE=INV | INV |
2 | 321682 | event_1_1 | T | [2:421682[T | PASS | SVTYPE=INV | INV |
In this case, VariantExtractor converts internally <INV> to two entries with breakend notation (one for each breakend pair). Note that the N
will be replaced with the correct nucleotide if fasta_ref
is provided to VariantExtractor.
For paired breakends, breakends are paired using the INFO
fields MATEID
or PARID
. If these fields are not available, they are paired using their coordinates (contig+position). The breakend with the lowest chromosome and/or position is returned. For example:
CHROM | POS | ID | REF | ALT | FILTER | INFO |
---|---|---|---|---|---|---|
2 | 3000 | event_1_o | T | ]3:5000]T | PASS | SVTYPE=BND |
3 | 5000 | event_1_h | G | G[2:3000[ | PASS | SVTYPE=BND |
1 | 3000 | event_2_o | A | A[1:5000[ | PASS | SVTYPE=BND |
1 | 5000 | event_2_h | A | ]1:3000]A | PASS | SVTYPE=BND |
are returned as one entry per variant:
CHROM | POS | ID | REF | ALT | FILTER | INFO | VariantType |
---|---|---|---|---|---|---|---|
2 | 3000 | event_1_o | T | ]3:5000]T | PASS | SVTYPE=BND | TRA |
1 | 3000 | event_2_o | A | A[1:5000[ | PASS | SVTYPE=BND | DEL |
If all breakends are missing their pair, the breakends with the lowest chromosome and/or position are inferred and returned. For example:
CHROM | POS | ID | REF | ALT | FILTER | INFO |
---|---|---|---|---|---|---|
3 | 5000 | event_1_h | G | G[2:3000[ | PASS | SVTYPE=BND |
1 | 5000 | event_2_h | A | ]1:3000]A | PASS | SVTYPE=BND |
are returned as their inferred breakend pair with the lowest chromosome and/or position:
CHROM | POS | ID | REF | ALT | FILTER | INFO | VariantType |
---|---|---|---|---|---|---|---|
2 | 3000 | event_1_h | N | ]3:5000]N | PASS | SVTYPE=BND | TRA |
1 | 3000 | event_2_h | A | A[1:5000[ | PASS | SVTYPE=BND | DEL |
Note that the N
will be replaced with the correct nucleotide if fasta_ref
is provided to VariantExtractor. The following equivalencies are applied:
CHROM1 | POS1 | REF1 | ALT1 | CHROM2 | POS2 | REF2 | ALT2 |
---|---|---|---|---|---|---|---|
1 | 500 | N | N[7:800[ | 7 | 800 | N | ]1:500]N |
1 | 500 | N | ]7:800]N | 7 | 800 | N | N[1:500[ |
1 | 500 | N | [7:800[N | 7 | 800 | N | [1:500[N |
1 | 500 | N | N]7:800] | 7 | 800 | N | N]1:500] |
Imprecise breakends do not match exactly with their pair in coordinates. In this case, they are paired using the INFO
fields MATEID
or PARID
. As with the rest of variants, for each breakend pair, only the breakend with the lowest chromosome and/or position is returned. However, it is important to notice that the CIPOS
field is lost for the other breakend. For example:
CHROM | POS | ID | REF | ALT | FILTER | INFO |
---|---|---|---|---|---|---|
2 | 3010 | event_1_o | T | T[3:5000[ | PASS | SVTYPE=BND;CIPOS=0,50;PARID=event_1_h |
3 | 5050 | event_1_h | A | ]2:3050]A | PASS | SVTYPE=BND;CIPOS=0,100;PARID=event_1_o |
are paired and the entry with the lowest chromosome and/or position is returned:
CHROM | POS | ID | REF | ALT | FILTER | INFO | VariantType |
---|---|---|---|---|---|---|---|
2 | 3010 | event_1_o | T | T[3:5000[ | PASS | SVTYPE=BND;CIPOS=0,50;PARID=a_h | TRA |
Single breakends cannot be matched with other breakends because they lack a mate. They may be able to be matched later in downstream analysis. That is why each one is kept as a different variant. For example:
CHROM | POS | ID | REF | ALT | FILTER | INFO |
---|---|---|---|---|---|---|
2 | 3000 | event_s | T | T. | PASS | SVTYPE=BND |
3 | 5000 | event_m | G | .G | PASS | SVTYPE=BND |
are returned as two entries:
CHROM | POS | ID | REF | ALT | FILTER | INFO | VariantType |
---|---|---|---|---|---|---|---|
2 | 3000 | event_s | T | T. | PASS | SVTYPE=BND | SGL |
3 | 5000 | event_m | G | .G | PASS | SVTYPE=BND | SGL |
The dependencies are covered by their own respective licenses as follows:
- Python/Pysam package (MIT license)