LOFTEE (Loss-Of-Function Transcript Effect Estimator)

Loss-of-function pipeline inspired by MacArthur et al., 2012.

A VEP plugin to identify LoF (loss-of-function) variation.

Currently assesses variants that are:

Stop-gained
Splice site disrupting
Frameshift variants

Filters

LOFTEE implements a set of filters to deem a LoF as "low-confidence."

For stop-gained and frameshift variants, LOFTEE removes:

Variants that are in the last X% of the transcript (filter_position; default = 5%)
Variants that land in an exon with non-canonical splice sites around it (i.e. intron does not start with GT and end with AG)

For splice-site variants, LOFTEE removes:

Variants in small introns (min_intron_size; default = 15 bp)
Variants that fall in an intron with a non-canonical splice site (i.e. intron does not start with GT and end with AG).

For all variants, LOFTEE removes:

Variants where the purported LoF allele is the ancestral state (across primates)

Flags

LOFTEE implements a series of flags based on some variant parameters.

For stop-gained and frameshift variants, LOFTEE flags:

Variants in genes with only a single exon

For splice-site variants, LOFTEE flags:

Variants in NAGNAG sites (acceptor sites rescued by in-frame acceptor site)

Requirements

VEP
= Perl 5.10.1
Ancestral sequence (human_ancestor.fa[.rz])
Samtools (must be on path)

Usage

LOFTEE is easiest run when it is in the VEP plugin directory (~/.vep/Plugins/): mv LoF.pm ~/.vep/Plugins. Alternatively, VEP can be run with --dir_plugins to specify a plugins directory.

Basic usage:

perl variant_effect_predictor.pl [--other options to VEP] --plugin LoF

Advanced usage:

perl variant_effect_predictor.pl [--other options to VEP] --plugin LoF,human_ancestor_fa:/path/to/human_ancestor.fa[.rz]

perl variant_effect_predictor.pl [--other options to VEP] --plugin LoF,human_ancestor_fa:/path/to/human_ancestor.fa,filter_position:0.05

Options:

filter_position

Position in transcript where a variant should be filtered. Default is 0.05, corresponding to last 5% of transcript.

min_intron_size

Minimum intron size, below which a variant should be filtered.

fast_length_calculation

The Ensembl API can be used to calculate transcript length in two different methods: one approximate (fast; usually within 3 bp of correct length) and one perfect (slow). Default: fast.

human_ancestor_fa

Location of human_ancestor.fa file (need associated tabix index file), available for download here: http://www.broadinstitute.org/~konradk/loftee/human_ancestor.fa.rz and http://www.broadinstitute.org/~konradk/loftee/human_ancestor.fa.rz.fai. Courtesy of Javier Herrero and the 1000 Genomes Project (source: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/). If this flag is set to 'false', the ancestral allele will not be checked and filtered.

check_complete_cds

The Ensembl API contains a "Complete CDS" annotation that indicates that a start and stop codon has been identified for this transcript. This flag unfortunately requires Ensembl database access, and thus, severely decreases performance and is disabled by default.

Output

The output is the standard VEP output, or standard VEP VCF if --vcf is passed to VEP. For those unfamiliar with VEP's VCF output, the annotations are written to the CSQ attribute of the INFO field. Here, a comma-separated list of consequences, corresponding to each transcript-(alternate)allele pair, is written with each entry as a pipe-delimited set of annotations. With more alleles and transcripts (and especially with the --everything flag), this will inevitably make for some very long INFO fields that are difficult to parse by eye.

See src/read_vep_vcf.py for a barebones example of a parsing script, or the section below on Parsing the VEP/LoF VCF for some tips and tricks.

From VEP, a VCF line may look like:

1       1178848 rs115005664     G       A       1000.0   PASS   AC=1;AF=0.5;AN=2;CSQ=A|ENSG00000184163|ENST00000468365|Transcript|non_coding_exon_variant&nc_transcript_variant|445|||||||-1|||,A|ENSG00000184163|ENST00000462849|Transcript|upstream_gene_variant|||||||5|-1|||,A|ENSG00000184163|ENST00000486627|Transcript|downstream_gene_variant|||||||513|-1|||,A|ENSG00000184163|ENST00000330388|Transcript|stop_gained|648|616|206|Q/*|Cag/Tag|||-1|||HC,A|ENSG00000184163|ENST00000478606|Transcript|upstream_gene_variant|||||||310|-1|||`

This is comprehensive, but the crucial information is in the CSQ= part, so here we have the line split up by allele-transcript pair:

A|ENSG00000184163|ENST00000468365|Transcript|non_coding_exon_variant&nc_transcript_variant|445|||||||-1|||,
A|ENSG00000184163|ENST00000462849|Transcript|upstream_gene_variant|||||||5|-1|||,
A|ENSG00000184163|ENST00000486627|Transcript|downstream_gene_variant|||||||513|-1|||,
A|ENSG00000184163|ENST00000330388|Transcript|stop_gained|648|616|206|Q/*|Cag/Tag|||-1|||HC,
A|ENSG00000184163|ENST00000478606|Transcript|upstream_gene_variant|||||||310|-1|||

The overall format of a VCF is described on the VCF Specification Page. The key to parsing this section is in the header line added by VEP.

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|LoF_flags|LoF_filter|LoF">

This line contains the corresponding mappings to these fields after Format::

Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|LoF_flags|LoF_filter|LoF

Parsing the VEP/LoF VCF

One approach that simplifies extracting data from the INFO column is to convert the text representation to a dictionary, where each annotation entry is a key-value pair. Since CSQ entries in the INFO field can contain multiple allele-transcript pairs, we will need to make a list of these dictionaries -- one dictionary per pair.

In Python, the header line line can be read with:

vep_field_names = line.split('Format: ')[-1].strip('">').split('|')

To access the annotations, the VCF record can then be read with:

# Split VCF line
fields = vcf_line.split('\t')

# Split INFO field by semicolons (using lookahead regular expressions due to VEP introducing semi-colons into the INFO field in some scenarios)
# This creates a dictionary with key-value pairs of info field.
info_field = dict([(x.split('=', 1)) for x in re.split(';(?=\w)', fields[7]) if x.find('=') > -1])

# For instance, info_field['AF'] would return the allele frequency of that variant.

# Pull together the VEP field names from before with CSQ attribute from INFO field (which are pipe-delimited).
annotations = [dict(zip(vep_field_names, x.split('|'))) for x in info_field['CSQ'].split(',')]

annotations is now a list of dictionaries like so:

[{'Allele': 'A',
  'Amino_acids': '',
  'CDS_position': '',
  'Codons': '',
  'Consequence': 'non_coding_exon_variant&nc_transcript_variant',
  'DISTANCE': '',
  'Existing_variation': '',
  'Feature': 'ENST00000468365',
  'Feature_type': 'Transcript',
  'Gene': 'ENSG00000184163',
  'LoF': '',
  'LoF_filter': '',
  'LoF_flags': '',
  'Protein_position': '',
  'STRAND': '-1',
  'cDNA_position': '445'},
  ...
  {'Allele': 'A',
  'Amino_acids': 'Q/*',
  'CDS_position': '616',
  'Codons': 'Cag/Tag',
  'Consequence': 'stop_gained',
  'DISTANCE': '',
  'Existing_variation': '',
  'Feature': 'ENST00000330388',
  'Feature_type': 'Transcript',
  'Gene': 'ENSG00000184163',
  'LoF': 'HC',
  'LoF_filter': '',
  'LoF_flags': '',
  'Protein_position': '206',
  'STRAND': '-1',
  'cDNA_position': '648'}, ...]

CSQ entries in the INFO field for a given variant can now be accessed easily. For example, to get the LoF_filter field for the first allele transcript pair use: annotations[0]['LoF_filter'].

LoF annotations can be extracted as:

lof_annotations = [x for x in annotations if x['LoF'] == 'HC']

HC refers to high-confidence LoF variants (i.e. does not fail any filters). LC denotes low-confidence, failing at least one filter, which are written to the LoF_filter field.

Possible values for the LoF_filter field are:

END_TRUNC
EXON_INTRON_UNDEF
INCOMPLETE_CDS
NON_CAN_SPLICE_SURR
EXON_INTRON_UNDEF
SMALL_INTRON
NON_CAN_SPLICE
ANC_ALLELE

Possible values for the Lof_flags field are:

SINGLE_EXON
NAGNAG_SITE

Special thanks to Monkol Lek for the initial implementation of the software and developing many of these filters.

sestaton/loftee