ensembl-vep

  • VEP (Variant Effect Predictor) predicts the functional effects of genomic variants.
  • Haplosaurus uses phased genotype data to predict whole-transcript haplotype sequences.

Installation and requirements

The VEP package requires Perl (>=5.10 recommended, tested on 5.8, 5.10, 5.14, 5.18, 5.22) and the DBI package installed. The remaining dependencies can be installed using the included INSTALL.pl script. Basic instructions:

git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl

The installer may also be used to check for updates to this and co-dependent packages; simply re-run INSTALL.pl.

See documentation for full installation instructions.

Additional CPAN modules

The following modules are optional but most users will benefit from installing them. We recommend using cpanminus to install.

  • DBD::mysql - required for database access (--database or --cache without --offline)
  • Set::IntervalTree - required for Haplosaurus, also speeds up VEP
  • JSON - required for writing JSON output
  • PerlIO::gzip - faster compressed file parsing
  • Bio::DB::BigFile - required for reading custom annotation data from BigWig files
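
Before installing anything, it can help to see which of these modules Perl can already load. The loop below is a minimal sketch, not part of VEP: `check_perl_module` is a hypothetical helper name, and it only tests whether `perl -M<module> -e 1` succeeds.

```shell
#!/bin/sh
# Hypothetical helper (not part of VEP): report whether Perl can load a module.
check_perl_module() {
  if perl -M"$1" -e 1 >/dev/null 2>&1; then
    printf '%s: installed\n' "$1"
  else
    printf '%s: missing\n' "$1"
  fi
}

# DBI is required; the rest are the optional modules listed above.
for module in DBI DBD::mysql Set::IntervalTree JSON PerlIO::gzip Bio::DB::BigFile; do
  check_perl_module "$module"
done
```

Any module reported as missing can then be installed with cpanminus, for example `cpanm JSON`.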

VEP

Usage

./vep -i input.vcf -o out.txt -offline

See documentation for full command line instructions.

Please report any bugs or issues by contacting Ensembl or creating a GitHub issue.

Differences to ensembl-tools VEP

This ensembl-vep repo is a complete rewrite of the VEP code intended to make the software faster, more robust and more easily extensible. Almost all functionality of the ensembl-tools version has been replicated, with the command line flags remaining largely unchanged. A summary of changes follows:

  • Tool name: For brevity and to distinguish the two versions, the new command line tool is named vep, with the version in ensembl-tools named variant_effect_predictor.pl.
  • Speed: A typical individual human genome of 4 million variants can now be processed in around 30 minutes on a quad-core machine using under 1GB of RAM.
  • Known/existing variants: The alleles of your input variant are now compared to any known variants when using --check_existing. Previously this would require you to enable this functionality manually with --check_alleles. The old functionality can be restored using --no_check_alleles.
  • Allele frequencies: Allele frequencies are now reported for the input allele only, e.g. as 0.023 instead of A:0.023,G:0.0005. To reflect this change, the allele frequency fields are now named e.g. AFR_AF instead of AFR_MAF. The command line flags have changed to match, so --gmaf is now --af and --maf_1kg is now --af_1kg. Using the old flags will produce a deprecation message.
  • GFF and GTF files: GFF and GTF files may now be used directly as a source of transcript annotation in place of, or even alongside, a cache or database source. Previously this involved building a cache using gtf2vep, which is now redundant. The files must first be bgzipped and tabix-indexed, and a FASTA file containing genomic sequence is required:
grep -v "#" data.gff | sort -k1,1 -k4,4n -k5,5n | bgzip -c > data.gff.gz
tabix -p gff data.gff.gz
./vep -i input.vcf -gff data.gff.gz -fasta genome.fa.gz
  • VCF custom annotations: VCF files used as a source of custom annotation will now have allele-specific data added from INFO fields; previously the whole content of each requested KEY=VALUE pair was reported.
  • New pick flags: New flags have been added to help select among consequence output: --pick_allele_gene, --flag_pick_allele_gene.
  • Runtime status: vep produces no runtime progress messages.
  • Deprecated:
    • GVF output: --gvf
    • HTML output: --html
    • format conversion: --convert
    • pileup input: --format pileup
    • MAF flags (replaced by AF flags): --gmaf (--af), --maf_1kg (--af_1kg), --maf_esp (--af_esp), --maf_exac (--af_exac)
    • known variant allele checking (on by default, use --no_check_alleles to restore old behaviour): --check_alleles
    • cache building flags (replaced by internal Ensembl pipeline): --build, --write_cache
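
For scripts written against the ensembl-tools version, the MAF-to-AF renames listed above can be applied mechanically. The following is a minimal sketch, not part of VEP: `translate_flag` is a hypothetical helper, and it covers only the four flag pairs named in the deprecation list.

```shell
#!/bin/sh
# Hypothetical helper (not part of VEP): map a deprecated MAF flag to its AF replacement.
translate_flag() {
  case "$1" in
    --gmaf)     printf '%s\n' '--af' ;;
    --maf_1kg)  printf '%s\n' '--af_1kg' ;;
    --maf_esp)  printf '%s\n' '--af_esp' ;;
    --maf_exac) printf '%s\n' '--af_exac' ;;
    *)          printf '%s\n' "$1" ;;   # anything else passes through unchanged
  esac
}

# Rewrite an old-style argument list, printing one argument per line.
for arg in -i input.vcf --offline --gmaf --maf_1kg; do
  translate_flag "$arg"
done
```

Flags that were removed outright (such as --gvf or --html) have no replacement and are deliberately left untouched by this sketch.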

Haplosaurus

haplo is a local tool implementation of the same functionality that powers the Ensembl transcript haplotypes view. It takes phased genotypes from a VCF and constructs a pair of haplotype sequences for each overlapped transcript; these sequences are also translated into predicted protein haplotype sequences. Each variant haplotype sequence is aligned and compared to the reference, and an HGVS-like name is constructed representing its differences to the reference.

This approach offers an advantage over VEP's analysis, which treats each input variant independently. By considering the combined change contributed by all the variant alleles across a transcript, the compound effects the variants may have are correctly accounted for.

haplo shares much of the same command line functionality with vep, and can use VEP caches, Ensembl databases, GFF and GTF files as sources of transcript data; all vep command line flags relating to this functionality work the same with haplo.

Usage

Input data must be a VCF containing phased genotype data for at least one individual; no other formats are currently supported.

When using a VEP cache as the source of transcript annotation, the first time you run haplo with a particular cache it will spend some time scanning transcript locations in the cache.

./haplo -i input.vcf -o out.txt -cache
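
Because haplo needs phased genotypes, a quick check of the input can save a wasted run. The sketch below is not part of haplo: `count_phasing` is a hypothetical helper that assumes an uncompressed VCF, relies on the VCF convention that phased genotypes use `|` and unphased genotypes use `/`, and assumes GT is the first per-sample subfield.

```shell
#!/bin/sh
# Hypothetical helper (not part of haplo): count phased and unphased genotype calls.
# Assumes an uncompressed VCF with GT as the first per-sample subfield.
count_phasing() {
  awk -F'\t' '
    /^#/ { next }                           # skip header lines
    {
      for (i = 10; i <= NF; i++) {          # sample columns start at column 10
        split($i, sub_fields, ":")
        gt = sub_fields[1]
        if (gt ~ /\|/)      phased++        # e.g. 0|1
        else if (gt ~ /\//) unphased++      # e.g. 0/1
      }
    }
    END { printf "phased=%d unphased=%d\n", phased, unphased }
  ' "$1"
}

# Example call (input.vcf is a placeholder file name):
#   count_phasing input.vcf
```

Calls without a separator (for example haploid genotypes) are counted in neither total.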

Output

Output data is currently a simple tab-delimited file reporting all observed non-reference haplotypes. It has the following fields:

  1. Transcript stable ID
  2. CDS haplotype name
  3. Comma-separated list of flags for CDS haplotype
  4. Protein haplotype name
  5. Comma-separated list of flags for protein haplotype
  6. Comma-separated list of frequency data for protein haplotype
  7. Sample identifier
  8. Number of copies of this haplotype observed in sample
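
Since the output is plain tab-delimited text, it can be summarised with standard tools. The following is a minimal sketch, not part of haplo: `sum_haplotype_copies` is a hypothetical helper that totals the copy count (field 8) for each transcript and protein haplotype (fields 1 and 4) across all samples. It skips any line starting with `#` on the assumption that such lines, if present, are headers.

```shell
#!/bin/sh
# Hypothetical helper (not part of haplo): total observed copies per transcript
# and protein haplotype. Column numbers follow the field list above:
# 1 = transcript, 4 = protein haplotype name, 8 = copies in sample.
sum_haplotype_copies() {
  awk -F'\t' '
    /^#/ { next }                                   # assumed header lines, if any
    { copies[$1 "\t" $4] += $8 }
    END { for (key in copies) printf "%s\t%d\n", key, copies[key] }
  ' "$1" | sort
}

# Example call (out.txt is the file written by haplo -o):
#   sum_haplotype_copies out.txt
```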

Flags

Haplotypes may be flagged with one or more of the following:

  • indel: haplotype contains an insertion or deletion (indel) relative to the reference.
  • frameshift: haplotype contains at least one indel that disrupts the reading frame of the transcript.
  • resolved_frameshift: haplotype contains two or more indels whose combined effect restores the reading frame of the transcript.
  • stop_changed: haplotype either gains a STOP codon (protein truncating variant, PTV) or loses the existing reference STOP codon.
  • deleterious_sift_or_polyphen: haplotype contains at least one single amino acid substitution event flagged as deleterious (SIFT) or probably damaging (PolyPhen2).
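
The flags above appear as comma-separated lists in the CDS and protein flag fields of the output, so rows can be selected by flag with awk. This is a minimal sketch, not part of haplo: `filter_by_flag` is a hypothetical helper that assumes the field order described in the Output section (flags in fields 3 and 5).

```shell
#!/bin/sh
# Hypothetical helper (not part of haplo): print rows whose CDS (field 3) or
# protein (field 5) flag list contains exactly the requested flag.
filter_by_flag() {
  awk -F'\t' -v flag="$1" '
    {
      keep = 0
      n = split($3 "," $5, flags, ",")
      for (i = 1; i <= n; i++) if (flags[i] == flag) keep = 1
      if (keep) print
    }
  ' "$2"
}

# Example call (out.txt is the file written by haplo -o):
#   filter_by_flag frameshift out.txt
```

Comparing whole list entries rather than substrings means that asking for frameshift does not also return rows flagged only as resolved_frameshift.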

Frequency data

Haplotype frequencies may be loaded and assigned to observed haplotypes using --haplotype_frequencies [file]. The following files may be used:

Note that these files are temporarily hosted on third-party servers and may be changed or removed while the software remains in the development phase.