ESM Variants

This repository contains resources and command-line tools developed for the paper "Genome-wide prediction of disease variant effects with a deep protein language model" by Nadav Brandes, Grant Goldman, Charlotte H. Wang, Chun Jimmie Ye, and Vasilis Ntranos. A complete catalog of missense variant effect predictions is accessible here.

Repository Contents

Table_of_results.xlsx: Contains all benchmark data, VEP predictions used for performance evaluation, and results, except HGMD variants (see below).

Most data used in this work is already within the public domain. Exceptions and other data sources are detailed in the paper and below:

The HGMD dataset is a private resource owned by the Institute of Medical Genetics at Cardiff University. Request access here.
ClinVar labels of missense variants in all protein isoforms and of indels and stop-gains are available here.
Full details on how the datasets and benchmarks were processed are available in the Supplementary Methods.
Predicted effect scores for most VEP methods were downloaded from dbNSFP.

ESM1b: This study leverages and extends ESM1b, a protein language model developed by Meta AI. The code and pre-trained parameters for ESM1b were taken from the model's official GitHub repository.
Web portal: We created a web portal allowing researchers to query, visualize, and download missense variant effect predictions for all protein isoforms in the human genome.

The following dependencies are required:

pip3 install tqdm numpy pandas biopython torch fair-esm

Clone the repository:

git clone https://github.com/ntranoslab/esm-variants.git
cd esm-variants

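To score all possible missense mutations for the sequences in a FASTA file:

python3 esm_score_missense_mutations.py --input-fasta-file /path/to/input.fasta --output-csv-file /path/to/output.csv

To score multi-residue mutations listed in a CSV file:

python3 esm_score_multi_residue_mutations.py --input-csv-file /path/to/input.csv --output-csv-file /path/to/output.csv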

The input CSV file for multi-residue mutations should have three fields:

wt_seq: the wild type (original) protein sequence
mut_seq: the mutated protein sequence
start_pos: the starting position (1-indexed) of the mutation relative to the wild type sequence
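
The three fields above can also be generated programmatically. The following is a minimal sketch, not part of this repository: the helper names first_diff_pos and write_input_csv are hypothetical, and it assumes that start_pos is the first 1-indexed position at which the wild type and mutated sequences differ. That convention agrees with the example rows shown later in this README, but it is an inference rather than a documented rule.

```python
import csv
import io


def first_diff_pos(wt_seq, mut_seq):
    """Return the 1-indexed position of the first difference, or None if identical.

    Assumption: start_pos is the first position where the sequences diverge.
    """
    for i, (wt_res, mut_res) in enumerate(zip(wt_seq, mut_seq), start=1):
        if wt_res != mut_res:
            return i
    if len(wt_seq) != len(mut_seq):
        # One sequence is a prefix of the other; the change starts right after it.
        return min(len(wt_seq), len(mut_seq)) + 1
    return None


def write_input_csv(pairs, handle):
    """Write (wt_seq, mut_seq) pairs with the three expected column names."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["wt_seq", "mut_seq", "start_pos"])
    for wt_seq, mut_seq in pairs:
        start_pos = first_diff_pos(wt_seq, mut_seq)
        if start_pos is None:
            raise ValueError("wt_seq and mut_seq are identical")
        writer.writerow([wt_seq, mut_seq, start_pos])


pairs = [
    ("FISHWISHFQRCHIPSTHATARECRISP", "FISHWISHFQRCHEESETHATARECRISP"),
    ("MARGTYNMGKHFDA", "MGTYNMGKHFDA"),
]
buffer = io.StringIO()  # swap for open("example.csv", "w", newline="") to write a file
write_input_csv(pairs, buffer)
print(buffer.getvalue())
```

When a change falls inside a repeated stretch of residues, the first differing position is only one of several equivalent descriptions, so it is worth checking the computed start_pos against the intended variant.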

Assuming an example FASTA file named example.fasta:

>seq1
FISHWISHFQRCHIPSTHATARECRISP
>seq2
RAGEAGAINSTTHEMACHINE

You can calculate ESM scores for all possible missense mutations in these sequences:

python3 esm_score_missense_mutations.py --input-fasta-file example.fasta --output-csv-file esm_scores.csv

This will create a CSV file (esm_scores.csv) that starts like this:

seq_id,mut_name,esm_score
seq1,F1K,-3.2310808
seq1,F1R,-2.872289
seq1,F1H,-3.4361703
...

Each row represents a possible missense mutation and its ESM score.
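
For downstream analysis it can help to split mut_name into its parts. The sketch below is illustrative only: parse_mut_name and read_scores are hypothetical helpers, and the format it assumes (wild type residue, 1-indexed position, mutant residue, as in F1K) is inferred from the example rows above rather than from a documented specification.

```python
import csv
import io
import re

# Assumed format: one-letter wild type residue, position, one-letter mutant residue.
MUT_NAME = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mut_name(mut_name):
    """Split a name such as 'F1K' into ('F', 1, 'K')."""
    match = MUT_NAME.match(mut_name)
    if match is None:
        raise ValueError(f"unexpected mut_name format: {mut_name!r}")
    wt_res, pos, mut_res = match.groups()
    return wt_res, int(pos), mut_res


def read_scores(handle):
    """Read the output CSV into a list of dicts with parsed fields."""
    rows = []
    for row in csv.DictReader(handle):
        wt_res, pos, mut_res = parse_mut_name(row["mut_name"])
        rows.append({
            "seq_id": row["seq_id"],
            "wt": wt_res,
            "pos": pos,
            "mut": mut_res,
            "esm_score": float(row["esm_score"]),
        })
    return rows


# The three example rows from this README; use open("esm_scores.csv") for a real file.
example = """seq_id,mut_name,esm_score
seq1,F1K,-3.2310808
seq1,F1R,-2.872289
seq1,F1H,-3.4361703
"""
rows = read_scores(io.StringIO(example))
lowest = min(rows, key=lambda r: r["esm_score"])
print(lowest["seq_id"], lowest["wt"], lowest["pos"], lowest["mut"], lowest["esm_score"])
```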

Assuming the following example.csv:

wt_seq,mut_seq,start_pos
FISHWISHFQRCHIPSTHATARECRISP,FISHWISHFQRCHEESETHATARECRISP,14
MARGTYNMGKHFDA,MGTYNMGKHFDA,2

You can calculate ESM scores (PLLR, the pseudo-log-likelihood ratio) for the specified multi-residue mutations:

python3 esm_score_multi_residue_mutations.py --input-csv-file example.csv --output-csv-file esm_multi_residue_scores.csv

This will create a CSV file (esm_multi_residue_scores.csv) that starts like this:

wt_seq,mut_seq,start_pos,esm_score
FISHWISHFQRCHIPSTHATARECRISP,FISHWISHFQRCHEESETHATARECRISP,14,-1.0078125
MARGTYNMGKHFDA,MGTYNMGKHFDA,2,1.0056415

Each row represents a multi-residue mutation and its ESM (PLLR) score.
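
The output can be loaded and summarized with a few lines of Python. This is a rough sketch under stated assumptions: describe_change and load_results are hypothetical helpers, not functions from this repository, and the net length change is only a coarse label that does not distinguish, for example, a pure insertion from a combined deletion and insertion.

```python
import csv
import io


def describe_change(wt_seq, mut_seq):
    """Label a mutation by its net change in sequence length."""
    delta = len(mut_seq) - len(wt_seq)
    if delta > 0:
        return f"net insertion of {delta} residue(s)"
    if delta < 0:
        return f"net deletion of {-delta} residue(s)"
    return "same-length substitution"


def load_results(handle):
    """Read the scored CSV and return rows sorted by ascending esm_score."""
    results = []
    for row in csv.DictReader(handle):
        results.append({
            "start_pos": int(row["start_pos"]),
            "esm_score": float(row["esm_score"]),
            "change": describe_change(row["wt_seq"], row["mut_seq"]),
        })
    return sorted(results, key=lambda r: r["esm_score"])


# The two example rows from this README; use open("esm_multi_residue_scores.csv") for a real file.
example = """wt_seq,mut_seq,start_pos,esm_score
FISHWISHFQRCHIPSTHATARECRISP,FISHWISHFQRCHEESETHATARECRISP,14,-1.0078125
MARGTYNMGKHFDA,MGTYNMGKHFDA,2,1.0056415
"""
results = load_results(io.StringIO(example))
for result in results:
    print(result["start_pos"], result["change"], result["esm_score"])
```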