Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling

Ankh is the first general-purpose protein language model trained on Google's TPU v4. It surpasses state-of-the-art performance with dramatically fewer parameters, making research innovation accessible with attainable resources.

This repository will be updated regularly with new pre-trained protein models, as part of supporting the biotech community in revolutionizing protein engineering using AI.

Installation
Models Availability
Datasets Availability
Usage
Original downstream Predictions
Followup use-cases
Comparisons to other tools
Community and Contributions
Have a question?
Found a bug?
Requirements
Sponsors
Team
License
Citation

Installation

python -m pip install ankh

Models Availability

Model	ankh	Hugging Face
Ankh Large	`ankh.load_large_model()`	Ankh Large
Ankh Base	`ankh.load_base_model()`	Ankh Base

Datasets Availability

Dataset	Hugging Face
Remote Homology	`load_dataset("proteinea/remote_homology")`
CASP12	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP12.csv']})`
CASP14	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP14.csv']})`
CB513	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CB513.csv']})`
TS115	`load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['TS115.csv']})`
DeepLoc	`load_dataset("proteinea/deeploc")`
Fluorescence	`load_dataset("proteinea/fluorescence")`
Solubility	`load_dataset("proteinea/solubility")`
Nearest Neighbor Search	`load_dataset("proteinea/nearest_neighbor_search")`

Usage

Loading pre-trained models:

  import ankh

  # To load large model:
  model, tokenizer = ankh.load_large_model()
  model.eval()


  # To load base model:
  model, tokenizer = ankh.load_base_model()
  model.eval()

Feature extraction example using Ankh Large:

  import ankh
  import torch

  model, tokenizer = ankh.load_large_model()
  model.eval()

  protein_sequences = ['MKALCLLLLPVLGLLVSSKTLCSMEEAINERIQEVAGSLIFRAISSIGLECQSVTSRGDLATCPRGFAVTGCTCGSACGSWDVRAETTCHCQCAGMDWTGARCCRVQPLEHHHHHH', 
  'GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSDIHFKTLDSNQSVETIEVEIILPR']

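  # Split each sequence into a list of residues, since the tokenizer is called with is_split_into_words=True.
  protein_sequences = [list(seq) for seq in protein_sequences]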


  outputs = tokenizer.batch_encode_plus(protein_sequences, 
                                    add_special_tokens=True, 
                                    padding=True, 
                                    is_split_into_words=True, 
                                    return_tensors="pt")
  with torch.no_grad():
    embeddings = model(input_ids=outputs['input_ids'], attention_mask=outputs['attention_mask'])
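
The call above yields one embedding per residue, while sequence-level tasks usually need one fixed-size vector per protein. The following is a minimal sketch of masked mean pooling, assuming the model output exposes a `last_hidden_state` tensor of shape (batch, length, dim), as Hugging Face encoder models generally do. `masked_mean_pool` is a hypothetical helper, not part of the `ankh` package, and random tensors stand in for real model output so the snippet runs without downloading weights.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average per-residue embeddings over non-padding positions.

    hidden_states: (batch, length, dim) float tensor.
    attention_mask: (batch, length) tensor, 1 for real tokens and 0 for padding.
    Returns a (batch, dim) tensor.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, length, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)                      # guard against empty rows
    return summed / counts


# Dummy stand-ins (assumed shapes) for embeddings.last_hidden_state
# and outputs['attention_mask'] from the example above.
torch.manual_seed(0)
hidden_states = torch.randn(2, 5, 8)
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled.shape)  # torch.Size([2, 8])
```

With real output this would be called as `masked_mean_pool(embeddings.last_hidden_state, outputs['attention_mask'])`. Note that any special tokens the tokenizer adds are covered by the attention mask and therefore included in the average unless they are masked out separately.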

Loading downstream models example:

  # To use downstream model for binary classification:
  binary_classification_model = ankh.ConvBertForBinaryClassification(input_dim=768, 
                                                                     nhead=4, 
                                                                     hidden_dim=384, 
                                                                     num_hidden_layers=1, 
                                                                     num_layers=1, 
                                                                     kernel_size=7, 
                                                                     dropout=0.2, 
                                                                     pooling='max')

  # To use downstream model for multiclass classification:
  multiclass_classification_model = ankh.ConvBertForMultiClassClassification(num_tokens=2, 
                                                                             input_dim=768, 
                                                                             nhead=4, 
                                                                             hidden_dim=384, 
                                                                             num_hidden_layers=1, 
                                                                             num_layers=1, 
                                                                             kernel_size=7, 
                                                                             dropout=0.2)

  # To use downstream model for regression:
  # training_labels_mean is an optional parameter used to initialize the output layer's bias,
  # which helps the model converge faster.
  regression_model = ankh.ConvBertForRegression(input_dim=768, 
                                                nhead=4, 
                                                hidden_dim=384, 
                                                num_hidden_layers=1, 
                                                num_layers=1, 
                                                kernel_size=7, 
                                                dropout=0, 
                                                pooling='max', 
                                                training_labels_mean=0.38145)

Original downstream Predictions

Secondary Structure Prediction (Q3):

Model	CASP12	CASP14	TS115	CB513
Ankh 2 Large	84.18%	76.82%	88.59%	88.78%
Ankh Large	83.59%	77.48%	88.22%	88.48%
Ankh Base	80.81%	76.67%	86.92%	86.94%
ProtT5-XL-UniRef50	83.34%	75.09%	86.82%	86.64%
ESM2-15B	83.16%	76.56%	87.50%	87.35%
ESM2-3B	83.14%	76.75%	87.50%	87.44%
ESM2-650M	82.43%	76.97%	87.22%	87.18%
ESM-1b	79.45%	75.39%	85.02%	84.31%

Secondary Structure Prediction (Q8):

Model	CASP12	CASP14	TS115	CB513
Ankh 2 Large	72.90%	62.84%	79.88%	79.01%
Ankh Large	71.69%	63.17%	79.10%	78.45%
Ankh Base	68.85%	62.33%	77.08%	75.83%
ProtT5-XL-UniRef50	70.47%	59.71%	76.91%	74.81%
ESM2-15B	71.17%	61.81%	77.67%	75.88%
ESM2-3B	71.69%	61.52%	77.62%	75.95%
ESM2-650M	70.50%	62.10%	77.68%	75.89%
ESM-1b	66.02%	60.34%	73.82%	71.55%
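
Q3 and Q8 in the two tables above are per-residue accuracies over three and eight secondary-structure states. The sketch below shows how such a score could be computed, assuming labels are strings of DSSP letters and assuming the common 8-to-3 reduction (H, G, I to H; E, B to E; everything else to C). The source does not state which reduction or averaging scheme was used, so both are assumptions, and the helper names are hypothetical.

```python
# Common DSSP 8-state to 3-state reduction (an assumption; conventions vary).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "C": "C",   # turns, bends, coil
}


def per_residue_accuracy(predicted, target):
    """Fraction of residues whose predicted state equals the target, pooled over all sequences."""
    correct = total = 0
    for pred_seq, true_seq in zip(predicted, target):
        if len(pred_seq) != len(true_seq):
            raise ValueError("prediction and target must have equal length")
        correct += sum(p == t for p, t in zip(pred_seq, true_seq))
        total += len(true_seq)
    return correct / total if total else 0.0


def to_three_state(labels):
    """Map each 8-state label string to its 3-state equivalent."""
    return ["".join(EIGHT_TO_THREE[state] for state in seq) for seq in labels]


# Toy labels, not taken from any benchmark.
target_q8 = ["HHHGTTEE", "CCSEEB"]
predicted_q8 = ["HHHHTTEE", "CCCEEE"]

q8 = per_residue_accuracy(predicted_q8, target_q8)
q3 = per_residue_accuracy(to_three_state(predicted_q8), to_three_state(target_q8))
print(f"Q8 = {q8:.2%}, Q3 = {q3:.2%}")  # Q8 = 78.57%, Q3 = 100.00%
```

The toy example illustrates why Q3 is never lower than Q8 for the same predictions under this reduction: confusions within a merged group (for example G predicted as H) count as errors for Q8 but not for Q3.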

Contact Prediction Long Precision Using Embeddings:

Model	ProteinNet (L/1)	ProteinNet (L/5)	CASP14 (L/1)	CASP14 (L/5)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	48.93%	73.49%	16.01%	29.91%
Ankh Base	43.21%	66.63%	13.50%	28.65%
ProtT5-XL-UniRef50	44.74%	68.95%	11.95%	24.45%
ESM2-15B	31.62%	52.97%	14.44%	26.61%
ESM2-3B	30.24%	51.34%	12.20%	21.91%
ESM2-650M	29.36%	50.74%	13.71%	22.25%
ESM-1b	29.25%	50.69%	10.18%	18.08%

Contact Prediction Long Precision Using Attention Scores:

Model	ProteinNet (L/1)	ProteinNet (L/5)	CASP14 (L/1)	CASP14 (L/5)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	31.44%	55.58%	11.05%	20.74%
Ankh Base	25.93%	46.28%	9.32%	19.51%
ProtT5-XL-UniRef50	30.85%	51.90%	8.60%	16.09%
ESM2-15B	33.32%	57.44%	12.25%	24.60%
ESM2-3B	33.92%	56.63%	12.17%	21.36%
ESM2-650M	31.87%	54.63%	10.66%	21.01%
ESM-1b	25.30%	42.03%	7.77%	15.77%
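
The L/1 and L/5 columns in the two contact prediction tables are read here as long-range precision: for a protein of length L, take the L/k highest-scoring residue pairs that are far apart in sequence and report the fraction that are true contacts. The sketch below is one way to compute that, assuming a minimum sequence separation of 24 for "long-range" (a common convention that the source does not confirm). `long_range_precision` is a hypothetical helper and the data are toy values.

```python
import numpy as np


def long_range_precision(scores, contacts, k=5, min_separation=24):
    """Precision of the top L/k scored residue pairs with j - i >= min_separation.

    scores: (L, L) array of predicted contact scores (higher means more likely).
    contacts: (L, L) boolean array of true contacts.
    """
    length = scores.shape[0]
    i, j = np.triu_indices(length, k=min_separation)  # upper-triangle pairs far apart in sequence
    if i.size == 0:
        return 0.0
    top_n = max(1, length // k)
    order = np.argsort(-scores[i, j], kind="stable")[:top_n]
    return float(contacts[i[order], j[order]].mean())


length = 30
contacts = np.zeros((length, length), dtype=bool)
scores = np.zeros((length, length))

# Six highest-scoring long-range pairs; four of them are true contacts.
ranked_pairs = [(0, 29), (1, 28), (2, 27), (0, 25), (3, 29), (4, 28)]
is_contact = [True, True, False, True, True, False]
for rank, ((a, b), hit) in enumerate(zip(ranked_pairs, is_contact)):
    scores[a, b] = 1.0 - 0.1 * rank
    contacts[a, b] = hit

p_l5 = long_range_precision(scores, contacts, k=5)  # top 30 // 5 = 6 pairs
p_l1 = long_range_precision(scores, contacts, k=1)  # top 30 pairs, capped at the 21 available
```

L/5 considers fewer, higher-confidence pairs than L/1, which is consistent with the L/5 columns being higher than the L/1 columns for every model in the tables.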

Localization (Q10):

Model	DeepLoc Dataset
Ankh 2 Large	82.57%
Ankh Large	83.01%
Ankh Base	81.38%
ProtT5-XL-UniRef50	82.95%
ESM2-15B	81.22%
ESM2-3B	81.22%
ESM2-650M	82.08%
ESM-1b	80.51%

Remote Homology:

Model	SCOPe (Fold)
Ankh 2 Large	62.09%
Ankh Large	61.01%
Ankh Base	61.14%
ProtT5-XL-UniRef50	59.38%
ESM2-15B	54.48%
ESM2-3B	59.24%
ESM2-650M	51.36%
ESM-1b	56.93%

Solubility:

Model	Solubility
Ankh 2 Large	75.86%
Ankh Large	76.41%
Ankh Base	76.36%
ProtT5-XL-UniRef50	76.26%
ESM2-15B	60.52%
ESM2-3B	74.91%
ESM2-650M	74.56%
ESM-1b	74.91%

Fluorescence (Spearman Correlation):

Model	Fluorescence
Ankh 2 Large	0.62
Ankh Large	0.62
Ankh Base	0.62
ProtT5-XL-UniRef50	0.61
ESM2-15B	0.56
ESM-1b	0.48
ESM2-650M	0.48
ESM2-3B	0.46
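
The fluorescence scores above are Spearman correlations between predicted and measured values, which is the Pearson correlation computed on ranks. A minimal dependency-free sketch follows; the function names and the toy data are illustrative only, and ties receive the mean of their ranks.

```python
def average_ranks(values):
    """Rank values from 1 to n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values, not taken from the fluorescence dataset.
measured = [3.8, 1.3, 3.6, 0.2, 2.9]
predicted = [3.1, 1.9, 3.5, 0.4, 1.7]
rho = spearman(measured, predicted)
print(round(rho, 2))  # 0.8
```

Because only ranks matter, a model is rewarded for ordering variants correctly even when its predicted values are on a different scale from the measurements.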

Nearest Neighbor Search using Global Pooling:

Model	Lookup69K (C)	Lookup69K (A)	Lookup69K (T)	Lookup69K (H)
Ankh 2 Large	In Progress	In Progress	In Progress	In Progress
Ankh Large	0.83	0.72	0.60	0.70
Ankh Base	0.85	0.77	0.63	0.72
ProtT5-XL-UniRef50	0.83	0.69	0.57	0.73
ESM2-15B	0.78	0.63	0.52	0.67
ESM2-3B	0.79	0.65	0.53	0.64
ESM2-650M	0.72	0.56	0.40	0.53
ESM-1b	0.78	0.65	0.51	0.63
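
The table above scores embeddings by nearest-neighbor lookup. The sketch below illustrates the general idea, assuming one pooled embedding per protein, cosine similarity as the distance measure, and a query counted as correct when its closest lookup entry carries the same label. The source does not spell out the exact protocol, so all three are assumptions; `nearest_neighbor_accuracy` is a hypothetical helper and the vectors are toy data.

```python
import numpy as np


def nearest_neighbor_accuracy(query_emb, query_labels, lookup_emb, lookup_labels):
    """Fraction of queries whose most cosine-similar lookup entry has the same label."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = lookup_emb / np.linalg.norm(lookup_emb, axis=1, keepdims=True)
    nearest = np.argmax(q @ d.T, axis=1)  # index of the best lookup match per query
    predicted = np.asarray(lookup_labels)[nearest]
    return float(np.mean(predicted == np.asarray(query_labels)))


lookup_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
lookup_labels = ["alpha", "beta", "alpha-beta"]

query_emb = np.array([[0.9, 0.1], [0.2, 0.8], [-0.5, 0.6]])
query_labels = ["alpha", "beta", "alpha-beta"]

# The third query is closest to the "beta" entry, so it is counted as a miss.
acc = nearest_neighbor_accuracy(query_emb, query_labels, lookup_emb, lookup_labels)
```

Running the same lookup with labels at different levels of a classification hierarchy would give one score per level, which is how the four columns of the table can be read.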

Team

Technical University of Munich:

Ahmed Elnaggar	Burkhard Rost

Proteinea:

Hazem Essam	Wafaa Ashraf	Walid Moustafa	Mohamed Elkerdawy

Columbia University:

Charlotte Rochereau

License

Ankh pretrained models are released under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Community and Contributions

The Ankh project is an open source project supported by various partner companies and research institutions. We are committed to sharing all our pre-trained models and knowledge. We would be more than happy to have your help in sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.

Have a question?

We are happy to hear your questions on our issues page, Ankh! If you have a private question or want to cooperate with us, you can always reach out to us directly via Hello.

Found a bug?

Feel free to file a new issue with a suitable title and description on the Ankh repository. If you have already found a solution to your problem, we would love to review your pull request!

✏️ Citation

If you use this code or our pretrained models for your publication, please cite the original paper:

@article{elnaggar2023ankh,
  title={Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling},
  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
  journal={arXiv preprint arXiv:2301.06568},
  year={2023}
}
