Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling
Ankh is the first general-purpose protein language model trained on Google's TPU-V4. It surpasses state-of-the-art performance with dramatically fewer parameters, making research innovation more accessible via attainable resources.
This repository will be updated regularly with new pre-trained protein models, as part of supporting the biotech community in revolutionizing protein engineering using AI.
import ankh

# To load the large model:
model, tokenizer = ankh.load_large_model()
model.eval()
# To load the base model:
model, tokenizer = ankh.load_base_model()
model.eval()
# To use the downstream model for binary classification:
binary_classification_model = ankh.ConvBertForBinaryClassification(
    input_dim=768,
    nhead=4,
    hidden_dim=384,
    num_hidden_layers=1,
    num_layers=1,
    kernel_size=7,
    dropout=0.2,
    pooling='max')
# To use the downstream model for multiclass classification:
multiclass_classification_model = ankh.ConvBertForMultiClassClassification(
    num_tokens=2,
    input_dim=768,
    nhead=4,
    hidden_dim=384,
    num_hidden_layers=1,
    num_layers=1,
    kernel_size=7,
    dropout=0.2)
# To use the downstream model for regression:
# training_labels_mean is an optional parameter. It is used to fill the
# output layer's bias, which is useful for faster convergence.
regression_model = ankh.ConvBertForRegression(
    input_dim=768,
    nhead=4,
    hidden_dim=384,
    num_hidden_layers=1,
    num_layers=1,
    kernel_size=7,
    dropout=0,
    pooling='max',
    training_labels_mean=0.38145)
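The effect of `training_labels_mean` can be illustrated without any deep learning library. The sketch below is not Ankh code: the label values and the helper names `mse` and `constant_predictions` are hypothetical, and it assumes a freshly initialised regression head with near-zero weights, so that its first predictions are roughly equal to its bias.

```python
from statistics import fmean


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return fmean((p - t) ** 2 for p, t in zip(predictions, targets))


def constant_predictions(bias, n):
    """Assumption: an untrained head with near-zero weights outputs ~bias."""
    return [bias] * n


# Hypothetical regression labels, for illustration only.
labels = [0.31, 0.42, 0.38, 0.45, 0.35]
label_mean = fmean(labels)

# Starting loss with a zero bias versus a bias set to the label mean.
loss_zero_bias = mse(constant_predictions(0.0, len(labels)), labels)
loss_mean_bias = mse(constant_predictions(label_mean, len(labels)), labels)

print(f"initial MSE, bias = 0:    {loss_zero_bias:.4f}")
print(f"initial MSE, bias = mean: {loss_mean_bias:.4f}")
```

Under these assumptions the mean-initialised head starts from a lower loss, because the optimiser no longer has to spend its first steps learning the overall offset of the labels.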
Original downstream Predictions
Secondary Structure Prediction (Q3):
| Model | CASP12 | CASP14 (HARD) | TS115 | CB513 |
|---|---|---|---|---|
| Ankh Large | 83.59% | 77.48% | 88.22% | 88.48% |
| Ankh Base | 80.81% | 76.67% | 86.92% | 86.94% |
| ProtT5-XL-UniRef50 | 83.34% | 75.09% | 86.82% | 86.64% |
| ESM2-15B | 83.16% | 76.56% | 87.50% | 87.35% |
| ESM2-3B | 83.14% | 76.75% | 87.50% | 87.44% |
| ESM2-650M | 82.43% | 76.97% | 87.22% | 87.18% |
| ESM-1b | 79.45% | 75.39% | 85.02% | 84.31% |
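For readers unfamiliar with the metric, Q3 is conventionally the per-residue accuracy over three secondary structure states (helix, strand, coil). The following is a minimal sketch of that calculation for a single sequence. It is not the evaluation code behind the numbers above; the function name `per_residue_accuracy` and the example strings are hypothetical, and benchmark pipelines may additionally mask unresolved residues or pool positions over a whole dataset.

```python
def per_residue_accuracy(predicted, reference):
    """Fraction of positions where the predicted state equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Hypothetical 10-residue example: H = helix, E = strand, C = coil.
predicted = "HHHHCCEEEC"
reference = "HHHCCCEEEE"
q3 = per_residue_accuracy(predicted, reference)
print(f"Q3 = {q3:.2%}")  # 8 of 10 positions agree
```

The same function applies unchanged to eight-state labels, since it only compares symbols position by position.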
Secondary Structure Prediction (Q8):
| Model | CASP12 | CASP14 (HARD) | TS115 | CB513 |
|---|---|---|---|---|
| Ankh Large | 71.69% | 63.17% | 79.10% | 78.45% |
| Ankh Base | 68.85% | 62.33% | 77.08% | 75.83% |
| ProtT5-XL-UniRef50 | 70.47% | 59.71% | 76.91% | 74.81% |
| ESM2-15B | 71.17% | 61.81% | 77.67% | 75.88% |
| ESM2-3B | 71.69% | 61.52% | 77.62% | 75.95% |
| ESM2-650M | 70.50% | 62.10% | 77.68% | 75.89% |
| ESM-1b | 66.02% | 60.34% | 73.82% | 71.55% |
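Q8 scores use the eight DSSP secondary structure states, while Q3 collapses them into three. One widely used reduction is sketched below; it is an assumption for illustration, not necessarily the mapping used to produce the results in this README, and the names `Q8_TO_Q3` and `reduce_q8_to_q3` are hypothetical.

```python
# One common DSSP reduction: helices (H, G, I) -> H, strands (E, B) -> E,
# everything else (turn, bend, coil) -> C. Other conventions exist.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C", "L": "C", "-": "C",
}


def reduce_q8_to_q3(q8_labels):
    """Map a string of eight-state labels to three-state labels."""
    try:
        return "".join(Q8_TO_Q3[state] for state in q8_labels)
    except KeyError as err:
        raise ValueError(f"unknown secondary structure state: {err}") from None


q8_example = "GGGHHHTTEEBSC"
q3_example = reduce_q8_to_q3(q8_example)
print(q3_example)  # HHHHHHCCEEECC
```

Because eight-state prediction has to distinguish between classes that the three-state task merges, Q8 accuracies are expected to be lower than Q3 accuracies on the same data, which is consistent with the two tables.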
Contact Prediction Long Precision Using Embeddings:
| Model | ProteinNet (L/1) | ProteinNet (L/5) | CASP14 (L/1) | CASP14 (L/5) |
|---|---|---|---|---|
| Ankh Large | 48.93% | 73.49% | 16.01% | 29.91% |
| Ankh Base | 43.21% | 66.63% | 13.50% | 28.65% |
| ProtT5-XL-UniRef50 | 44.74% | 68.95% | 11.95% | 24.45% |
| ESM2-15B | 31.62% | 52.97% | 14.44% | 26.61% |
| ESM2-3B | 30.24% | 51.34% | 12.20% | 21.91% |
| ESM2-650M | 29.36% | 50.74% | 13.71% | 22.25% |
| ESM-1b | 29.25% | 50.69% | 10.18% | 18.08% |
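The L/1 and L/5 columns refer to precision over the top L/k highest-scoring long-range residue pairs, where L is the sequence length. A minimal sketch of that metric follows. It is not the repository's evaluation code: the function name `long_range_precision` is hypothetical, and the default minimum sequence separation of 24 residues is the usual CASP convention for "long-range", assumed here rather than taken from this README.

```python
def long_range_precision(scores, contacts, k=1, min_separation=24):
    """Precision of the top L/k predicted pairs with |i - j| >= min_separation.

    scores:   L x L nested list of predicted contact scores
    contacts: L x L nested list of booleans (True = residues in contact)
    """
    length = len(scores)
    # Each unordered pair (i, j) is considered once, with j far enough from i.
    candidates = [
        (scores[i][j], i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    if not candidates:
        raise ValueError("sequence too short for the requested separation")
    top_n = min(max(1, length // k), len(candidates))
    candidates.sort(key=lambda item: item[0], reverse=True)
    hits = sum(1 for _, i, j in candidates[:top_n] if contacts[i][j])
    return hits / top_n


# Toy example with a reduced separation so a 10-residue sequence has pairs.
length = 10
scores = [[0.0] * length for _ in range(length)]
contacts = [[False] * length for _ in range(length)]
scores[0][9], scores[1][8], scores[2][9] = 0.9, 0.8, 0.7
contacts[0][9] = contacts[2][9] = True

p_l5 = long_range_precision(scores, contacts, k=5, min_separation=6)
p_l1 = long_range_precision(scores, contacts, k=1, min_separation=6)
print(p_l5, p_l1)  # 0.5 0.2
```

In the toy example, L/5 keeps only the two most confident pairs, one of which is a true contact, while L/1 keeps all ten eligible pairs, two of which are true contacts. This is why L/5 precision is typically higher than L/1 precision in the table above.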
Contact Prediction Long Precision Using Attention Scores:
The Ankh project is an open source project supported by various partner companies and research institutions. We are committed to sharing all our pre-trained models and knowledge. We would be more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.
Have a question?
We are happy to hear your questions on the Ankh issues page! If you have a private question or want to cooperate with us, you can always reach out to us directly via Hello.
Found a bug?
Feel free to file a new issue with a descriptive title and description on the Ankh repository. If you have already found a solution to your problem, we would love to review your pull request!
✏️ Citation
If you use this code or our pretrained models for your publication, please cite the original paper:
@article{elnaggar2023ankh,
title={Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling},
author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
journal={arXiv preprint arXiv:2301.06568},
year={2023}
}