Ankh: Optimized Protein Language Model

Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling



Ankh is the first general-purpose protein language model trained on Google's TPU-V4. It surpasses state-of-the-art performance with dramatically fewer parameters, making research innovation accessible with attainable resources.

This repository will be updated regularly with new pre-trained protein models, as part of supporting the biotech community in revolutionizing protein engineering using AI.

  Installation

python -m pip install ankh

  Models Availability

| Model | ankh | Hugging Face |
|---|---|---|
| Ankh Large | ankh.load_large_model() | Ankh Large |
| Ankh Base | ankh.load_base_model() | Ankh Base |

  Datasets Availability

| Dataset | Hugging Face |
|---|---|
| Remote Homology | load_dataset("proteinea/remote_homology") |
| CASP12 | load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP12.csv']}) |
| CASP14 | load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CASP14.csv']}) |
| CB513 | load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['CB513.csv']}) |
| TS115 | load_dataset("proteinea/secondary_structure_prediction", data_files={'test': ['TS115.csv']}) |
| DeepLoc | load_dataset("proteinea/deeploc") |
| Fluorescence | load_dataset("proteinea/fluorescence") |
| Solubility | load_dataset("proteinea/solubility") |
| Nearest Neighbor Search | load_dataset("proteinea/nearest_neighbor_search") |
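
The four secondary-structure test sets listed above share one Hugging Face repository and differ only in the CSV file passed through `data_files`. A minimal sketch of a helper that builds those arguments follows. The names `ssp_test_kwargs` and `load_ssp_test_set` are hypothetical and not part of the `ankh` package, and the loader assumes the `datasets` library is installed and network access is available.

```python
SSP_REPO = "proteinea/secondary_structure_prediction"
SSP_TEST_SETS = ("CASP12", "CASP14", "CB513", "TS115")


def ssp_test_kwargs(name):
    """Build load_dataset keyword arguments for one test set (hypothetical helper)."""
    if name not in SSP_TEST_SETS:
        raise ValueError(f"unknown test set {name!r}; expected one of {SSP_TEST_SETS}")
    return {"path": SSP_REPO, "data_files": {"test": [f"{name}.csv"]}}


def load_ssp_test_set(name):
    """Download one test set; needs network access and the `datasets` package."""
    # Imported lazily so that ssp_test_kwargs can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(**ssp_test_kwargs(name))["test"]
```

Calling `load_ssp_test_set("CB513")` would then be equivalent to the CB513 row of the table, followed by selecting the `test` split.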

  Usage

  • Loading pre-trained models:
  import ankh

  # To load the large model:
  model, tokenizer = ankh.load_large_model()
  model.eval()


  # To load the base model:
  model, tokenizer = ankh.load_base_model()
  model.eval()
  • Feature extraction example using Ankh Large:
  import torch
  import ankh

  model, tokenizer = ankh.load_large_model()
  model.eval()

  protein_sequences = ['MKALCLLLLPVLGLLVSSKTLCSMEEAINERIQEVAGSLIFRAISSIGLECQSVTSRGDLATCPRGFAVTGCTCGSACGSWDVRAETTCHCQCAGMDWTGARCCRVQPLEHHHHHH', 
  'GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSDIHFKTLDSNQSVETIEVEIILPR']

  # Split each sequence into single residues, matching is_split_into_words=True below.
  protein_sequences = [list(seq) for seq in protein_sequences]


  outputs = tokenizer.batch_encode_plus(protein_sequences, 
                                    add_special_tokens=True, 
                                    padding=True, 
                                    is_split_into_words=True, 
                                    return_tensors="pt")
  # Inference only: disable gradient tracking to save memory.
  with torch.no_grad():
    embeddings = model(input_ids=outputs['input_ids'], attention_mask=outputs['attention_mask'])
  • Loading downstream models example:
  # To use downstream model for binary classification:
  binary_classification_model = ankh.ConvBertForBinaryClassification(input_dim=768, 
                                                                     nhead=4, 
                                                                     hidden_dim=384, 
                                                                     num_hidden_layers=1, 
                                                                     num_layers=1, 
                                                                     kernel_size=7, 
                                                                     dropout=0.2, 
                                                                     pooling='max')

  # To use downstream model for multiclass classification:
  multiclass_classification_model = ankh.ConvBertForMultiClassClassification(num_tokens=2, 
                                                                             input_dim=768, 
                                                                             nhead=4, 
                                                                             hidden_dim=384, 
                                                                             num_hidden_layers=1, 
                                                                             num_layers=1, 
                                                                             kernel_size=7, 
                                                                             dropout=0.2)

  # To use downstream model for regression:
  # training_labels_mean is an optional parameter used to initialize the output layer's bias,
  # which can help the model converge faster.
  regression_model = ankh.ConvBertForRegression(input_dim=768, 
                                                nhead=4, 
                                                hidden_dim=384, 
                                                num_hidden_layers=1, 
                                                num_layers=1, 
                                                kernel_size=7, 
                                                dropout=0, 
                                                pooling='max', 
                                                training_labels_mean=0.38145)
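
The feature-extraction example yields one embedding per residue, whereas sequence-level tasks need one vector per protein; the `pooling='max'` argument in the downstream models above appears to select such a reduction. As a rough illustration of the idea, and not the `ankh` implementation, the sketch below pools padded per-residue embeddings using the attention mask. The function names are hypothetical, and it assumes the encoder output exposes `last_hidden_state`, as Hugging Face encoder outputs usually do.

```python
import torch


def masked_mean_pool(hidden_states, attention_mask):
    """Average residue embeddings over non-padding positions.

    hidden_states:  (batch, length, dim) float tensor
    attention_mask: (batch, length) tensor, 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


def masked_max_pool(hidden_states, attention_mask):
    """Take the per-dimension maximum over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).bool()
    filled = hidden_states.masked_fill(~mask, float("-inf"))
    return filled.max(dim=1).values


# Toy batch: one sequence with two real tokens and one padding position.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
mean_vec = masked_mean_pool(hidden, mask)  # the padding row is ignored
max_vec = masked_max_pool(hidden, mask)
```

With the variables from the feature-extraction example, the call would look like `masked_mean_pool(embeddings.last_hidden_state, outputs['attention_mask'])`. The mask also covers any special tokens the tokenizer appends, so those contribute to the pooled vector unless they are masked out separately.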

  Original downstream Predictions

  •   Secondary Structure Prediction (Q3):
| Model | CASP12 | CASP14 | TS115 | CB513 |
|---|---|---|---|---|
| Ankh 2 Large | 84.18% | 76.82% | 88.59% | 88.78% |
| Ankh Large | 83.59% | 77.48% | 88.22% | 88.48% |
| Ankh Base | 80.81% | 76.67% | 86.92% | 86.94% |
| ProtT5-XL-UniRef50 | 83.34% | 75.09% | 86.82% | 86.64% |
| ESM2-15B | 83.16% | 76.56% | 87.50% | 87.35% |
| ESM2-3B | 83.14% | 76.75% | 87.50% | 87.44% |
| ESM2-650M | 82.43% | 76.97% | 87.22% | 87.18% |
| ESM-1b | 79.45% | 75.39% | 85.02% | 84.31% |

  •   Secondary Structure Prediction (Q8):
| Model | CASP12 | CASP14 | TS115 | CB513 |
|---|---|---|---|---|
| Ankh 2 Large | 72.90% | 62.84% | 79.88% | 79.01% |
| Ankh Large | 71.69% | 63.17% | 79.10% | 78.45% |
| Ankh Base | 68.85% | 62.33% | 77.08% | 75.83% |
| ProtT5-XL-UniRef50 | 70.47% | 59.71% | 76.91% | 74.81% |
| ESM2-15B | 71.17% | 61.81% | 77.67% | 75.88% |
| ESM2-3B | 71.69% | 61.52% | 77.62% | 75.95% |
| ESM2-650M | 70.50% | 62.10% | 77.68% | 75.89% |
| ESM-1b | 66.02% | 60.34% | 73.82% | 71.55% |
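
Q3 and Q8 are per-residue accuracies over three and eight secondary-structure classes. The sketch below shows how such a score can be computed while skipping masked positions. It is an illustration rather than the evaluation code behind these tables: the function names are hypothetical, and the eight-to-three mapping is the commonly used DSSP reduction, assumed here rather than taken from this repository.

```python
# Common DSSP reduction from 8 states to 3 (an assumption, not from this repository).
Q8_TO_Q3 = {"H": "H", "G": "H", "I": "H",
            "E": "E", "B": "E",
            "T": "C", "S": "C", "C": "C"}


def per_residue_accuracy(predicted, target, ignore="-"):
    """Fraction of residues whose predicted state equals the target state.

    Positions where the target equals `ignore` (e.g. unresolved residues) are skipped.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    pairs = [(p, t) for p, t in zip(predicted, target) if t != ignore]
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)


def to_q3(states):
    """Map 8-state labels to 3-state labels; unknown symbols pass through unchanged."""
    return "".join(Q8_TO_Q3.get(s, s) for s in states)


target_q8 = "HHGEEBTSC"
predicted_q8 = "HHHEEETTC"
q8 = per_residue_accuracy(predicted_q8, target_q8)                # 6 of 9 residues match
q3 = per_residue_accuracy(to_q3(predicted_q8), to_q3(target_q8))  # all 9 match after reduction
```

The toy strings show why Q3 is always at least as high as Q8 for the same prediction: confusions within one reduced class (for example G versus H) stop counting as errors.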

  •   Contact Prediction Long Precision Using Embeddings:
| Model | ProteinNet (L/1) | ProteinNet (L/5) | CASP14 (L/1) | CASP14 (L/5) |
|---|---|---|---|---|
| Ankh 2 Large | In Progress | In Progress | In Progress | In Progress |
| Ankh Large | 48.93% | 73.49% | 16.01% | 29.91% |
| Ankh Base | 43.21% | 66.63% | 13.50% | 28.65% |
| ProtT5-XL-UniRef50 | 44.74% | 68.95% | 11.95% | 24.45% |
| ESM2-15B | 31.62% | 52.97% | 14.44% | 26.61% |
| ESM2-3B | 30.24% | 51.34% | 12.20% | 21.91% |
| ESM2-650M | 29.36% | 50.74% | 13.71% | 22.25% |
| ESM-1b | 29.25% | 50.69% | 10.18% | 18.08% |

  •   Contact Prediction Long Precision Using Attention Scores:
| Model | ProteinNet (L/1) | ProteinNet (L/5) | CASP14 (L/1) | CASP14 (L/5) |
|---|---|---|---|---|
| Ankh 2 Large | In Progress | In Progress | In Progress | In Progress |
| Ankh Large | 31.44% | 55.58% | 11.05% | 20.74% |
| Ankh Base | 25.93% | 46.28% | 9.32% | 19.51% |
| ProtT5-XL-UniRef50 | 30.85% | 51.90% | 8.60% | 16.09% |
| ESM2-15B | 33.32% | 57.44% | 12.25% | 24.60% |
| ESM2-3B | 33.92% | 56.63% | 12.17% | 21.36% |
| ESM2-650M | 31.87% | 54.63% | 10.66% | 21.01% |
| ESM-1b | 25.30% | 42.03% | 7.77% | 15.77% |
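
The L/1 and L/5 columns in the two contact-prediction tables report precision over the top L/k highest-scoring residue pairs, where L is the sequence length. A minimal sketch of that metric follows. The function name is hypothetical, and the default minimum sequence separation of 24 residues for long-range contacts is the usual convention, assumed here rather than confirmed by this repository.

```python
def long_range_precision(scores, contacts, k=1, min_separation=24):
    """Precision of the top L/k predicted long-range contacts.

    scores:   L x L nested list of predicted contact scores
    contacts: L x L nested list of 0/1 ground-truth contacts
    """
    length = len(scores)
    # Candidate pairs (i < j) separated by at least `min_separation` residues.
    pairs = [(scores[i][j], i, j)
             for i in range(length)
             for j in range(i + min_separation, length)]
    top_n = max(1, length // k)
    top = sorted(pairs, reverse=True)[:top_n]
    if not top:
        return 0.0
    hits = sum(contacts[i][j] for _, i, j in top)
    return hits / len(top)


# Toy example with a short sequence, so the separation threshold is lowered to 3.
L = 6
scores = [[0.1] * L for _ in range(L)]
contacts = [[0] * L for _ in range(L)]
scores[0][3], scores[1][5], scores[0][5] = 0.9, 0.8, 0.7
contacts[0][3] = contacts[0][5] = 1
p_l3 = long_range_precision(scores, contacts, k=3, min_separation=3)  # top 2 pairs
p_l1 = long_range_precision(scores, contacts, k=1, min_separation=3)  # top 6 pairs
```

Smaller cut-offs such as L/5 keep only the most confident predictions, which is why the L/5 columns are consistently higher than the L/1 columns.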

  •   Localization (Q10):
| Model | DeepLoc Dataset |
|---|---|
| Ankh 2 Large | 82.57% |
| Ankh Large | 83.01% |
| Ankh Base | 81.38% |
| ProtT5-XL-UniRef50 | 82.95% |
| ESM2-15B | 81.22% |
| ESM2-3B | 81.22% |
| ESM2-650M | 82.08% |
| ESM-1b | 80.51% |

  •   Remote Homology:
| Model | SCOPe (Fold) |
|---|---|
| Ankh 2 Large | 62.09% |
| Ankh Large | 61.01% |
| Ankh Base | 61.14% |
| ProtT5-XL-UniRef50 | 59.38% |
| ESM2-15B | 54.48% |
| ESM2-3B | 59.24% |
| ESM2-650M | 51.36% |
| ESM-1b | 56.93% |

  •   Solubility:
| Model | Solubility |
|---|---|
| Ankh 2 Large | 75.86% |
| Ankh Large | 76.41% |
| Ankh Base | 76.36% |
| ProtT5-XL-UniRef50 | 76.26% |
| ESM2-15B | 60.52% |
| ESM2-3B | 74.91% |
| ESM2-650M | 74.56% |
| ESM-1b | 74.91% |

  •   Fluorescence (Spearman Correlation):
| Model | Fluorescence |
|---|---|
| Ankh 2 Large | 0.62 |
| Ankh Large | 0.62 |
| Ankh Base | 0.62 |
| ProtT5-XL-UniRef50 | 0.61 |
| ESM2-15B | 0.56 |
| ESM-1b | 0.48 |
| ESM2-650M | 0.48 |
| ESM2-3B | 0.46 |
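
Fluorescence is a regression task, so the table reports Spearman's rank correlation between predicted and measured values. The sketch below computes it in plain Python as the Pearson correlation of average ranks. In practice a library routine such as `scipy.stats.spearmanr` would normally be used; this version, with hypothetical function names and made-up toy values, is only meant to show what the number measures.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for idx in order[start:end + 1]:
            ranks[idx] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy values: the prediction swaps the order of the two middle proteins.
measured = [0.1, 0.4, 0.35, 0.8]
predicted = [0.2, 0.5, 0.6, 0.9]
rho = spearman(predicted, measured)
```

Because only ranks matter, a model can score well here even if its predicted values are on a different scale from the measurements, as long as it orders the variants correctly.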

  •   Nearest Neighbor Search using Global Pooling:
| Model | Lookup69K (C) | Lookup69K (A) | Lookup69K (T) | Lookup69K (H) |
|---|---|---|---|---|
| Ankh 2 Large | In Progress | In Progress | In Progress | In Progress |
| Ankh Large | 0.83 | 0.72 | 0.60 | 0.70 |
| Ankh Base | 0.85 | 0.77 | 0.63 | 0.72 |
| ProtT5-XL-UniRef50 | 0.83 | 0.69 | 0.57 | 0.73 |
| ESM2-15B | 0.78 | 0.63 | 0.52 | 0.67 |
| ESM2-3B | 0.79 | 0.65 | 0.53 | 0.64 |
| ESM2-650M | 0.72 | 0.56 | 0.40 | 0.53 |
| ESM-1b | 0.78 | 0.65 | 0.51 | 0.63 |

  Team

  • Technical University of Munich:
Ahmed Elnaggar, Burkhard Rost
  • Proteinea:
Hazem Essam, Wafaa Ashraf, Walid Moustafa, Mohamed Elkerdawy
  • University of Columbia:
Charlotte Rochereau

  Sponsors

Google Cloud

  License

Ankh pretrained models are released under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

  Community and Contributions

The Ankh project is an open-source project supported by various partner companies and research institutions. We are committed to sharing all our pre-trained models and knowledge. We would be more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.

  Have a question?

We are happy to hear your questions on the Ankh issues page! If you have a private question or want to cooperate with us, you can always reach out to us directly via Hello.

  Found a bug?

Feel free to file a new issue with an appropriate title and description on the Ankh repository. If you have already found a solution to your problem, we would love to review your pull request!

✏️  Citation

If you use this code or our pretrained models for your publication, please cite the original paper:

@article{elnaggar2023ankh,
  title={Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling},
  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
  journal={arXiv preprint arXiv:2301.06568},
  year={2023}
}