TRILL

Sandbox for Deep-Learning-Based Computational Protein Design

Primary language: Python. License: MIT.

                          _____________________.___.____    .____     
                          \__    ___/\______   \   |    |   |    |    
                            |    |    |       _/   |    |   |    |    
                            |    |    |    |   \   |    |___|    |___ 
                            |____|    |____|_  /___|_______ \_______ \
                                             \/            \/       \/

Intro

TRILL (TRaining and Inference using the Language of Life) is a sandbox for creative protein engineering and discovery. As a bioengineer myself, I find deep-learning-based approaches to protein design and analysis of great interest. However, many of these models are unwieldy because of their sheer size, especially for people who are not ML practitioners. TRILL lets researchers run inference on their proteins of interest with a variety of models, and it also democratizes the efficient fine-tuning of large language models. Whether you use Google Colab with one GPU or a supercomputer with many, TRILL lets scientists leverage models with millions to billions of parameters without worrying (too much) about hardware constraints. As of v1.8.0, TRILL supports the following models:

Breakdown of TRILL's Commands

| Command | Function | Available Models |
|---|---|---|
| Embed | Generates numerical representations, or "embeddings", of protein sequences for quantitative analysis and comparison. | ESM2, ProtT5-XL, ProstT5, Ankh |
| Visualize | Creates interactive 2D visualizations of embeddings for exploratory data analysis. | PCA, t-SNE, UMAP |
| Finetune | Finetunes protein language models for specific tasks. | ESM2, ProtGPT2, ZymCTRL |
| Language Model Protein Generation | Generates proteins using pretrained language models. | ESM2, ProtGPT2, ZymCTRL |
| Inverse Folding Protein Generation | Designs proteins to fold into specific 3D structures. | ESM-IF1, LigandMPNN, ProstT5 |
| Diffusion Based Protein Generation | Uses denoising diffusion models to generate proteins. | RFDiffusion |
| Fold | Predicts 3D protein structures. | ESMFold, ProstT5 |
| Dock | Simulates protein-ligand interactions. | DiffDock, Smina, AutoDock Vina, LightDock, GeoDock |
| Classify | Predicts protein properties with pretrained models or trains custom classifiers. | TemStaPro, EpHod, ECPICK, LightGBM, XGBoost, Isolation Forest |
| Regress | Trains custom regression models. | LightGBM, Linear |
| Simulate | Uses molecular dynamics to simulate protein-ligand interactions. | OpenMM |
| Score | Uses ESM1v or ESM2 to score protein sequences, or ProteinMPNN to score protein structures, in a zero-shot manner. | COMPSS |
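
To make the idea behind the Embed command more concrete, here is a minimal, standard-library-only sketch of what "embedding sequences for quantitative comparison" means. It is not TRILL's code and does not use TRILL's CLI or any of the models listed above. As a stated assumption, a toy amino-acid composition vector stands in for a learned embedding from a model such as ESM2, and the helper names `composition_embedding` and `cosine_similarity` as well as the example sequences are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_embedding(seq: str) -> List[float]:
    """Toy stand-in for a learned embedding: fraction of each amino acid.

    Assumption: a real protein language model would return a much richer,
    context-aware vector; this only illustrates the fixed-length idea.
    """
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical example sequences, made up for illustration only.
sequences: Dict[str, str] = {
    "query": "MKTAYIAKQRQISFVKSHFSRQ",
    "close_variant": "MKTAYIAKQRQISFVKSHFSRE",  # one substitution
    "poly_gly": "GGGGSGGGGSGGGGS",  # unrelated linker-like sequence
}

# Every sequence maps to a vector of the same length, so they can be compared.
embeddings = {name: composition_embedding(s) for name, s in sequences.items()}

for name in ("close_variant", "poly_gly"):
    sim = cosine_similarity(embeddings["query"], embeddings[name])
    print(f"query vs {name}: {sim:.3f}")
```

The point of the sketch is only that once sequences of different lengths are mapped to vectors of one fixed length, similarity becomes a simple numeric computation. Learned embeddings serve the same role but capture far more than residue composition.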

Documentation

Check out the documentation and examples at https://trill.readthedocs.io/en/latest/index.html