ProtEnc: generate protein embeddings the easy way

Primary language: Python. License: MIT.

ProtEnc simplifies the extraction of protein embeddings from pre-trained models by providing a simple API and bulk generation scripts for the ever-growing landscape of protein language models (pLMs). The currently supported models can be listed with protenc.list_models().

Usage

Installation

pip install protenc

Python API

import protenc

# List available models
print(protenc.list_models())

# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

for embed in encoder(proteins, return_format='numpy'):
  # Embeddings have shape [L, D], where L is the sequence length and D is the embedding dimensionality.
  print(embed.shape)
  
  # Derive a single per-protein embedding vector of shape [D] by averaging along the sequence dimension
  pooled = embed.mean(0)
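
The averaging step above can be sketched without loading a model. This is a minimal illustration that uses random NumPy arrays as stand-ins for encoder output; the sequence lengths, the embedding size of 640, and the helper name mean_pool are assumptions chosen for the example and are not part of ProtEnc.

```python
import numpy as np

# Stand-ins for encoder output: one [L, D] array per protein.
# The lengths (65, 71) and D=640 are illustrative values only.
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(65, 640)), rng.normal(size=(71, 640))]

def mean_pool(per_residue):
    """Average a [L, D] array over the sequence axis, giving a [D] vector."""
    return per_residue.mean(axis=0)

# Proteins of different lengths end up as rows of one fixed-size matrix.
pooled = np.stack([mean_pool(e) for e in embeddings])
print(pooled.shape)  # (2, 640)
```

Mean pooling discards positional information, but it gives every protein a vector of the same size regardless of its length, which is what most downstream classifiers expect.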

Command-line interface

After installation, use the protenc shell command for bulk generation and export of protein embeddings.

protenc sequences.fasta embeddings.lmdb --model_name=<name-of-model>

By default, input and output formats are inferred from the file extensions.
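
Extension-based format inference can be pictured roughly as follows. This is only a sketch of the general idea, not ProtEnc's actual implementation: the infer_format helper and the FORMATS table are hypothetical, and the table contains only the two extensions that appear in this README.

```python
from pathlib import Path

# Hypothetical extension-to-format table; only .fasta and .lmdb
# are taken from this README, and ProtEnc may handle others.
FORMATS = {'.fasta': 'fasta', '.lmdb': 'lmdb'}

def infer_format(path):
    """Return a format name based on the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f'Cannot infer format from extension {suffix!r}')
    return FORMATS[suffix]

print(infer_format('sequences.fasta'))  # fasta
print(infer_format('embeddings.lmdb'))  # lmdb
```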

Run

protenc --help

for a detailed usage description.
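
If the sequences are held in Python rather than in a file, a FASTA input for the command above can be written in a few lines. This sketch relies only on the standard FASTA layout (a ">" header line followed by the sequence); the write_fasta helper and the labels protein_1 and protein_2 are hypothetical names chosen for the example.

```python
import tempfile
from pathlib import Path

# Hypothetical labels; the sequences are the two from the Python API example.
records = {
    'protein_1': 'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
    'protein_2': 'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE',
}

def write_fasta(records, path):
    """Write one '>label' header line followed by one sequence line per record."""
    with open(path, 'w') as handle:
        for label, sequence in records.items():
            handle.write(f'>{label}\n{sequence}\n')

# A temporary directory keeps the sketch free of side effects.
fasta_path = Path(tempfile.mkdtemp()) / 'sequences.fasta'
write_fasta(records, fasta_path)
print(fasta_path.read_text())
```

The resulting file can then be passed as the first argument to the protenc command.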

Example

Generate protein embeddings with the ESM2 650M model for the sequences in a FASTA file and write them to an LMDB store:

protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D

The generated embeddings are stored in an LMDB key-value store and can be read back with the read_from_lmdb utility function:

from protenc.utils import read_from_lmdb

for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
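
A loop over (label, embedding) pairs like the one above is often used to build a labelled matrix for downstream analysis. The sketch below shows that pattern with a hypothetical fake_reader generator standing in for read_from_lmdb, so it runs without an LMDB file. It assumes the stored embeddings are [L, D] arrays as in the Python API section; the labels, lengths, and D=1280 are illustrative values.

```python
import numpy as np

def fake_reader():
    """Hypothetical stand-in yielding (label, [L, D] array) pairs."""
    rng = np.random.default_rng(0)
    for label, length in [('protein_1', 65), ('protein_2', 71)]:
        yield label, rng.normal(size=(length, 1280))

labels, vectors = [], []
for label, embed in fake_reader():
    labels.append(label)
    # Reduce each [L, D] array to a single [D] vector.
    vectors.append(embed.mean(axis=0))

matrix = np.stack(vectors)                          # shape [N, D]
row_of = {label: i for i, label in enumerate(labels)}

print(labels, matrix.shape)  # ['protein_1', 'protein_2'] (2, 1280)
print(matrix[row_of['protein_2']].shape)  # (1280,)
```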

Features

Input formats:

Output format:

General:

  • Multi-GPU inference (--data_parallel)
  • FP16 inference (--amp)

Development

Clone the repository:

git clone https://github.com/kklemon/protenc.git

Install dependencies via Poetry:

poetry install

Contribution

Have a feature idea or found a bug? Would you love to see support for a new model? Feel free to create an issue.

Todo

  • Support for more input formats
    • CSV
    • Parquet
    • FASTA
    • JSON
  • Support for more output formats
    • LMDB
    • HDF5
    • DataFrame
    • Pickle
  • Support for large models
    • Model offloading
    • Sharding
    • FlashAttention (via Kernl?)
  • Support for more protein language models
    • Whole ProtTrans family
    • Whole ESM family
    • AlphaFold (?)
  • Implement all remaining TODOs in code
  • Evaluation
  • Demos
  • Distributed inference
  • Maybe support some sort of optimized inference such as quantization
    • This may be up to the model providers
  • Improve documentation
  • Support translation of gene sequences
  • Add tests. We need tests!!!