evolutionaryscale/esm

Batch Support for Obtaining Residue Embeddings


I am currently trying to obtain residue embeddings for protein sequences. The typical workflow involves the following steps:

from esm.sdk.api import ESMProtein, SamplingConfig

# `client` is an already-loaded model or inference client
protein = ESMProtein(sequence=sequence)
protein_tensor = client.encode(protein)
config = SamplingConfig(return_per_residue_embeddings=True)
output = client.forward_and_sample(protein_tensor, config)
embeddings = output.per_residue_embedding

However, I don't know how to get embeddings in batch mode. I checked the example in esm/examples/local_generate.py (lines 129-135), but it only shows the batch_generate function, which does not include a way to obtain embeddings. How can I obtain embeddings for a batch of sequences?

Bumping this issue. I am also interested in learning whether the batching function for generating embeddings is ready yet and, if possible, in seeing a small example script showing a potential use case. In the meantime, could you loop through a list of FASTA files and generate embeddings one at a time, or is there a reason you would want to generate embeddings in batches?

We currently don't have support for this, though it shouldn't be too bad to implement. You can definitely just loop through and generate one at a time unless you're running into speed concerns.
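For reference, here is a minimal sketch of the one-at-a-time loop, assuming the sequences come from FASTA-formatted text. The helper names `parse_fasta` and `embed_one_at_a_time` are hypothetical and not part of the esm package. The `embed_one` callable is a placeholder for the per-sequence workflow from the original question.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple, TypeVar

E = TypeVar("E")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first token of the header as the record id.
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def embed_one_at_a_time(
    records: Iterable[Tuple[str, str]],
    embed_one: Callable[[str], E],
) -> Dict[str, E]:
    """Call embed_one on each sequence in turn; no batching is involved."""
    return {name: embed_one(sequence) for name, sequence in records}


# Hypothetical wiring to the snippet from the question. It needs the esm
# package and an already-loaded `client`, so it is left commented out:
#
# def embed_one(sequence):
#     protein_tensor = client.encode(ESMProtein(sequence=sequence))
#     config = SamplingConfig(return_per_residue_embeddings=True)
#     return client.forward_and_sample(protein_tensor, config).per_residue_embedding
#
# with open("proteins.fasta") as handle:  # placeholder path
#     embeddings = embed_one_at_a_time(parse_fasta(handle), embed_one)
```

Because each sequence is processed on its own, no padding is needed and every result has exactly the length the model returns for that protein. The trade-off is throughput, which is the speed concern mentioned above.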

Hi @Junseok0207 @winatony @ebetica, my group made a wrapper for this that has full Hugging Face integration and batching.
https://huggingface.co/Synthyra/ESMplusplus_small