ntranoslab/esm-variants

Parallel scoring of sequences on GPU


Hi, I wanted to ask whether it would be possible to add parallel/batch scoring of multiple sequences concurrently on a GPU. I am currently using the ESM1v ensemble on datasets of hundreds of sequences, so I need to run each of them through all 5 ESM1v models, which adds up to quite a bit of compute time.

I am using an Nvidia L4 GPU, which has 24 GB of RAM. I noticed that the model takes up only 4-5 GB of that when running, so I tried running multiple ESM1v scoring processes in parallel. It kind of works, but I'm only getting a total speed-up of maybe 10-15% over running the models sequentially, which is nowhere near what one could (perhaps naively) expect.

I looked into the code, and the get_wt_LLR function operates on individual sequences. Would it be possible to update it to score multiple sequences at once, assuming there are enough GPU resources?
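
For reference, here is roughly what I have in mind: a minimal sketch assuming the standard fair-esm API. The `batched_wt_llr` helper and all its names are mine, not the repo's, and it only approximates what get_wt_LLR computes (WT-marginal log-likelihood ratios), so treat it as a starting point rather than a drop-in replacement:

```python
import torch
import esm

AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def batched_wt_llr(named_seqs, model, alphabet, device="cuda"):
    """Score a list of (name, sequence) pairs in a single forward pass.

    Returns one [L x 20] tensor per sequence, where row i holds
    log p(mut at i) - log p(wt_i at i) under the unmasked WT sequence.
    """
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter(named_seqs)  # tail-pads shorter seqs with <pad>
    tokens = tokens.to(device)
    with torch.no_grad():
        logits = model(tokens)["logits"]        # [B, T, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)

    aa_idx = torch.tensor([alphabet.get_idx(a) for a in AAS], device=device)
    llrs = []
    for b, (_, seq) in enumerate(named_seqs):
        # ESM1v prepends <cls>, so residue i sits at token position i + 1
        per_pos = log_probs[b, 1 : len(seq) + 1]                # [L, vocab]
        mut = per_pos[:, aa_idx]                                # [L, 20]
        wt_idx = torch.tensor([alphabet.get_idx(a) for a in seq], device=device)
        wt = per_pos.gather(1, wt_idx.unsqueeze(1))             # [L, 1]
        llrs.append((mut - wt).cpu())
    return llrs

model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
model = model.eval().cuda()
llrs = batched_wt_llr([("p1", "MKTAYIAKQR"), ("p2", "MALWMRLLPLLALLALWG")],
                      model, alphabet)
```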

@jacekzkominek
We have never tested the ESM models on minibatches of protein sequences. The added padding could in theory affect the accuracy of the predictions, but I doubt it will have a meaningful effect in practice (as I assume the model was trained with padding and learned to ignore it).
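
If you want to sanity-check that concern, something like the following quick comparison would do. This is illustrative only; `batched_wt_llr` refers to the hypothetical helper you sketched above, and the sequences are arbitrary:

```python
# Score a short sequence alone, then again batched with a much longer one
# (so it gets heavily padded), and compare the two LLR matrices.
short = ("short", "MKTAYIAKQRQISFVK")
long_ = ("long", "MKTAYIAKQR" * 30)  # anything substantially longer works

alone = batched_wt_llr([short], model, alphabet)[0]
in_batch = batched_wt_llr([short, long_], model, alphabet)[0]
print((alone - in_batch).abs().max())  # ~0 would mean padding is effectively ignored
```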
I don't think we'll have the time to implement and test this feature in the near future. If you decide to implement it on your own and would like to make a pull request, we would appreciate it.

@nadavbra, thanks for the response, I will look into it. I'm curious: do you think it would matter much whether the sequences in a minibatch were aligned or simply tail-padded with gaps/Xs to the length of the longest one? I know some other models, like AlphaFold, use MSAs specifically, and even the original ESM codebase has an MSABatchConverter class as an alternative to the "regular" BatchConverter. The choice would affect which sequences can be run together and which cannot.
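
To make the tail-padding option concrete (illustrative snippet only; the built-in behaviour in (b) is how I understand fair-esm's batch converter, worth double-checking):

```python
seqs = ["MKTAYIAKQR", "MKTAYIAKQRQISFVKSHFSRQ"]
max_len = max(len(s) for s in seqs)

# (a) manual tail-padding with X's, so every sequence has the same length
x_padded = [s + "X" * (max_len - len(s)) for s in seqs]

# (b) the batch converter's built-in behaviour: shorter sequences are
# tail-padded with the <pad> token, which is presumably what the model
# saw during pre-training (per your assumption above), unlike runs of X
_, _, tokens = alphabet.get_batch_converter()(
    [(f"s{i}", s) for i, s in enumerate(seqs)]
)
```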

@nadavbra, I played a bit with trying to improve performance by fusing proteins together, separated by a long gap. Even though you warn against using non-standard amino acids, the model's output does include scores for B, O, S, U, X and gaps, so I just added '-' to the AAIndex and to the result dataframe to see how it affects the results. Interestingly, the final scores (ignoring the gaps, of course) were very similar between two protein sequences run separately and the same two run as a single sequence with a 1000-character gap in between, with a mean difference of ~0.3. Not identical, but very close for the most part. Unfortunately, it turned out there are no performance benefits to such fusions at scale: because of tiling, the scores have to be re-run over many windows anyway, which ends up actually taking longer than running things separately. Oh well, on to trying something else...
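
For the record, here is roughly the fusion setup I tried, as a sketch with my own names, reusing the hypothetical `batched_wt_llr` helper from above. Note the toy sequences here keep the fusion at 1020 residues so it just fits one ~1022-residue window; real proteins plus a 1000-gap blow past that limit, which is exactly what forces the tiling:

```python
GAP = "-" * 1000
seq_a, seq_b = "MKTAYIAKQR", "MALWMRLLPL"   # toy sequences for illustration
fused = seq_a + GAP + seq_b                 # 1020 residues: just fits one window

llr_fused = batched_wt_llr([("fused", fused)], model, alphabet)[0]
llr_a = llr_fused[: len(seq_a)]             # rows for protein A
llr_b = llr_fused[len(seq_a) + len(GAP):]   # rows for protein B

sep_a = batched_wt_llr([("a", seq_a)], model, alphabet)[0]
print((llr_a - sep_a).abs().mean())         # in my runs this came out around ~0.3
```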

@jacekzkominek Thank you for reporting back on your experiments.
Have you tried forming the batches so that each batch contains proteins of more or less the same length (and the same number of windows)? I expect that would improve performance; something along these lines:
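
```python
# Minimal sketch of length-matched batching (illustrative names; the batch
# size would need tuning for your L4's 24 GB). Sorting by length means each
# batch wastes as little padding as possible.
def length_bucketed_batches(named_seqs, batch_size=8):
    ordered = sorted(named_seqs, key=lambda item: len(item[1]))
    for i in range(0, len(ordered), batch_size):
        yield ordered[i : i + batch_size]

# all_proteins: your full list of (name, sequence) pairs
for batch in length_bucketed_batches(all_proteins):
    batch_llrs = batched_wt_llr(batch, model, alphabet)  # helper sketched above
```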
About MSAs: I've never tried to use ESM in such a way, so I don't have a strong intuition as to what extent it would improve predictions.