"Multi-query input not implemented"

Question

Closed this issue a year ago · 15 comments

What is the fastest way to handle multiple sequences (e.g. without loading the database into memory each time)?

I know there are some sample scripts, but I'm having a hard time seeing how they relate to the simple, step-by-step example, which only works for a single FASTA file.

Answer 1 · 2023-09-11T10:28:05.000Z

@DaRinker Thank you for your message. Currently, multi-query input is not implemented, but this is a feature that we will make available very soon.

Answer 2 · 2023-09-11T12:14:33.000Z

> @DaRinker Thank you for your message. Currently, multi-query input is not implemented, but this is a feature that we will make available very soon.

In the meantime, can you help me with the question I asked? I'm finding that the single-query process runs very slowly (much more slowly than the MPI web portal), and I think it's because the database has to be loaded into memory each time.

What is the (current) best practice for running 1000s of sequences locally?

Thank you

Answer 3 · 2023-09-11T14:51:43.000Z

Can you contact me at stanislaw.dunin-horkawicz@tuebingen.mpg.de and we will figure out what the problem is?

Answer 4 · 2023-09-13T19:53:01.000Z

Thanks for your help. Using the examples.sh script cut the processing time by about 60%.

Answer 5 · 2023-10-06T18:40:44.000Z

Hi, I have 2 million query sequences and I want to run a homology search of these against a database of 250 sequences. Currently there is no multi-query option, and running each of the 2 million queries individually would take a very long time. Is there any way to speed up pLM-BLAST?

Answer 6 · 2023-10-06T18:58:54.000Z

@Citugulia40 Hi! The multi-query option is already implemented, but needs some testing before merging with the main branch. We expect to release it very soon along with other updates. cc @DaRinker

Answer 7 · 2023-10-06T19:04:17.000Z

Thank you so much.
Sorry to ask, but do you expect the multi-query option to be merged into the main branch this month, or later?

Answer 8 · 2023-10-06T19:06:18.000Z

Definitely this month, maybe even next week.

Answer 9 · 2023-10-10T16:18:39.000Z

Hi, do you have an estimate of how long pLM-BLAST would take to run 2 million query sequences against a database of 250 sequences once the multi-query option is implemented?

Answer 10 · 2023-10-10T16:26:40.000Z

We will provide a speed benchmark along with the multi-query support. As a rough estimate, the ECOD benchmark (all-versus-all comparison of 1500 sequences) took about 30 minutes on a 20-core CPU. Running times will depend heavily on the cosine similarity cutoff and the length of the sequences. You may also want to consider clustering your 2M sequences to 40-50% identity at a high coverage cutoff (e.g. with MMseqs2). Given the sensitivity of pLM-based methods, searching with 1-2 examples per cluster should be sufficient.
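The "1-2 examples per cluster" step could be sketched roughly as follows. This is only an illustration, not part of pLM-BLAST: it assumes the clustering result is available as a two-column, tab-separated file (representative ID, member ID), which is the layout of the `_cluster.tsv` file written by MMseqs2 `easy-cluster`. The function name `pick_representatives` and the sequence IDs are made up for the example.

```python
import csv
import io
from collections import defaultdict


def pick_representatives(tsv_lines, per_cluster=2):
    """Return up to `per_cluster` sequence IDs for each cluster.

    `tsv_lines` is any iterable of lines in the assumed format
    'representative<TAB>member' (one line per cluster member).
    """
    clusters = defaultdict(list)
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)

    selected = []
    for representative, members in clusters.items():
        # Keep the representative first, then other members in file order.
        ordered = [representative] + [m for m in members if m != representative]
        selected.extend(ordered[:per_cluster])
    return selected


# Hypothetical clustering output: one cluster of three and one singleton.
example = "seqA\tseqA\nseqA\tseqB\nseqA\tseqC\nseqD\tseqD\n"
print(pick_representatives(io.StringIO(example)))  # ['seqA', 'seqB', 'seqD']
```

With a real file, the same function can be given an open file handle instead of the `io.StringIO` object; the selected IDs can then be used to filter the query FASTA before searching.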

Answer 11 · 2023-10-10T16:30:15.000Z

Thank you so much for your kind support. Eagerly waiting for the multi-query option.

Answer 12 · 2023-10-17T01:37:47.000Z

> Definitely this month, maybe even next week.

Hi, is there any update regarding the multi-query option?

Thanks in advance

Answer 13 · 2023-10-18T08:46:12.000Z

Hi, all changes are in https://github.com/labstructbioinf/pLM-BLAST/tree/multi_query_feature and I will merge them on Thursday. There is still some work to do.

Answer 14 · 2023-10-18T14:06:03.000Z

OK, thank you very much.

Answer 15 · 2023-10-19T19:29:23.000Z

Changes are now live, looking forward to your feedback :)