labstructbioinf/pLM-BLAST

"Multi-query input not implemented"

Closed this issue · 15 comments

What is the fastest way to handle multiple sequences? (e.g. without loading the db into memory each time?).

I know there are some sample scripts, but I'm having a hard time seeing how they relate to the simple, step-by-step example (that is only good for a single fasta file).

@DaRinker Thank you for your message. Currently, multi-query input is not implemented, but this is a feature that we will make available very soon.

In the meantime, can you help me with the question I asked? I'm finding the single query process to run very slowly (much more slowly than the MPI web portal) and I think it's because the database is having to load into memory each time.

What is the (current) best practice for running 1000s of sequences locally?

Thank you

Can you contact me at stanislaw.dunin-horkawicz@tuebingen.mpg.de and we will figure out what the problem is?

Thanks for your help. Using the examples.sh script cut the processing time by about 60%.

Hi, I have 2 million query sequences and I want to run a homology search of these against a database of 250 sequences. Currently there is no multi-query option, and running each of the 2 million queries individually will take a very long time. Is there any way to speed up pLM-BLAST?

@Citugulia40 Hi! The multi-query option is already implemented, but needs some testing before merging with the main branch. We expect to release it very soon along with other updates. cc @DaRinker

Thank you so much.
Sorry to ask, but do you expect the multi-query option to be merged into the main branch this month, or later?

Definitely this month, maybe even next week.

Hi, I'm curious to know if you have an estimate of how long pLMBLAST would take to execute when running 2 million query sequences against a database of 250 sequences (after implementing the multi-query option)?

We will provide a speed benchmark along with the multi-query support. As a rough estimate, the ECOD benchmark (all-versus-all comparison of 1500 sequences) took about 30 minutes on a 20-core CPU. Running times will depend heavily on the cosine similarity cutoff and the length of the sequences. You may also want to consider clustering your 2M sequences to 40-50% identity at a high coverage cutoff (e.g. with MMseqs2). Given the sensitivity of pLM-based methods, searching with 1-2 examples per cluster should be sufficient.
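The "1-2 examples per cluster" step could be sketched roughly as below. This is only an illustration and not part of pLM-BLAST: it assumes the clustering was done with MMseqs2 `easy-cluster`, which writes a two-column TSV (representative ID, member ID), and the function names and the file name `clusterRes_cluster.tsv` are hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_cluster_tsv(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (representative_id, member_id) pairs from a two-column TSV.

    Assumption: the file follows the MMseqs2 easy-cluster layout,
    one tab-separated representative/member pair per line.
    """
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                yield row[0], row[1]


def pick_cluster_examples(
    pairs: Iterable[Tuple[str, str]], per_cluster: int = 2
) -> Dict[str, List[str]]:
    """Keep the representative plus further members, up to per_cluster IDs.

    per_cluster is expected to be >= 1; the representative is always kept.
    """
    chosen: Dict[str, List[str]] = defaultdict(list)
    for rep, member in pairs:
        picked = chosen[rep]
        if not picked:
            picked.append(rep)  # the representative always comes first
        if member != rep and len(picked) < per_cluster:
            picked.append(member)
    return dict(chosen)


if __name__ == "__main__":
    # Tiny in-memory example; in practice the pairs would come from
    # read_cluster_tsv("clusterRes_cluster.tsv") (hypothetical file name).
    demo_pairs = [("A", "A"), ("A", "B"), ("A", "C"), ("D", "D")]
    for rep, ids in pick_cluster_examples(demo_pairs, per_cluster=2).items():
        print(rep, ids)
```

The selected IDs can then be used to extract a reduced query FASTA, so that only one or two sequences per cluster are searched instead of all 2M.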

Thank you so much for your kind support. Eagerly waiting for the multi-query option.

Hi, I just wanted to ask whether there is any update regarding the multi-query option?

Thanks in advance

Hi, all changes are in: https://github.com/labstructbioinf/pLM-BLAST/tree/multi_query_feature. I will merge them on Thursday. There is still some work to do.

Ok, Thank you very much

Changes are now live, looking forward to your feedback :)