proteins going missing at find best blast hit step.

Question

proteins going missing at find best blast hit step.

webbchen opened this issue 6 years ago · 1 comments

Dear developers

I've used porthoMCL to cluster proteins from 10 isolates of a fungus.
Most of them get clustered alright but some proteins disappear entirely in the find best hits step. Checking manually these proteins do have counterparts in the other isolates, sometimes 100 percent matches over the entire length.
Also, Blastp finds them as they're included in the blastp output files. These matches also survive the splitSimSeq step but then get lost at the find the best blast hit step.

I've run:
porthomclPairsBestHit.py -t taxon_list -s splitSimSeq -b besthit -q paralogTemp -x 1
(repeating this for x 2 to 10)
I saw no error messages.
I'm happy to send some files in confidence for you to try and replicate the problem.

Kind regards and many thanks!
A Webb

Answer 1 · 2018-12-07T08:19:49.000Z

Dear developers,

I have been using PorthoMCL for my analysis in the past months, but I realized I encountered the very same problem as @webbchen reported.

I am handling several well-studied evolutionary conserved genes, and from BLASTP (step 3) some of them are predicted to have e-value of 0.0, but then all these genes are LOST in the subsequent SplitSimSeq (step 4).

I tried to track what happened in the result tables, and here is the instance where I am currently stuck:
Some Gene 1 is highly conserved in species A, B, C, and D. In the BLAST result tables, the entries where AGene1 is blast-ed against (the conserved gene counterpart in) B genome / C genome / D genome all have e-value 0.0. Reciprocal BLASTings show similar results where some of the calculated e-value are slightly larger than 0 - say, 1e-120. However, in the SplitSimSeq result table (.ss.tsv) table, only those entries with e-value large than 0 (e.g. those with 1e-120) are included, but all entries with e-value = 0 are missed.

So these entries with e-value = 0 are completely missed in subsequent tables, such as the best-hit table, paralog tables, and also the ortholog tables.

I tried to use the same source genome as input for the traditional ORTHOMCL, but these genes are clustered nicely (since almost all of them are highly conserved with e-value 0.0). (We're actually dealing with some novel genome so it may be a bit difficult for me to share the whole set of data to you for de-bugging, but I think what I've tracked may be one fatal reason)

May I know if it is possible for you to try de-bugging this potential error ASAP?
Many thanks!

Best,
Jason.