Potential bug: "WARNING: GO:0006379 NOT FOUND IN DAG"
Maxim-Karpov opened this issue · 7 comments
Hello, I've realised there may be a bug in the enrichment tool when an entry in the obo file is marked obsolete. For example, the following entries trigger warnings:
**WARNING: GO:0000469 NOT FOUND IN DAG
**WARNING: GO:0006379 NOT FOUND IN DAG
**WARNING: GO:0010862 NOT FOUND IN DAG
**WARNING: GO:0014065 NOT FOUND IN DAG
**WARNING: GO:0014066 NOT FOUND IN DAG
**WARNING: GO:0016307 NOT FOUND IN DAG
**WARNING: GO:0030579 NOT FOUND IN DAG
**WARNING: GO:0031532 NOT FOUND IN DAG
**WARNING: GO:0035551 NOT FOUND IN DAG
**WARNING: GO:0042779 NOT FOUND IN DAG
**WARNING: GO:0043046 NOT FOUND IN DAG
**WARNING: GO:0043629 NOT FOUND IN DAG
**WARNING: GO:0043631 NOT FOUND IN DAG
**WARNING: GO:0047690 NOT FOUND IN DAG
**WARNING: GO:0048017 NOT FOUND IN DAG
**WARNING: GO:0061088 NOT FOUND IN DAG
**WARNING: GO:0070084 NOT FOUND IN DAG
**WARNING: GO:0090502 NOT FOUND IN DAG
**WARNING: GO:0098789 NOT FOUND IN DAG
**WARNING: GO:0102176 NOT FOUND IN DAG
**WARNING: GO:0102756 NOT FOUND IN DAG
**WARNING: GO:1903204 NOT FOUND IN DAG
These have been replaced by a different GO term, but goatools treats them as absent. It would be nice if the program replaced these terms for the user (if a replacement is present), or offered an option to count them regardless of their obsolete status.
Thank you. This is a great suggestion.
I am trying to understand this better and see how we can do this in a non-ambiguous way.
What should happen if there are multiple replacements?
Are the replacements semantically the same, or were they split from the old terms (in which case the semantic meaning changes)?
Perhaps all of the available replacements could be substituted into the analysis. As far as I've seen, the replacements tend to be very similar to their obsolete categories. For example, the obsolete GO term "cleavage involved in rRNA processing GO:0000469" is replaced by "rRNA processing GO:0006364".
This seems to be more complicated than I thought, as there are also consider tags. Furthermore, some term replacements can be crude simplifications/abstractions of the originals, e.g. "obsolete chaperonin ATPase activity GO:0003763" is replaced by "ATP hydrolysis activity GO:0016887". Given that only 1 replacement term is ever available per obsolete id, it is arguably justifiable to simply replace them in the analysis.
FYI, here's the code to extract the id, replaced_by, and consider entries for all obsolete terms in the obo file (credit: @iquasere):
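awk 'BEGIN {print "id\treplaced_by\tconsider"}
# Print the stanza just read if it was an obsolete term, then reset its fields.
function emit() {if (is_obsolete) print id"\t"replaced_by"\t"consider; is_obsolete=id=replaced_by=consider=""}
# Any stanza header closes the previous stanza; only [Term] stanzas are read.
/^\[/{emit(); in_term=($0 ~ /^\[Term\]/); next}
!in_term{next}
/^id:/{id=$2}
/^is_obsolete: true/{is_obsolete=1}
/^replaced_by:/{replaced_by=replaced_by ? replaced_by";"$2 : $2}
/^consider:/{consider=consider ? consider";"$2 : $2}
END {emit()}' go-basic.obo > go-basic.tsv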
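For anyone who would rather do the same extraction in Python, here is a minimal sketch. It uses only the standard library and does not call goatools. The function name `obsolete_terms` and the returned dictionary layout are my own choices for illustration. It assumes the usual OBO layout of `[Term]` stanzas with `id:`, `is_obsolete:`, `replaced_by:` and `consider:` tags.

```python
from typing import Dict, Iterable, List, Optional


def _flush(term: Optional[dict], result: Dict[str, Dict[str, List[str]]]) -> None:
    """Store the finished stanza in result if it is an obsolete term."""
    if term and term["is_obsolete"] and term["id"]:
        result[term["id"]] = {
            "replaced_by": term["replaced_by"],
            "consider": term["consider"],
        }


def obsolete_terms(lines: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Map each obsolete GO id to its replaced_by and consider ids.

    Hypothetical helper, not part of goatools.
    """
    result: Dict[str, Dict[str, List[str]]] = {}
    term: Optional[dict] = None  # None while outside a [Term] stanza

    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza header closes the previous stanza.
            _flush(term, result)
            if line == "[Term]":
                term = {"id": "", "is_obsolete": False,
                        "replaced_by": [], "consider": []}
            else:
                term = None  # e.g. [Typedef]: ignore its tags
        elif term is not None and ": " in line:
            key, value = line.split(": ", 1)
            value = value.split(" ! ")[0].strip()  # drop trailing OBO comment
            if key == "id":
                term["id"] = value
            elif key == "is_obsolete":
                term["is_obsolete"] = value == "true"
            elif key in ("replaced_by", "consider"):
                term[key].append(value)

    _flush(term, result)  # the last stanza has no header after it
    return result
```

To use it on a real file, something like `with open("go-basic.obo") as fh: table = obsolete_terms(fh)` should work, after which `table["GO:0000469"]` holds the two lists for that id.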
Thank you for the deep dive on this.
I'll attempt a fix this weekend, perhaps bringing in both replaced_by and consider.
I just pushed a commit adding an --obsolete option to find_enrichment.py.
--obsolete {keep,replace,skip}
Strategy for handling obsolete GO terms (default: skip)
The replace strategy updates the obsolete GO term with the terms suggested in replaced_by and consider. Please note that the default behavior stays the same, which is to skip the obsolete terms.
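To make the three strategies concrete, here is a rough sketch of how they might act on one gene's GO annotations. This is not the goatools implementation. The function `resolve_obsolete`, its arguments, and the reading of keep as "leave the obsolete id in place" are all assumptions made for illustration. Only the strategy names and the use of replaced_by and consider for replace come from the option description above.

```python
from typing import Dict, Iterable, List, Set

STRATEGIES = ("keep", "replace", "skip")


def resolve_obsolete(
    go_ids: Iterable[str],
    live_ids: Set[str],
    obsolete: Dict[str, Dict[str, List[str]]],
    strategy: str = "skip",
) -> Set[str]:
    """Return the GO ids to use in the analysis (hypothetical helper).

    live_ids: non-obsolete ids present in the DAG.
    obsolete: obsolete id -> {"replaced_by": [...], "consider": [...]}.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    resolved: Set[str] = set()
    for go_id in go_ids:
        if go_id in live_ids:
            resolved.add(go_id)
        elif go_id not in obsolete:
            continue  # not in the ontology at all: nothing to recover
        elif strategy == "keep":
            # Assumed meaning: count the obsolete id as-is.
            resolved.add(go_id)
        elif strategy == "replace":
            # Use every suggested term that still exists in the DAG.
            suggestions = obsolete[go_id]
            for alt in suggestions.get("replaced_by", []) + suggestions.get("consider", []):
                if alt in live_ids:
                    resolved.add(alt)
        # strategy == "skip": drop the obsolete id
    return resolved
```

Returning a set means that a replacement which the gene was already annotated with is not counted twice. Note that pulling in every consider suggestion can widen a gene's annotation considerably, which is the ambiguity raised earlier in the thread.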
Thank you again for the great idea - and please let me know if there's an issue.