tanghaibao/goatools

Information content of a custom GO annotation.

Opened this issue · 1 comments

Hello,
Thanks for developing and maintaining this amazing resource.

There is an instruction to obtain information content for each GO term in human, mouse, and fly GOA.

Could you also show how to calculate the information content for custom GOA in id2go format (i.e., gene ID in the 1st column and GO terms concatenated with ";" in the 2nd column) for a non-model species (i.e., not human, mouse, fly, etc.)?

I would like to compare custom GOAs created by different tools and see which had called more specific GO terms and the information contents of GO terms shared by both GOAs or unique to each GOA.

Thanks!
Dong-Ha

Could you also show how to calculate the information content for custom GOA in id2go format (i.e., gene ID in the 1st column and GO terms concatenated with ";" in the 2nd column) for a non-model species (i.e., not human, mouse, fly, etc.)?

I realize this could be achieved by adopting a code in a previous issue (#196):
#196 (comment)

I have a couple of questions below.
...

Q1. When I tried the code in #196, tests module was not recognized:

>>> from tests.utils import get_godag
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'tests'

Can I replace the following:

from tests.utils import get_godag
godag = get_godag('go-basic.obo')

... with this?

from goatools.obo_parser import GODag
godag = GODag("go-basic.obo")

Will these two do the same?

...

Q2. When I tried the codes posted in #196 (with the modification of the part to get godag), I received a warning at the step getting the tcntobj that says "GODAG is none":

>>> from goatools.anno.idtogos_reader import IdToGosReader
>>> from goatools.semantic import TermCounts
>>> from goatools.semantic import get_info_content
>>> from goatools.obo_parser import GODag
>>> godag = GODag("go-basic.obo")
go-basic.obo: fmt(1.2) rel(2023-03-06) 46,579 Terms
>>> annoobj = IdToGosReader("temp.id2go")
HMS:0:00:00.074716  21,312 annotations READ: temp.id2go 
>>> tcntobj = TermCounts(godag, annoobj.get_id2gos())
WARNING:root:IdToGosReader.get_id2gos: GODAG is None. IGNORING namespace(None). If you are running `map_to_slim.py`, this warning can be ignored.
8333 IDs in all associations
>>> go_id = 'GO:0004930'
>>> info_content = get_info_content(go_id, tcntobj)
>>> print('IC={IC:.05f} {GO} {NAME}'.format(GO=go_id, IC=info_content, NAME=godag[go_id].name))
IC=4.04149 GO:0004930 G protein-coupled receptor activity

What is the meaning of this warning, and can it be ignored?
Plus, I would appreciate it if you could confirm that the IC value printed here would be as the GOAtools functions intended (i.e., correct).

@dvklopfenstein, thank you in advance!

Best,
Dong-Ha