How to interpret OrthologuesStats*.tsv files?
bbalog87 opened this issue · 4 comments
Hello,
it is not clear to me how to interpret the files in OrthologuesStats*.tsv.
For instance, this matrix from the OrthologuesStats_one-to-one.tsv file is not symmetric. It is not clear how to infer the total number of one-to-one orthologs for each species. Is it the row sums or the column sums?
         Chaar    Latma    Pagma    Parch    Perfl    Sanlu    Silsi
Chaar      0.0  12540.0  11361.0  13307.0   9480.0   6867.0  14564.0
Latma  12540.0      0.0  10242.0  11736.0   8457.0   6323.0  12891.0
Pagma  11361.0  10242.0      0.0  11388.0   7496.0   6068.0  12220.0
Parch  13307.0  11736.0  11388.0      0.0   9261.0   6840.0  13963.0
Perfl   9480.0   8457.0   7496.0   9261.0      0.0   5292.0   9781.0
Sanlu   6867.0   6323.0   6068.0   6840.0   5292.0      0.0   7035.0
Silsi  14564.0  12891.0  12220.0  13963.0   9781.0   7035.0      0.0
Thank you,
Julien
Hi Julien
I've just checked the matrix in your post and it is symmetric, e.g. it reports that the number of one-to-one orthologs between Chaar and Latma is 12540, and that is the same whether you look at the M(1,0) entry of the matrix or the M(0,1) entry. So for each pair of species the corresponding number in the matrix is the number of one-to-one orthologs between that pair of species. You don't need to take the sum over the rows or columns.
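To check this on a file of your own, a minimal Python sketch along these lines may help. It uses only the standard library and assumes the file is tab-separated with an empty top-left header cell; `read_matrix` and `is_symmetric` are hypothetical helper names, and the values are the ones from the matrix in this issue.

```python
import csv
import io


def read_matrix(handle):
    """Parse a species-by-species TSV into {row_species: {col_species: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)[1:]  # assumption: the first header cell is an empty corner
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {sp: float(v) for sp, v in zip(header, row[1:])}
    return matrix


def is_symmetric(matrix):
    """True if M(a, b) == M(b, a) for every pair of species."""
    return all(matrix[a][b] == matrix[b][a] for a in matrix for b in matrix)


# Values copied from the one-to-one matrix in this issue.
species = ["Chaar", "Latma", "Pagma", "Parch", "Perfl", "Sanlu", "Silsi"]
values = [
    [0.0, 12540.0, 11361.0, 13307.0, 9480.0, 6867.0, 14564.0],
    [12540.0, 0.0, 10242.0, 11736.0, 8457.0, 6323.0, 12891.0],
    [11361.0, 10242.0, 0.0, 11388.0, 7496.0, 6068.0, 12220.0],
    [13307.0, 11736.0, 11388.0, 0.0, 9261.0, 6840.0, 13963.0],
    [9480.0, 8457.0, 7496.0, 9261.0, 0.0, 5292.0, 9781.0],
    [6867.0, 6323.0, 6068.0, 6840.0, 5292.0, 0.0, 7035.0],
    [14564.0, 12891.0, 12220.0, 13963.0, 9781.0, 7035.0, 0.0],
]

# Rebuild the TSV text in memory so the example runs without a file on disk.
lines = ["\t".join([""] + species)]
lines += ["\t".join([sp] + [str(v) for v in row]) for sp, row in zip(species, values)]
one_to_one = read_matrix(io.StringIO("\n".join(lines)))

print(is_symmetric(one_to_one))       # True
print(one_to_one["Chaar"]["Latma"])   # 12540.0
```

For a real file, the same function could be called as `read_matrix(open("OrthologuesStats_one-to-one.tsv", newline=""))`, provided the layout assumption above holds.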
All the best
David
Hi David,
Thanks for the helpful answer. I have now understood how to read the matrix.
How about this one-to-many matrix?
        Chaar   Latma   Pagma   Parch   Perfl   Sanlu   Silsi
Chaar     0.0   816.0  1269.0  1511.0  6740.0   597.0   871.0
Latma   431.0     0.0  1156.0  1388.0  5992.0   625.0   787.0
Pagma  1007.0  1218.0     0.0  2021.0  6088.0   738.0  1387.0
Parch   383.0   690.0  1110.0     0.0  6552.0   513.0   714.0
Perfl   197.0   441.0   772.0   706.0     0.0   448.0   430.0
Sanlu   239.0   426.0   719.0   827.0  3587.0     0.0   431.0
Silsi   441.0   822.0  1135.0  1492.0  6923.0   608.0     0.0
Best,
Julien
Hi Julien
The matrix is not symmetric because one-to-many, unlike one-to-one, is not a symmetrical relationship. Thanks for bringing this up; below is an explanation of how this works. I'll add something to the README file to describe these results files more fully, as I realise now that there's not enough info for users to interpret them currently.
For some gene trees you will have multiple duplication events post-speciation. This could lead to, for example, 2 genes in Latma being orthologs of 3 genes in Chaar. All of these occurrences are summed up in the many-to-many matrix. This case would add 2 to the entry for M(Latma, Chaar) and 3 to the entry for M(Chaar, Latma). An example is a tree in which 3 genes in arabidopsis (AT2G07671, ATMG01080, ATMG00040) are orthologs of 2 genes in volvox (Vocar.0009s0017.1, Vocar.0009s0018.1).
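The counting rule described here can be sketched in a few lines of Python. This only illustrates the rule as stated in this post (each species' row receives the number of its own genes in the group); it is not the tool's actual implementation, and `add_many_to_many` is a hypothetical helper name.

```python
from collections import defaultdict


def add_many_to_many(matrix, species_a, genes_a, species_b, genes_b):
    """Add one many-to-many ortholog group to the running matrix."""
    # M(a, b) grows by the number of genes from species a in the group,
    # and M(b, a) by the number of genes from species b.
    matrix[species_a][species_b] += len(genes_a)
    matrix[species_b][species_a] += len(genes_b)


# Nested dict acting as a sparse species-by-species matrix of counts.
matrix = defaultdict(lambda: defaultdict(int))

# The arabidopsis/volvox example from this post: 3 genes vs 2 genes.
add_many_to_many(
    matrix,
    "A. thaliana", ["AT2G07671", "ATMG01080", "ATMG00040"],
    "V. carteri", ["Vocar.0009s0017.1", "Vocar.0009s0018.1"],
)

print(matrix["A. thaliana"]["V. carteri"])  # 3
print(matrix["V. carteri"]["A. thaliana"])  # 2
```

Because the two entries are incremented by different amounts, the resulting matrix is generally not symmetric.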
For the one-to-many/many-to-one relationships, you might have matrices like this:
one-to-many, X=
             A. thaliana    O. sativa    P. patens   V. carteri
A. thaliana            0         1601         1614          115
O. sativa           1893            0         1686          108
P. patens            906          880            0          123
V. carteri          1693         1606         2155            0
many-to-one, Y=
             A. thaliana    O. sativa    P. patens   V. carteri
A. thaliana            0         4683         2463         5596
O. sativa           4135            0         2483         5510
P. patens           4099         4347            0         6439
V. carteri           282          269          329            0
This means there are 1693 genes in V. carteri that are in a one-to-many relationship with orthologs in A. thaliana, whereas there are only 115 genes in A. thaliana that are in a one-to-many relationship with genes in V. carteri. That corresponds to what should be expected: the genome of A. thaliana is larger and there have been more gene duplication events in the lineage leading to A. thaliana than in the one leading to the green alga V. carteri.
A little care needs to be taken when reading these files though as the 1693 genes in volvox are orthologs of the 5596 genes in arabidopsis (i.e. X(i,j) genes are orthologs of Y(j,i) genes) and the 115 genes in arabidopsis are orthologs of the 282 genes in volvox. This makes sense in terms of the naming of the matrices and the ordering of the entries, but might be different from what might naively be expected.
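To make the X(i,j) / Y(j,i) pairing concrete, here is a small Python sketch using the two example matrices above. `paired_counts` is a hypothetical helper name introduced only for illustration.

```python
SPECIES = ["A. thaliana", "O. sativa", "P. patens", "V. carteri"]

# one-to-many matrix X, rows and columns in SPECIES order
X = [
    [0, 1601, 1614, 115],
    [1893, 0, 1686, 108],
    [906, 880, 0, 123],
    [1693, 1606, 2155, 0],
]

# many-to-one matrix Y, rows and columns in SPECIES order
Y = [
    [0, 4683, 2463, 5596],
    [4135, 0, 2483, 5510],
    [4099, 4347, 0, 6439],
    [282, 269, 329, 0],
]


def paired_counts(species_i, species_j):
    """Return (X(i,j), Y(j,i)): X(i,j) genes in species i are
    orthologs of Y(j,i) genes in species j."""
    i = SPECIES.index(species_i)
    j = SPECIES.index(species_j)
    return X[i][j], Y[j][i]


print(paired_counts("V. carteri", "A. thaliana"))  # (1693, 5596)
print(paired_counts("A. thaliana", "V. carteri"))  # (115, 282)
```

Note the swapped indices in the second lookup: the partner of an entry in X sits at the transposed position in Y.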
All the best
David
Hi David,
Thank you for the comprehensive explanations. It would really be great if you could edit the README, in order to help users to better interpret those results.
Best,
Julien.
PS: I deleted the previous post by mistake. I'll just repost the one-to-many matrix here for other readers who might be interested in this issue.
        Chaar   Latma   Pagma   Parch   Perfl   Sanlu   Silsi
Chaar     0.0   816.0  1269.0  1511.0  6740.0   597.0   871.0
Latma   431.0     0.0  1156.0  1388.0  5992.0   625.0   787.0
Pagma  1007.0  1218.0     0.0  2021.0  6088.0   738.0  1387.0
Parch   383.0   690.0  1110.0     0.0  6552.0   513.0   714.0
Perfl   197.0   441.0   772.0   706.0     0.0   448.0   430.0
Sanlu   239.0   426.0   719.0   827.0  3587.0     0.0   431.0
Silsi   441.0   822.0  1135.0  1492.0  6923.0   608.0     0.0