extract best OG (v2.1.12)

Question

algrgr opened this issue a year ago · 4 comments

Hello all,

I can't seem to get an easy solution to extract the best OG from the output table.
There are two columns in the output table: "eggNOG_OGs" and "max_annot_lvl"

Given this info, what is an easy way to extract "3NX5D"?
So, the script needs to look up the value of one column (max_annot_lvl) in the other (eggNOG_OGs) and extract the substring between "," and "@" (3NX5D).

Any ideas would be appreciated.

Btw, it would be nice to add a "bestOG" field to the output, like there was in previous versions.

cheers,
alex

Answer 1 · 2023-10-10T08:53:26.000Z

Hi @algrgr ,

For instance, you may split the eggNOG_OGs field by ",". Then split each resulting item by "@". Put the pieces in a dictionary (if using Python), with the right half (4751|Fungi) as key and the left half (3NX5D) as value. Then look up the "max_annot_lvl" value in the dictionary.
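The steps above could be sketched in Python roughly as follows. This is only a minimal illustration, not part of eggNOG-mapper: the function name `best_og` is hypothetical, and in the sample row only `3NX5D@4751|Fungi` comes from this thread, the other entries are made up.

```python
def best_og(eggnog_ogs, max_annot_lvl):
    """Return the OG whose level equals max_annot_lvl, or None if absent."""
    level_to_og = {}
    # Split the eggNOG_OGs field on "," to get items such as "3NX5D@4751|Fungi".
    for item in eggnog_ogs.split(","):
        # Split each item on "@": left half is the OG, right half is the level.
        og, sep, level = item.partition("@")
        if sep:  # ignore items without an "@"
            level_to_og[level] = og
    # Look up the max_annot_lvl value in the dictionary.
    return level_to_og.get(max_annot_lvl)


# Illustrative values: only "3NX5D@4751|Fungi" is taken from the thread.
ogs = "KOG0001@1|root,KOG0001@2759|Eukaryota,3NX5D@4751|Fungi"
print(best_og(ogs, "4751|Fungi"))  # 3NX5D
```

To apply this to a whole annotations file, one would read it as tab-separated text, skip lines starting with "#", and pass the eggNOG_OGs and max_annot_lvl columns of each row to the function.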

I hope this is of help.

Best,
Carlos

Answer 2 · 2023-10-10T09:23:46.000Z

Hello @Cantalapiedra ,

Thanks for the reply! Since I don't have good skills for such complicated parsing, my colleague came up with an R script that does this extraction (frankly, I'd prefer a one-liner awk solution, but ok...)
Still, perhaps you could output it in a separate field in future versions of eggNOG? This would save some headaches for not-so-skillful people like me : )
And thanks much for the software, btw!

cheers,
alex

Answer 3 · 2023-10-10T10:23:59.000Z

It should be easy to do with awk with 2 splits and one for loop. You may do it to practice ;) or do something similar to:

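# Skip comment lines, then split eggNOG_OGs (column 5) on "," and each item on "@";
# print the OG whose level matches max_annot_lvl (column 6).
grep -v "^#" TEST.emapper.annotations |
awk -F '\t' '{n = split($5, v, ","); for (i = 1; i <= n; i++) {split(v[i], w, "@"); if (w[2] == $6) print w[1]}}'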

To make it easier to parse we may need to provide an additional file, since we are trying to avoid changing the output format as much as we can in recent versions. But yes, there are different things we may add to make life easier for users and downstream analyses. Thank you for the suggestion!

Answer 4 · 2023-10-11T13:59:05.000Z

@Cantalapiedra thanks indeed for that piece of script. That is more efficient!