Usage of with eggnog-mapper2

Question

Lucas-Maciel opened this issue 4 years ago · 10 comments

Lucas-Maciel commented 4 years ago

eggnog2gbk version: 0.0.7
Python version: 3.8.2
Operating System: CentOS Linux 7

Description

Hi, I'm trying to use your tool with my output from eggnog-mapper v2

What I Did

I used your test data and it worked, but not with mine.

emapper2gbk genomic -fg ../Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_cds_from_genomic.fna -fp ../Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_protein.faa -o teste.out -a Roseburia_inulinivorans_DSM16841.emapper.annotations 
The default organism name 'cellular organisms' is used.
Formatting fasta and annotation file for GCF_000174195.1_ASM17419v1_genomic
Traceback (most recent call last):
  File "/raeslab/scratch/lucmac/miniconda3/bin/emapper2gbk", line 8, in <module>
    sys.exit(cli())
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/__main__.py", line 245, in cli
    gbk_creation(genome=args.fastagenome, proteome=args.fastaprot, annot=args.annotation, gff=args.gff, org=orgnames, gbk=args.out, gobasic=args.gobasic, dirmode=directory_mode, cpu=args.cpu, metagenomic_mode=False)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/emapper2gbk.py", line 32, in gbk_creation
    fa_to_gbk.main(genome, proteome, annot, org, gbk, gobasic)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/fa_to_gbk.py", line 170, in main
    faa_to_gbk(genome_fasta, prot_fasta, annot_table, species_name, gbk_out, gobasic)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/fa_to_gbk.py", line 64, in faa_to_gbk
    annotation_data = dict(read_annotation(annotation_data))
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/utils.py", line 269, in read_annotation
    annotation_data.columns = headers_row
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5475, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 669, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 220, in set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 24 elements, new values have 1 elements

# Fri Feb 12 12:56:02 2021
# emapper-2.0.6
# emapper.py -i Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_protein.faa --cpu 4 --itype proteins -m diamond --output_dir eggnog --output Roseburia_inulinivorans_DSM16841 
#
#query_name     seed_eggNOG_ortholog    seed_ortholog_evalue    seed_ortholog_score     eggNOG OGs   narr_og_name     narr_og_cat     narr_og_desc    best_og_name    best_og_cat     best_og_desc    Preferred_name        GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass  BRITE    KEGG_TC CAZy    BiGG_Reaction   PFAMs
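A mismatch like the one in the traceback (24 data columns against a 1-element header) can be checked by hand before running emapper2gbk. The sketch below is not emapper2gbk's own parser. It assumes the annotations file is tab-separated and that the column names sit on the comment line starting with `#query`, as in the header shown above. The function names and the tiny sample are made up for illustration.

```python
import io


def read_emapper_header(handle):
    """Return the column names from the '#query...' comment line, or None."""
    for line in handle:
        if line.startswith("#query"):
            # Drop the leading '#', the newline, and split on tabs.
            return line.lstrip("#").rstrip("\n").split("\t")
    return None


def first_data_width(handle):
    """Return the number of tab-separated fields in the first data line."""
    for line in handle:
        if line.strip() and not line.startswith("#"):
            return len(line.rstrip("\n").split("\t"))
    return None


# Made-up three-column sample, NOT real eggnog-mapper output.
sample = (
    "# emapper-2.0.6\n"
    "#query_name\tseed_eggNOG_ortholog\tGOs\n"
    "prot1\t1234.XYZ\tGO:0008150\n"
)
header = read_emapper_header(io.StringIO(sample))
width = first_data_width(io.StringIO(sample))
print(len(header), width)
```

If the two numbers differ on a real file, the header line is probably not being split the way the data lines are, which matches the symptom in the traceback.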

Answer 1 · 2021-02-23T13:21:22.000Z

Hi Lucas, sorry for the delay.

What you describe is likely due to a bug introduced by changes in eggnog-mapper since the latest release.

Unfortunately, we don't have the latest releases (2.0.x) of emapper installed on our servers yet, and it appears that the online version of eggnog-mapper does not have the latest release in production either.

Is there a way for you to run emapper on our test data and share the .emapper.annotations file with us so we could fix the bug?

Answer 2 · 2021-03-02T20:40:38.000Z

Hi @cfrioux. Here is the file you asked for:

betaox.emapper.zip

Thank you very much

Answer 3 · 2021-03-04T14:14:52.000Z

Hi @Lucas-Maciel,

I have pushed a commit on the genomic_update branch that should fix this issue:

https://github.com/AuReMe/emapper_to_gbk/tree/genomic_update

Can you test it?

Answer 4 · 2021-04-22T23:55:33.000Z

Hello, I'm having the same issue. Was this ever resolved?

Answer 5 · 2021-04-22T23:58:41.000Z

Actually I'm getting a different error too, seems to be something related to simplejson? I'm uploading my files now with the call and error message in a txt file.

emapper2gbk_test.zip

Answer 6 · 2021-04-23T09:39:15.000Z

Hi @kieft1bp-sys,

> Hello, I'm having the same issue. Was this ever resolved?

This issue has been resolved in the genomic_update branch of emapper2gbk.

> Actually I'm getting a different error too, seems to be something related to simplejson? I'm uploading my files now with the call and error message in a txt file.
>
> emapper2gbk_test.zip

Sorry for this error message. I am currently adding a more user-friendly message in the new version.

This error is linked to the argument '-n "AB48"' in your command line. The '-n' argument expects a complete taxon name. For example, you will get the same error if you put '-n "K-12"' instead of '-n "Escherichia coli K-12"'. If you have no genus or species name, you can use a higher-level taxon such as a family name, for example '-n "Enterobacteriaceae"'.

If you want to check that your taxon name is valid, you can test it with this URL (the one queried by emapper2gbk):
https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/

For example with 'Escherichia coli' (and replacing ' ' by '%20'):
https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Escherichia%20coli

If it returns a "No Results" message, either the taxon name is not in the database or there is a typo in your taxon name.
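The same check can be scripted. The sketch below builds the URL exactly as described above (spaces become '%20') and, optionally, queries it. The helper names are hypothetical, and the handling of the response is an assumption: it treats an HTTP error, a non-JSON body, or an empty result as "not found", since the exact shape of the "No Results" reply is not documented in this thread.

```python
import json
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

ENA_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/"


def taxon_url(name):
    """Build the lookup URL; quote() turns ' ' into '%20'."""
    return ENA_URL + quote(name)


def taxon_is_known(name, timeout=10):
    """Return True if the service answers with a non-empty JSON result.

    Assumption: a missing taxon yields an HTTP error, a non-JSON body
    such as "No results.", or an empty JSON value.
    Network failures other than HTTP errors are not caught here.
    """
    try:
        with urlopen(taxon_url(name), timeout=timeout) as response:
            body = response.read().decode("utf-8")
    except HTTPError:
        return False
    try:
        records = json.loads(body)
    except json.JSONDecodeError:
        return False
    return bool(records)


print(taxon_url("Escherichia coli"))
```

Calling `taxon_is_known("Escherichia coli K-12")` before launching emapper2gbk would then flag a name such as "AB48" early, under the assumptions stated above.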

Also, it seems that you are using a newer version of eggnog-mapper (2.1.2), which changes the format of the output, so the current version of emapper2gbk will not work with it. I have pushed a new commit on the genomic_update branch that should fix this issue.

But I think there will still be two issues. First, the nucleotide FASTA you provided contains gene sequences rather than genome (chromosome) sequences, so the genome mode will not work with it. Second, the GFF file is not in the format expected by emapper2gbk (presented here).

If you use the genomic_update branch, you could obtain a genbank file with your "AB48.fna", "AB48.faa" and "AB48.emapper.annotations.tsv" by using the command:

emapper2gbk genes -fn AB48.fna -fp AB48.faa -a AB48.emapper.annotations.tsv -o AB48.gbk -n "Taxon name"

Answer 7 · 2021-04-23T15:10:18.000Z

Thanks for the extensive answer! I'll try out your suggestions today.

Answer 8 · 2021-04-24T00:09:01.000Z

I tried using the last command you suggested after installing the new branch and the program runs fine but does not bring in any annotations from the emapper annotations file (see attached .gbk).

AB48.gbk.txt

Answer 9 · 2021-04-24T00:23:54.000Z

Also, I modified my .gff file according to the format you linked to (see attached .gff) and tried running in "genomes" mode with my correct genome assembly .fna file (see attached .fna). It ran fine but produced an odd-looking gbk file (attached .gbk), so maybe my reformatting didn't help. (adding .txt to all file extensions because github needs it).

AB48_genomes_mode.gbk.txt
AB48_genome.fna.txt
AB48_updated.gff.txt

Answer 10 · 2021-04-26T08:52:29.000Z

> I tried using the last command you suggested after installing the new branch and the program runs fine but does not bring in any annotations from the emapper annotations file (see attached .gbk).
>
> AB48.gbk.txt

emapper2gbk only extracts GO terms, EC numbers and gene names from the eggnog-mapper file. Genes without any of these annotations will not be annotated in the GenBank file. For example, the first 3 genes in the GenBank file are not annotated because they have no GO terms, EC numbers or gene names in the eggnog-mapper annotation file.

But if you move down in the file, you can see that the gene "contig_5_1000" is annotated. Or you can search in the file for "go_component", "gene", "go_function", "go_process" or "EC_number" to find annotations from eggnog-mapper.
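Rather than scrolling through the file, the search described above can be done with a short script. This is only a sketch: it assumes the qualifiers appear in the usual GenBank flat-file form (`/key="value"` on its own line) and it matches `/gene=` rather than the bare word "gene" so that feature lines are not counted. The sample text and the values in it are made up.

```python
from collections import Counter

# Qualifier names listed in the reply above.
QUALIFIERS = ("gene", "EC_number", "go_component", "go_function", "go_process")


def count_qualifiers(genbank_text):
    """Count lines that start a qualifier such as /gene= or /EC_number=."""
    counts = Counter()
    for line in genbank_text.splitlines():
        stripped = line.strip()
        for key in QUALIFIERS:
            if stripped.startswith(f"/{key}="):
                counts[key] += 1
    return counts


# Made-up excerpt for illustration, NOT taken from the attached file.
sample = '''     CDS             1..300
                     /locus_tag="contig_5_1"
     CDS             400..900
                     /locus_tag="contig_5_1000"
                     /gene="hypothetical_name"
                     /EC_number="1.1.1.1"
                     /go_function="GO:0003824"
'''
counts = count_qualifiers(sample)
print(dict(counts))
```

On a real file, `count_qualifiers(open("AB48.gbk").read())` would give a quick idea of how many annotations were carried over; counts of zero for every key would suggest that nothing was transferred.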

> Also, I modified my .gff file according to the format you linked to (see attached .gff) and tried running in "genomes" mode with my correct genome assembly .fna file (see attached .fna). It ran fine but produced an odd-looking gbk file (attached .gbk), so maybe my reformatting didn't help. (adding .txt to all file extensions because github needs it).
>
> AB48_genomes_mode.gbk.txt
> AB48_genome.fna.txt
> AB48_updated.gff.txt

In this GenBank file, there are no annotations and no protein sequences associated with the genes. I think this is because, after you updated the GFF file, the CDS IDs no longer match the IDs in the "AB48.fna" and "AB48.emapper.annotations.tsv" files.
For example, in the GFF file "cds-contig_5_1" is the CDS ID for "contig_5_1". So emapper2gbk searches for the ID "cds-contig_5_1" in "AB48.fna" and in "AB48.emapper.annotations.tsv", but does not find it because in these files the sequence is still labelled "contig_5_1".

Updating both "AB48.fna" and "AB48.emapper.annotations.tsv" with the "cds-contig" ID should fix this.
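One possible way to do this renaming is sketched below. It is not part of emapper2gbk. It assumes that the sequence ID is the first token of each FASTA header, that the ID is the first tab-separated column of the annotations file, and that comment lines there start with '#'. The function names are hypothetical, and IDs that already carry the prefix are left alone.

```python
def prefix_fasta_ids(fasta_text, prefix="cds-"):
    """Add the prefix right after '>' on every FASTA header line."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and not line[1:].startswith(prefix):
            out.append(">" + prefix + line[1:])
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def prefix_annotation_ids(tsv_text, prefix="cds-"):
    """Add the prefix to the first column of every non-comment line."""
    out = []
    for line in tsv_text.splitlines():
        is_data = bool(line) and not line.startswith("#")
        if is_data and not line.startswith(prefix):
            out.append(prefix + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Made-up miniature inputs for illustration.
fasta = ">contig_5_1\nATGAAA\n"
annotations = "#query\tGOs\ncontig_5_1\tGO:0008150\n"
print(prefix_fasta_ids(fasta), end="")
print(prefix_annotation_ids(annotations), end="")
```

Writing the results to new files rather than overwriting "AB48.fna" and "AB48.emapper.annotations.tsv" keeps the originals available for the "genes" mode, where the unprefixed IDs are still the ones that match.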