ncbi/pgap

[BUG] Genus species requirement

rsieskind opened this issue · 8 comments

Describe the bug
A clear and concise description of what the bug is.

When I try to use pgap to structurally and functionally annotate the new bacteriophage genome that I recently reconstructed, I am forced to provide the -s organism option. When I enter -s "Cobetia marina Bacteriophage 5", I get the error message "Unknown organism Cobetia marina Bacteriophage 5".

To Reproduce
If you are having trouble with your genome, please ensure that you can run the pipeline with one of our test genomes first. If your installation works fine with the sample input, please tell us if you are willing and able to share your genome with us, if asked.

I ran the mycoplasmoides-based example from the quickstart guide successfully. I can share my genome if needed.

Expected behavior
A clear and concise description of what you expected to happen.

I wanted the pipeline to detect genes and to search for their potential functions from scratch. Does pgap really need a reference organism to work?

Software versions (please complete the following information):

  • OS: [e.g. CentOS 7, Windows 10, etc.]: RedHat 8

  • pgap.py --version, or docker image version: pgap_2023-05-17.build6771.sif

  • Docker (or other container runner) version. [e.g. docker --version]: apptainer version 1.3.1

Log Files
Please rerun pgap.py with the --debug flag and attach an archive (e.g. zip or tarball) of the logs in the directory: debug/tmp-outdir/*/*.log.

Carin5_results5_debug.log

Additional context
Add any other context about the problem here.

NA

Thank you, user @rsieskind for your report and following the proposed format! Truly appreciated!

As for the essence of the error, "Cobetia marina Bacteriophage 5" does not look like a proper prokaryotic species name (it says "phage"). It is not even present in our taxonomy database (I checked using our NCBI gettax application).
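
For readers without access to gettax, a rough equivalent can be sketched with the public NCBI E-utilities `esearch` endpoint queried against the Taxonomy database. This is only an illustration: the helper names `taxonomy_search_url` and `hit_count` are hypothetical, and a public lookup is not guaranteed to agree with the check that PGAP performs internally.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Public E-utilities search endpoint (not the internal gettax tool).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def taxonomy_search_url(name: str) -> str:
    """Build an esearch URL that looks up `name` in the Taxonomy database."""
    query = urllib.parse.urlencode({"db": "taxonomy", "term": name})
    return f"{ESEARCH}?{query}"


def hit_count(esearch_xml: str) -> int:
    """Return the top-level <Count> value of an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return int(root.findtext("Count", default="0"))


if __name__ == "__main__":
    # Print the URL to open (or fetch) for the name used in this ticket.
    print(taxonomy_search_url("Cobetia marina Bacteriophage 5"))
```

A count of zero for a name suggests that it is unlikely to be accepted as the -s value, although that is an inference rather than documented behavior.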

Also: I would recommend upgrading your pgap version to the most recent one (you are using a version from last May, and we have had two releases since then), because we support only the latest release.

Thank you @azat-badretdin for your rapid answer.

Updating PGAP on the cluster I used will take time, so I installed the latest version locally (2024-04-27.build7426) and launched a run with the inputs I sent previously by mail (command: './pgap.py -r -o genus_results1 genus.yaml > genus_results1.log'). The logs are below.
genus_results1.log
cwltool.log
I have the Docker version 1.13.1, build 7d71120/1.13.1

This time, I get the same error with the mycoplasmoides-based quick start example.

Apparently the --platform flag is not supported. I tried commenting out the line that calls it in the .py file, but an update changes the file back every time.

Once this new bug is fixed, we can come back to the previous problem, about which I have an important question: is PGAP able to annotate a completely new genus and species?

Thank you, Rémi. The first log reads:

PGAP version 2024-04-27.build7426 is up to date.
Output will be placed in: /home/rsieskin/pgap-master/scripts/C5/Carin5_results1
PGAP failed, docker exited with rc = 125
Unable to find error in log file.

Also, cwltool.log shows that execution apparently never reaches the stage where the CWL workflows run.

This brings the focus to what is happening with your container runner. From this

apptainer version 1.3.1

I conclude that, under the guise of docker, you are actually running singularity, which was recently renamed apptainer. Please try specifying --docker /your/path/to/singularity
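
The suggestion above can be sketched as a small helper that looks for a container runner on PATH and builds the matching --docker argument. The function names `find_runner` and `docker_flag` are hypothetical, and the candidate list is an assumption: this thread only confirms that `--docker singularity` works.

```python
import shutil
from typing import Callable, List, Optional, Sequence


def find_runner(
    candidates: Sequence[str] = ("singularity", "apptainer", "docker"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


def docker_flag(runner_path: Optional[str]) -> List[str]:
    """Build the --docker argument pair, or an empty list if nothing was found."""
    return ["--docker", runner_path] if runner_path else []


if __name__ == "__main__":
    # Shows which runner would be passed to pgap.py on this machine.
    print(docker_flag(find_runner()))
```

Passing the lookup function as a parameter keeps the helper testable on machines where none of the runners are installed.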

Also: as you're aware, we have simplified user input. Instead of a YAML file, the user can now supply two simple parameters: -s with the taxonomy item and -g with the path to the FASTA file. In this particular ticket, this removes the need to post the YAML file here if something goes wrong in the middle of the actual execution.
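
As an illustration of the two-parameter form, the sketch below assembles a command line from the flags mentioned in this thread (-r, -o, -g, -s and --docker). The function name `pgap_command` and the file names in the usage example are hypothetical; the command is only printed, not executed.

```python
import shlex
from typing import List, Optional


def pgap_command(
    fasta: str,
    organism: str,
    outdir: str,
    runner: Optional[str] = None,
) -> List[str]:
    """Assemble a pgap.py argument list that uses -g/-s instead of a YAML file."""
    cmd = ["./pgap.py", "-r", "-o", outdir, "-g", fasta, "-s", organism]
    if runner:
        # Optional container runner, as suggested earlier in the thread.
        cmd += ["--docker", runner]
    return cmd


if __name__ == "__main__":
    cmd = pgap_command("genome.fasta", "Cobetia marina", "genus_results3")
    # shlex.join quotes the organism name because it contains a space.
    print(shlex.join(cmd))
```

Building the command as a list avoids quoting mistakes with organism names that contain spaces.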

Dear Azat,

I installed singularity on my machine and relaunched a job with the command ./pgap.py -r -o genus_results2 --docker singularity genus.yaml > genus_results2.log.

genus_results2.log
cwltool.log

It is now working for the mycoplasmoides-based quick start example, but still not for my data. We are thus back to the previous problem.

I understand that the -s flag makes pgap.py easier to use, but my problem remains: my genome is brand new and has no closely related organism. So, which genus should I give so that the annotation starts?

Hello.
Describe the bug
My bug is the following.

I want to use pgap for structural annotation. I am working with the genus Actinomyces. I am forced to provide the -s organism option. When I enter only -s "Actinomyces", I get the error message "Fall to complete".

The documentation says that it is possible to provide only the genus, but this does not work. To complete the job, you need both the genus and the species.

Could you help me?

User @cristoyerenahs could you please open a separate ticket? Thanks!

"... which genus should I give so that the annotation starts?"

@rsieskind

You seem to be trying to annotate a bacteriophage, no?

PGAP (Prokaryotic Genome Annotation Pipeline) is designed and optimized to annotate bacteria and archaea, not viruses or phages, which is why your bacteriophage does not exist in the organism database. Prophages that are incorporated into a bacterial genome (like the well-studied prophages of Salmonella) are annotated using the genus and species designation of the bacterium. If you choose, you can use Cobetia marina as the organism, but be aware that the results may be questionable.
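
If a phage label starts with the host's genus and species, as it does in this ticket, the value to try with -s can be derived with a naive heuristic. This sketch is not part of PGAP; the function name `host_binomial` is hypothetical, and the rule assumes that the first two words of the label are the host binomial, which will not hold for every phage name.

```python
def host_binomial(label: str) -> str:
    """Return the first two words of a label, assumed to be 'Genus species'."""
    words = label.split()
    if len(words) < 2:
        raise ValueError(f"cannot extract a genus and species from {label!r}")
    return " ".join(words[:2])


if __name__ == "__main__":
    # Prints the host name for the phage label used in this ticket.
    print(host_binomial("Cobetia marina Bacteriophage 5"))
```

As noted above, annotation results obtained with the host name in place of a phage name may be questionable.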