ncbi/pgap

[BUG] <title>

Artifice120 opened this issue · 3 comments

Describe the bug
While trying to run PGAP on a new Rickettsia genome, I am forced to specify a species. Since there are no assemblies for this species of Rickettsia, I used the two recommended tax-check flags, as seen below:

PGAP_INPUT_DIR=/lustre/isaac/scratch/jtorre28/pgap

./pgap.py -r -o rick_results -g /lustre/isaac/scratch/jtorre28/spades/test_out/contamination_screening/rick3.contigs.filt.fa -s 'Rickettsia japonica' --taxcheck --auto-correct-tax --debug

However, I get the following error:

Filesystem                                      1K-blocks          Used    Available Use% Mounted on
172.31.0.24@o2ib:172.31.0.26@o2ib:/isaaclfs 3954753932040 2934672204344 820492104044  79% /lustre/isaac
Output will be placed in: /lustre/isaac/scratch/jtorre28/pgap/rick_results
PGAP version 2024-07-18.build7555 is up to date.
installation directory: /lustre/isaac/scratch/jtorre28/pgap
Skipping already installed tarball: https://s3.amazonaws.com/pgap/input-2024-07-18.build7555.tgz
Singularity sif files exists, not updating.
Downloading and extracting tarball: https://s3.amazonaws.com/pgap/input-2024-07-18.build7555.ani.tgz
WARNING: open files is less than the recommended value of 8000
TAXCHECK completed successfully.
DEBUG: args.output = rick_results
DEBUG: params.outputdir = /lustre/isaac/scratch/jtorre28/pgap/rick_results
ERROR: taxcheck failed to assign a species with high confidence, thus PGAP will not execute. See /lustre/isaac/scratch/jtorre28/pgap/rick_results/ani-tax-report.txt

This is the taxonomy report (ani-tax-report.txt):

ANI report for assembly: rick3.contigs.filt.fa
Submitted organism: Rickettsia japonica (taxid = 35790, rank = species, lineage = Bacteria; Pseudomonadota; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia; spotted fever group)
Best match: Rickettsia bellii (taxid = 33990, rank = species, lineage = Bacteria; Pseudomonadota; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia; belli group)
Submitted organism has type: Yes
Status: INCONCLUSIVE
Confidence: LOW

Table legend:
ANI : ANI value between this assembly and the type listed in this row
(Coverages) : query-coverage and subject-coverage of this assembly (query) and the type (subject)
NewSeq : the count of bases best assigned to this type assembly
CntmSeq : the portion of NewSeq allocated for purposes of evaluating contamination
Flg : Type flags; currently: C = contaminant; E = effectively published; T = trusted species
Assembly : Release-id of the type-assembly (this value matches the accession and assembly-name on the right column)
Organism : Organism of this type-assembly
(assembly_accession, assembly_name) : of this type-assembly

ANI     (Coverages)   NewSeq   CntmSeq  Assembly  Flg Organism  (assembly_accession, assembly_name)
------- ------------- -------- -------- --------- --- --------------------------------------------------------------------
 93.315 ( 64.4  61.7)   890567   890567     13188     Rickettsia bellii RML369-C (GCA_000012385.1, ASM1238v1)
 83.956 ( 16.9  17.9)     6461     3501   3134898     Rickettsia asembonensis (GCA_000828125.2, ASM82812v2)
 83.918 ( 16.4  14.4)     5415      661  21983708     Rickettsia tamurae subsp. buchneri (GCA_000696365.2, REISMNv1)
 83.933 ( 16.2  16.3)     4648     4648   1199088     Rickettsia tamurae (GCA_000751075.1, Rickettsia tamurae AT-1)
 83.850 ( 17.0  16.7)    15624      671   1655938     Rickettsia conorii subsp. raoultii (GCA_000940955.1, ASM94095v1)
 83.776 ( 16.2  16.3)     3299     3299   6004488     Rickettsia fournieri (GCA_900243065.1, PRJEB23962)
 83.701 ( 17.8  17.5)     3156     3156   1485538     Rickettsia hoogstraalii (GCA_000825685.1, Rickettsia hoogstraalii Croatica)
 83.714 ( 17.0  17.2)     1930     1930  37927588     Rickettsia tillamookensis (GCA_016743795.2, ASM1674379v2)
 83.813 ( 15.6  16.8)      492      492   1720158     Rickettsia monacensis (GCA_000499665.2, RMONA_1)
 83.584 ( 15.7  17.8)    43930    43930    406738     Rickettsia japonica YH (GCA_000283595.1, ASM28359v1)
 83.520 ( 14.5  16.7)      118      118   1526588     Rickettsia rickettsii str. Iowa (GCA_000017445.3, ASM1744v3)
 83.635 ( 15.6  16.8)      108      108    834068     Rickettsia gravesii BWI-1 (GCA_000485845.1, RicGra1.0)
 83.561 ( 15.4  17.6)        0        0    296048     Rickettsia conorii subsp. heilongjiangensis 054 (GCA_000221205.1, ASM22120v1)
 83.566 ( 15.4  17.7)        0        0    380228     Rickettsia honei RB (GCA_000263055.1, Rho1.0)
 83.602 ( 14.5  16.8)        0        0    432068     Rickettsia sibirica subsp. mongolitimonae HA-91 (GCA_000247625.2, ASM24762v2)
 83.595 ( 15.7  17.9)        0        0    864348     Rickettsia japonica YH (GCA_000302635.2, ASM30263v2)
 83.520 ( 14.5  16.7)        0        0   3973358     Rickettsia rickettsii (GCA_001950995.1, ASM195099v1)
 83.520 ( 14.5  16.7)        0        0   3973378     Rickettsia rickettsii (GCA_001951015.1, ASM195101v1)
 83.467 ( 14.9  16.4)     1776      324    392728     Rickettsia australis str. Phillips (GCA_000273745.1, Rau1.0)
 83.442 ( 15.6  17.9)      631      631      7828     Rickettsia conorii str. Malish 7 (GCA_000007025.1, ASM702v1)
 83.497 ( 15.9  18.2)        0        0    320558     Rickettsia slovaca 13-B (GCA_000237845.1, ASM23784v1)
 83.423 ( 15.0  17.4)        0        0    377238     Rickettsia conorii subsp. caspia A-167 (GCA_000261325.1, RcoCa1.0)
 83.406 ( 15.1  17.6)        0        0    381518     Rickettsia conorii subsp. israelensis ISTT CDC1 (GCA_000263815.1, RcoIs1.0)
 83.366 ( 15.4  17.2)        0        0    407678     Rickettsia rhipicephali str. 3-7-female6-CWPP (GCA_000284075.1, ASM28407v1)
 83.282 ( 15.6  18.2)       97       97    202718     Rickettsia sibirica 246 (GCA_000166935.1, ASM16693v1)
 82.808 ( 10.7  14.0)      766      766    591868     Rickettsia prowazekii str. Breinl (GCA_000367405.1, ASM36740v1)
 83.407 (  8.9  11.7)      300      300      8848     Rickettsia typhi str. Wilmington (GCA_000008045.1, ASM804v1)
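
For anyone who wants to inspect a report like this programmatically, here is a minimal Python sketch that pulls the table rows out of the report text and picks the row with the highest ANI. The regular expression, the `AniRow` type, and the `parse_ani_rows` helper are illustrative assumptions based only on the layout shown above; they are not part of PGAP, and the real report format may vary between releases.

```python
import re
from typing import List, NamedTuple

# Assumed row layout, inferred from the report above:
# ANI (qcov scov) NewSeq CntmSeq Assembly [Flg] Organism (accession, name)
ROW = re.compile(
    r"^\s*(?P<ani>\d+\.\d+)\s+"
    r"\(\s*(?P<qcov>\d+\.\d+)\s+(?P<scov>\d+\.\d+)\)\s+"
    r"(?P<newseq>\d+)\s+(?P<cntmseq>\d+)\s+(?P<assembly>\d+)\s+"
    r"(?:(?P<flags>[CET]+)\s+)?"  # Flg column is empty in the report above
    r"(?P<organism>.+?)\s+"
    r"\((?P<accession>GC[AF]_\d+\.\d+),\s*(?P<name>.+)\)\s*$"
)


class AniRow(NamedTuple):
    ani: float
    query_cov: float
    subject_cov: float
    new_seq: int
    organism: str
    accession: str


def parse_ani_rows(text: str) -> List[AniRow]:
    """Return one AniRow per table line; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append(AniRow(
                float(m["ani"]), float(m["qcov"]), float(m["scov"]),
                int(m["newseq"]), m["organism"], m["accession"],
            ))
    return rows


# Three rows copied from the report above, used as sample input.
SAMPLE = """\
 93.315 ( 64.4  61.7)   890567   890567     13188     Rickettsia bellii RML369-C (GCA_000012385.1, ASM1238v1)
 83.933 ( 16.2  16.3)     4648     4648   1199088     Rickettsia tamurae (GCA_000751075.1, Rickettsia tamurae AT-1)
 83.584 ( 15.7  17.8)    43930    43930    406738     Rickettsia japonica YH (GCA_000283595.1, ASM28359v1)
"""

rows = parse_ani_rows(SAMPLE)
best = max(rows, key=lambda r: r.ani)
print(f"{best.organism}: ANI {best.ani}, query coverage {best.query_cov}")
```

In practice the text would come from reading `ani-tax-report.txt` in the output directory rather than from the inline sample.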

Is there a way to run PGAP under the closest species, "Rickettsia bellii", and have it just keep all the "contaminating" sequences?

At the very least, a gene prediction file on its own would be enough. Then I could do the homology searches myself.

Additional context

This is a new Rickettsia species, so none of the NCBI references will match well.

Thank you for your report, user @Artifice120 !

Is there a way to run PGAP under the closest species, "Rickettsia bellii", and have it just keep all the "contaminating" sequences?

Sounds like a reasonable plan. If you add --ignore-all-errors to your command line, you might be able to get through to the end.
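
As a rough sketch of what the adjusted invocation might look like, the Python snippet below assembles the argument list and prints it as a shell command without running anything. It assumes the original flags from the report are kept unchanged, the species is switched to the best match, and only --ignore-all-errors is added; the `build_pgap_command` helper is hypothetical, and whether the tax-check flags should stay alongside the new flag is not confirmed in this thread.

```python
import shlex
from typing import List


def build_pgap_command(genome: str, species: str, outdir: str) -> List[str]:
    """Assemble the pgap.py argument list (hypothetical helper).

    All flags are taken from the command in the original report,
    plus --ignore-all-errors as suggested in the reply.
    """
    return [
        "./pgap.py", "-r",
        "-o", outdir,
        "-g", genome,
        "-s", species,
        "--taxcheck", "--auto-correct-tax",
        "--ignore-all-errors",
        "--debug",
    ]


cmd = build_pgap_command(
    genome="/lustre/isaac/scratch/jtorre28/spades/test_out/contamination_screening/rick3.contigs.filt.fa",
    species="Rickettsia bellii",
    outdir="rick_results",
)

# Print a copy-pasteable command line; shlex.join quotes the species name.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the PGAP installation directory would launch the run; printing it first makes it easy to review the flags before committing to a long job.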

Thanks,

The run seems to have finished with all outputs and a CheckM completeness of 97%. There are excess gene predictions, but that is expected.

You are welcome, user @Artifice120 !