[BUG] MaxBin2 failing due to one missing coverage
Opened this issue · 2 comments
Short description of the problem
Hello again. I found a rather bizarre case in my testing and I'm a bit stuck now. I'm trying to run anvi-cluster-contigs
with MaxBin2, and it's failing due to a single missing coverage value. I confirmed that the /tmp
files indeed are missing that value, and it's precisely the last one (last contig), which is very suspicious. Strangely enough, I only have this issue with one out of three samples.
Any help is greatly appreciated!
anvi'o version
Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.8
Profile database .............................: 40
Contigs database .............................: 23
Pan database .................................: 17
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2
System info
OS: Rocky Linux 8.6 (Green Obsidian). Installed using the instructions for developer version.
Detailed description of the issue
I'm running a metagenome workflow using three samples, and only one of them causes the issue. The specific step is:
anvi-cluster-contigs -c 03_CONTIGS/d071_br05_S7-contigs.db -p 06_MERGED/d071_br05_S7/PROFILE.db --driver maxbin2 --collection-name maxbin2 --just-do-it
In the temporary folder:
[tmp87wnge7z]$ ls
contig_coverages_log_norm.txt MAXBIN_.contig.tmp.abund1 MAXBIN_.contig.tmp.frag.out sequence_splits.fa
contig_coverages.txt MAXBIN_.contig.tmp.frag.err MAXBIN_.log.tmp split_coverages_log_norm.txt
logs.txt MAXBIN_.contig.tmp.frag.faa MAXBIN_.tooshort split_coverages.txt
MAXBIN_.contig.tmp MAXBIN_.contig.tmp.frag.ffn sequence_contigs.fa
[tmp87wnge7z]$ wc -l contig_coverages.txt
34990 contig_coverages.txt
[tmp87wnge7z]$ grep -c '^>' sequence_contigs.fa
34990
As you can see, the sequence_contigs.fa
has 34,990 sequences, but the contig_coverages.txt
file only has values for 34,989 contigs (plus the header line). The one missing is the very last one:
[tmp87wnge7z]$ grep '^>' sequence_contigs.fa | tail
>d071_br05_S7_000000034981
>d071_br05_S7_000000034982
>d071_br05_S7_000000034983
>d071_br05_S7_000000034984
>d071_br05_S7_000000034985
>d071_br05_S7_000000034986
>d071_br05_S7_000000034987
>d071_br05_S7_000000034988
>d071_br05_S7_000000034989
>d071_br05_S7_000000034990
[tmp87wnge7z]$ tail contig_coverages.txt
d071_br05_S7_000000034980 4.011795281887245 7.129548180727709 4.939624150339864
d071_br05_S7_000000034981 4.126349460215914 5.806077568972411 5.666933226709316
d071_br05_S7_000000034982 3.5199920031987206 2.8886445421831266 3.0765693722510994
d071_br05_S7_000000034983 183.91561687662468 345.2587482503499 260.255548890222
d071_br05_S7_000000034984 35.42771445710858 66.32173565286942 69.4127174565087
d071_br05_S7_000000034985 10.773045390921816 16.419516096780644 9.803239352129575
d071_br05_S7_000000034986 7.404319136172766 12.067986402719455 8.539092181563687
d071_br05_S7_000000034987 6.265146970605879 3.786042791441712 8.753449310137972
d071_br05_S7_000000034988 4.5730853829234155 8.680263947210557 8.938212357528494
d071_br05_S7_000000034989 4.8028394321135774 3.6392721455708856 3.305138972205559
After a few minutes, the command above returns:
✖ anvi-cluster-contigs encountered an error after 0:11:43.825390
Config Error: Some critical output files are missing. Please take a look at the log file:
/tmp/tmp87wnge7z/logs.txt
And the logs.txt
file has:
# DATE: 23 Jul 24 20:37:22
# CMD LINE: run_MaxBin.pl -contig /tmp/tmp87wnge7z/sequence_contigs.fa -abund /tmp/tmp87wnge7z/contig_coverages.txt -out /tmp/tmp87wnge7z/MAXBIN_ -thread 1
MaxBin 2.2.7
Input contig: /tmp/tmp87wnge7z/sequence_contigs.fa
Located abundance file [/tmp/tmp87wnge7z/contig_coverages.txt]
out header: /tmp/tmp87wnge7z/MAXBIN_
Thread: 1
Searching against 107 marker genes to find starting seed contigs for [/tmp/tmp87wnge7z/sequence_contigs.fa]...
Running FragGeneScan....
Running HMMER hmmsearch....
Done data collection. Running MaxBin...
Command: /gpfs/gpfs1/scratch/c9881009/local/.conda/envs/anvio-dev/opt/MaxBin-2.2.7/src/MaxBin -fasta /tmp/tmp87wnge7z/MAXBIN_.contig.tmp -abund /tmp/tmp87wnge7z/MAXBIN_.contig.tmp.abund1 -seed /tmp/tmp87wnge7z/MAXBIN_.seed -out /tmp/tmp87wnge7z/MAXBIN_ -min_contig_length 1000
Failed to get Abundance information for contig [d071_br05_S7_000000034990] in file [/tmp/tmp87wnge7z/MAXBIN_.contig.tmp.abund1]
Error encountered while running core MaxBin program. Error recorded in /tmp/tmp87wnge7z/MAXBIN_.log.
Program Stop.
I checked using other drivers (e.g., CONCOCT or MetaBat2), and the same issue is present (i.e., one missing coverage value) but it simply doesn't fail because those programs silently ignore missing values (my guess).
Files / commands to reproduce the issue
Command
anvi-cluster-contigs -c 03_CONTIGS/d071_br05_S7-contigs.db -p 06_MERGED/d071_br05_S7/PROFILE.db --driver maxbin2 --collection-name maxbin2 --just-do-it
Files
The files are relatively big, so I'm making them available only temporarily here (apologies if you find this issue years from now and want to reproduce it!):
03_CONTIGS/d071_br05_S7-contigs.db
(927M)06_MERGED/d071_br05_S7/PROFILE.db
(1.6G)
Thank you!
Miguel.
Some (potentially) important additional information: I just noticed that the "missing coverage" is for a contig shorter than the limit I had set. I ran anvi-profile
with --min-contig-length 5000
, but this contig is 4,167 bp in length, so it shouldn't even be here. Any ideas on how it made it through?
I'm using this config file for anvi-run-eworkflow
: https://gist.github.com/lmrodriguezr/65cf33cb0ec5706a348d0189da87b63b
Thanks!
Miguel.
Apologies for the frustration here, @lmrodriguezr. My general response for #2309 sadly applies here as well. But there is certainly a bug here given you mentioned this:
Some (potentially) important additional information: I just noticed that the "missing coverage" is for a contig shorter than the limit I had set. I ran
anvi-profile
with--min-contig-length 5000
, but this contig is 4,167 bp in length, so it shouldn't even be here. Any ideas on how it made it through?
I think the problem here likely stems from anvi-cluster-contigs
reading from the contigs-db rather than profile-db to figure out which contigs to report. That will always lead to an issue since you may have more contigs in contigs-db compared to the linked profile-db as a function of flags like --min-contig-length
that excludes some contigs from the profiled results.
One quick question: I presume there are many more contigs in the contigs-db that are shorter than 4,167, right? Because if that is the case, perhaps the bug is not coming from where I think it is :)