merenlab/anvio

[BUG] MaxBin2 failing due to one missing coverage

Opened this issue · 2 comments

Short description of the problem

Hello again. I found a rather bizarre case in my testing and I'm a bit stuck now. I'm trying to run anvi-cluster-contigs with MaxBin2, and it's failing due to a single missing coverage value. I confirmed that the /tmp files indeed are missing that value, and it's precisely the last one (last contig), which is very suspicious. Strangely enough, I only have this issue with one out of three samples.

Any help is greatly appreciated!

anvi'o version

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.8

Profile database .............................: 40
Contigs database .............................: 23
Pan database .................................: 17
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

OS: Rocky Linux 8.6 (Green Obsidian). Installed using the instructions for developer version.

Detailed description of the issue

I'm running a metagenome workflow using three samples, and only one of them causes the issue. The specific step is:

anvi-cluster-contigs -c 03_CONTIGS/d071_br05_S7-contigs.db -p 06_MERGED/d071_br05_S7/PROFILE.db --driver maxbin2 --collection-name maxbin2 --just-do-it

In the temporary folder:

[tmp87wnge7z]$ ls
contig_coverages_log_norm.txt  MAXBIN_.contig.tmp.abund1    MAXBIN_.contig.tmp.frag.out  sequence_splits.fa
contig_coverages.txt           MAXBIN_.contig.tmp.frag.err  MAXBIN_.log.tmp              split_coverages_log_norm.txt
logs.txt                       MAXBIN_.contig.tmp.frag.faa  MAXBIN_.tooshort             split_coverages.txt
MAXBIN_.contig.tmp             MAXBIN_.contig.tmp.frag.ffn  sequence_contigs.fa

[tmp87wnge7z]$ wc -l contig_coverages.txt 
34990 contig_coverages.txt

[tmp87wnge7z]$ grep -c '^>' sequence_contigs.fa 
34990

As you can see, the sequence_contigs.fa has 34,990 sequences, but the contig_coverages.txt file only has values for 34,989 contigs (plus the header line). The one missing is the very last one:

[tmp87wnge7z]$ grep '^>' sequence_contigs.fa | tail
>d071_br05_S7_000000034981
>d071_br05_S7_000000034982
>d071_br05_S7_000000034983
>d071_br05_S7_000000034984
>d071_br05_S7_000000034985
>d071_br05_S7_000000034986
>d071_br05_S7_000000034987
>d071_br05_S7_000000034988
>d071_br05_S7_000000034989
>d071_br05_S7_000000034990

[tmp87wnge7z]$ tail contig_coverages.txt 
d071_br05_S7_000000034980	4.011795281887245	7.129548180727709	4.939624150339864
d071_br05_S7_000000034981	4.126349460215914	5.806077568972411	5.666933226709316
d071_br05_S7_000000034982	3.5199920031987206	2.8886445421831266	3.0765693722510994
d071_br05_S7_000000034983	183.91561687662468	345.2587482503499	260.255548890222
d071_br05_S7_000000034984	35.42771445710858	66.32173565286942	69.4127174565087
d071_br05_S7_000000034985	10.773045390921816	16.419516096780644	9.803239352129575
d071_br05_S7_000000034986	7.404319136172766	12.067986402719455	8.539092181563687
d071_br05_S7_000000034987	6.265146970605879	3.786042791441712	8.753449310137972
d071_br05_S7_000000034988	4.5730853829234155	8.680263947210557	8.938212357528494
d071_br05_S7_000000034989	4.8028394321135774	3.6392721455708856	3.305138972205559

After a few minutes, the command above returns:

✖ anvi-cluster-contigs encountered an error after 0:11:43.825390


Config Error: Some critical output files are missing. Please take a look at the log file:
              /tmp/tmp87wnge7z/logs.txt                                                  

And the logs.txt file has:

# DATE: 23 Jul 24 20:37:22
# CMD LINE: run_MaxBin.pl -contig /tmp/tmp87wnge7z/sequence_contigs.fa -abund /tmp/tmp87wnge7z/contig_coverages.txt -out /tmp/tmp87wnge7z/MAXBIN_ -thread 1
MaxBin 2.2.7
Input contig: /tmp/tmp87wnge7z/sequence_contigs.fa
Located abundance file [/tmp/tmp87wnge7z/contig_coverages.txt]
out header: /tmp/tmp87wnge7z/MAXBIN_
Thread: 1
Searching against 107 marker genes to find starting seed contigs for [/tmp/tmp87wnge7z/sequence_contigs.fa]...
Running FragGeneScan....
Running HMMER hmmsearch....
Done data collection. Running MaxBin...
Command: /gpfs/gpfs1/scratch/c9881009/local/.conda/envs/anvio-dev/opt/MaxBin-2.2.7/src/MaxBin -fasta /tmp/tmp87wnge7z/MAXBIN_.contig.tmp  -abund /tmp/tmp87wnge7z/MAXBIN_.contig.tmp.abund1 -seed /tmp/tmp87wnge7z/MAXBIN_.seed -out /tmp/tmp87wnge7z/MAXBIN_ -min_contig_length 1000 
Failed to get Abundance information for contig [d071_br05_S7_000000034990] in file [/tmp/tmp87wnge7z/MAXBIN_.contig.tmp.abund1]
Error encountered while running core MaxBin program. Error recorded in /tmp/tmp87wnge7z/MAXBIN_.log.
Program Stop.

I checked using other drivers (e.g., CONCOCT or MetaBat2), and the same issue is present (i.e., one missing coverage value) but it simply doesn't fail because those programs silently ignore missing values (my guess).

Files / commands to reproduce the issue

Command

anvi-cluster-contigs -c 03_CONTIGS/d071_br05_S7-contigs.db -p 06_MERGED/d071_br05_S7/PROFILE.db --driver maxbin2 --collection-name maxbin2 --just-do-it

Files

The files are relatively big, so I'm making them available only temporarily here (apologies if you find this issue years from now and want to reproduce it!):

  • 03_CONTIGS/d071_br05_S7-contigs.db (927M)
  • 06_MERGED/d071_br05_S7/PROFILE.db (1.6G)

Thank you!
Miguel.

Some (potentially) important additional information: I just noticed that the "missing coverage" is for a contig shorter than the limit I had set. I ran anvi-profile with --min-contig-length 5000, but this contig is 4,167 bp in length, so it shouldn't even be here. Any ideas on how it made it through?

I'm using this config file for anvi-run-eworkflow: https://gist.github.com/lmrodriguezr/65cf33cb0ec5706a348d0189da87b63b

Thanks!
Miguel.

Apologies for the frustration here, @lmrodriguezr. My general response for #2309 sadly applies here as well. But there is certainly a bug here given you mentioned this:

Some (potentially) important additional information: I just noticed that the "missing coverage" is for a contig shorter than the limit I had set. I ran anvi-profile with --min-contig-length 5000, but this contig is 4,167 bp in length, so it shouldn't even be here. Any ideas on how it made it through?

I think the problem here likely stems from anvi-cluster-contigs reading from the contigs-db rather than profile-db to figure out which contigs to report. That will always lead to an issue since you may have more contigs in contigs-db compared to the linked profile-db as a function of flags like --min-contig-length that excludes some contigs from the profiled results.

One quick question: I presume there are many more contigs in the contigs-db that are shorter than 4,167, right? Because if that is the case, perhaps the bug is not coming from where I think it is :)