marbl/binnacle

Issue with binnacle output with concoct and metabat2

FabbriniMarco opened this issue · 1 comments

Hi everyone, and thank you for your attention.

I've run through the complete Binnacle pipeline flawlessly and ran Collate.py to get a Feature-Matrix for concoct and another for metabat.

However, when it came to feeding Binnacle's output to the binning algorithms, I started struggling. Let's start with concoct. I'm using version 1.0.0:

concoct --version

concoct 1.0.0

concoct -t 30 --composition_file Scaffolds.fasta --coverage_file Feature-Matrix-concoct.txt -b test_concoct

/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/concoct/input.py:82: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
cov = p.read_table(cov_file, header=0, index_col=0)
Traceback (most recent call last):
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/bin/concoct", line 88, in <module>
results = main(args)
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/bin/concoct", line 40, in main
args.seed
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/concoct/transform.py", line 5, in perform_pca
pca_object = PCA(n_components=nc, random_state=seed).fit(d)
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 340, in fit
self._fit(X)
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 381, in _fit
copy=self.copy)
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/sklearn/utils/validation.py", line 573, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I checked my Feature-Matrix for NaN, NA, and Inf values, but all values seemed fine. Here is the feature matrix structure (I have 12 samples in my test dataset; output truncated for clarity):

head -4 Feature-Matrix-concoct.txt

Binnacle_Scaffold_1 53.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Binnacle_Scaffold_2 125.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Binnacle_Scaffold_3 63.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Binnacle_Scaffold_4 40.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

I then tested the metabat-formatted feature matrix with metabat v2.15:

metabat2 -t 30 -i Scaffolds.fasta -a Feature-Matrix-metabat.txt -o test_metabat

MetaBAT 2 (2.15 (Bioconda)) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, maxEdges 200 and minClsSize 200000. with random seed=1641895722
terminate called after throwing an instance of 'boost::wrapexcept<boost::bad_lexical_cast>'
what(): bad lexical cast: source type value could not be interpreted as target
Aborted

I would also like to point out that both binners work without problems in standard binning pipelines such as metaWRAP, so I don't think my installation is the issue here.

Is there any way I can overcome this issue? Am I missing something? If more data are needed, I would be more than willing to add them to this post.

Thanks!

Marco

I re-checked my Feature-Matrix and noticed that the value was missing for a few comparisons. Going through the complete verbose output of the mapping step that precedes Estimate_Abundance.py, I found that heavy usage of the cluster I was working on had caused disk access problems, and some samples had been skipped as a result.

If you run into the same problem, carefully check for blank cells in your Feature-Matrix table.
After re-running the missing mappings between samples and sorting the coverage files, I obtained a complete Feature-Matrix.
With this matrix as input, concoct ran flawlessly and produced the *_clustering_gt1000.csv file.
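
For anyone who wants to automate the blank-cell check, here is a minimal Python sketch. It is not part of Binnacle, concoct or metabat; the function name `find_bad_cells` and the `has_header` parameter are hypothetical, and it assumes the Feature-Matrix is tab-separated with the scaffold name in the first column. It flags cells that are blank, non-numeric, NaN or infinite, plus rows whose field count differs from the first row.

```python
import csv
import math


def find_bad_cells(matrix_path, delimiter="\t", has_header=False):
    """Return a list of (line_number, column_index, raw_value) tuples.

    Hypothetical helper, assuming a tab-separated Feature-Matrix whose
    first column holds the scaffold name. A cell is reported when it is
    blank, not parseable as a number, or parses to NaN/infinity. A row
    whose field count differs from the first row is reported with
    column_index set to None.
    """
    bad = []
    expected_width = None
    with open(matrix_path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        for line_number, row in enumerate(reader, start=1):
            if not row:
                continue  # ignore completely empty lines
            if expected_width is None:
                expected_width = len(row)
            elif len(row) != expected_width:
                message = "row has %d fields, expected %d" % (len(row), expected_width)
                bad.append((line_number, None, message))
            if has_header and line_number == 1:
                continue  # header cells are names, not numbers
            # Column 0 is the scaffold name, so start checking at column 1.
            for column_index, cell in enumerate(row[1:], start=1):
                try:
                    value = float(cell)
                except ValueError:
                    bad.append((line_number, column_index, cell))
                    continue
                if not math.isfinite(value):
                    bad.append((line_number, column_index, cell))
    return bad
```

Calling `find_bad_cells("Feature-Matrix-concoct.txt")` returns an empty list when nothing suspicious is found; otherwise each entry points to the line and column that needs attention, which should correspond to the samples whose mapping has to be re-run.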

As reported in issue #4, I had to rename the Scaffolds.fasta headers and the names in the first column of the Feature-Matrix accordingly, so that they no longer end with a number.
Here is the code written by Chrisjrt in issue #4, which worked perfectly for me as well:

It looks to have run OK after I appended '_contig' to the end of the fasta headers and their associated column in the Feature-Matrix using a couple of bash one-liners and re-running it with their outputs as per below:

> sed 's/>.*/&_contig/' Scaffolds.fasta > Scaffolds_edit.fasta
> awk 'BEGIN{FS=OFS="\t"}{$1=$1"_contig"}1' Feature-Matrix-concoct.txt > Feature-Matrix-concoct_edit.txt
> concoct -t 10 --composition_file Scaffolds_edit.fasta --coverage_file Feature-Matrix-concoct_edit.txt -b test
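
If you prefer to do the renaming in Python, the sketch below mirrors the two one-liners above and adds a check that the FASTA headers and the first column of the Feature-Matrix still agree afterwards. The function names are hypothetical, and it assumes a tab-separated matrix and FASTA headers that consist of the scaffold name only (no description after the name). Like the awk command, it also appends the suffix to the first field of a header line if your matrix has one.

```python
def add_suffix_to_fasta(in_path, out_path, suffix="_contig"):
    """Append suffix to every FASTA header line (lines starting with '>')."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                line = line.rstrip("\r\n") + suffix + "\n"
            dst.write(line)


def add_suffix_to_matrix(in_path, out_path, suffix="_contig", delimiter="\t"):
    """Append suffix to the first field of every line of the matrix."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\r\n").split(delimiter)
            fields[0] += suffix
            dst.write(delimiter.join(fields) + "\n")


def compare_ids(fasta_path, matrix_path, delimiter="\t"):
    """Return (only_in_fasta, only_in_matrix) as two sets of names.

    Assumption: the whole header text after '>' is the scaffold name.
    """
    with open(fasta_path) as handle:
        fasta_ids = {
            line[1:].strip()
            for line in handle
            if line.startswith(">") and line[1:].strip()
        }
    with open(matrix_path) as handle:
        matrix_ids = {
            line.rstrip("\r\n").split(delimiter)[0]
            for line in handle
            if line.strip()
        }
    return fasta_ids - matrix_ids, matrix_ids - fasta_ids
```

After running both renaming functions, `compare_ids("Scaffolds_edit.fasta", "Feature-Matrix-concoct_edit.txt")` should return two empty sets; any name left in either set means the two files have drifted apart (a matrix header line, if present, will show up in the second set).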