understanding the output of Estimate_Abundances.py and Collate.py

Question

Jigyasa3 opened this issue 3 years ago · 2 comments

Hey!

Thanks for the great software! I am following the wiki to combine the scaffold coverages from two samples.
When I run Estimate_Abundances.py and Collate.py, I get two files: Scaffolds.fasta and Feature-Matrix-concoct.txt.

I wanted to ask:
a) Does the Scaffolds.fasta file correspond to the "final" scaffolds file of the two samples? Can I use it to run binning software (e.g. CONCOCT)? If yes, do I just concatenate the read files from the two samples, since CONCOCT requires one paired-end reads file along with the scaffolds file?
b) What is Collate.py doing? I get the Feature-Matrix-concoct.txt file, but what does it mean?
c) I ran Estimate_Abundances.py using sample1 or sample2 as the starting file for Coords_After_Delinking.txt. The two runs generate Scaffolds.fasta files with very different numbers of scaffolds. Do I just use the one with the larger number of scaffolds for (a)?

Looking forward to your reply!

Answer 1 · 2021-06-30T16:08:04.000Z

Thanks for reaching out. MetaCarvel produces graph scaffolds, and binnacle aims at estimating coverages on those graph scaffolds after accounting for any scaffolding errors. Binning software such as MetaBAT2, MaxBin 2.0 and CONCOCT typically requires two inputs: abundances and a FASTA file of the sequences you want to bin. In addition, when multiple samples are available, binning methods make use of abundances estimated by mapping the reads of every sample onto every sample in the dataset. For example, if there are 3 samples, then 3 sets of abundances can be estimated for each of the 3 samples. To that end, the Python program Collate.py merges the different abundance summaries produced by Estimate_Abundances.py.
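
To make the counting concrete, here is a minimal sketch that enumerates the (assembly, reads) pairs implied by the paragraph above. It is only an illustration: the sample names and the `mapping_jobs` helper are placeholders and are not part of binnacle.

```python
from itertools import product


def mapping_jobs(samples):
    """Pair every sample's scaffolds with every sample's reads.

    Returns a list of (assembly, reads) tuples, so N samples give N * N pairs.
    """
    return [(assembly, reads) for assembly, reads in product(samples, repeat=2)]


# Placeholder sample names for a three-sample dataset.
jobs = mapping_jobs(["Sample1", "Sample2", "Sample3"])

for assembly, reads in jobs:
    print(f"coverage of {assembly} scaffolds estimated from {reads} reads")
```

Each assembly appears in three pairs here, which corresponds to the three sets of abundances per sample mentioned above.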

Since you have two samples here, Sample1 and Sample2, you should first run Estimate_Abundances.py on Sample1. This establishes a coordinate system and the abundances. Based on this coordinate system, the abundances of Sample1 using the reads of Sample2 can then be computed. The same steps are repeated for Sample2. Once that is done, run Collate.py on the two binnacle outputs for each sample. This produces one Feature-Matrix-concoct.txt and one Scaffolds.fasta for each of the two samples. These can be used as inputs to bin the two samples separately using CONCOCT.
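
Conceptually, the collation step joins the per-read-set coverage summaries of one assembly into a single table with one row per scaffold. The sketch below illustrates that idea only; it is not binnacle's implementation, and the input structure (a dict of scaffold ID to mean coverage), the `contig` header and the `collate` function are assumptions made for the example.

```python
def collate(summaries):
    """Merge coverage summaries into a tab-separated feature matrix.

    summaries: dict mapping a read-set name to {scaffold_id: mean coverage}.
    Scaffolds missing from a read set get a coverage of 0.0 (an assumption).
    """
    columns = sorted(summaries)
    scaffolds = sorted(set().union(*(s.keys() for s in summaries.values())))
    rows = ["\t".join(["contig"] + columns)]
    for scaffold in scaffolds:
        values = [str(summaries[c].get(scaffold, 0.0)) for c in columns]
        rows.append("\t".join([scaffold] + values))
    return "\n".join(rows)


# Made-up coverages for the scaffolds of one assembly (say, Sample1),
# estimated once from its own reads and once from the other sample's reads.
summaries = {
    "Sample1_reads": {"scaffold_1": 12.5, "scaffold_2": 3.0},
    "Sample2_reads": {"scaffold_1": 4.0},
}
matrix = collate(summaries)
print(matrix)
```

The extra column from the other sample's reads is what gives the binner an additional feature for separating scaffolds.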

a. Each sample has its own Scaffolds.fasta file and Feature-Matrix-concoct.txt. Yes, you can use these to bin with CONCOCT. If you are using Feature-Matrix-concoct.txt, you can specify it with --coverage_file; in that case you don't have to pass the reads as an input. The Scaffolds.fasta is used to compute the Composition_Table, which is another input to CONCOCT. We refer you to the usage here.

b. As mentioned earlier, Collate.py combines the coverage summaries estimated by mapping the reads of all the samples in the dataset to the specified sample, and formats the output according to the binning method specified.

c. Binning is done for each sample in the dataset individually. The scaffolds are different for each of these samples because these are two different assemblies. As mentioned earlier, estimating coverages of Sample1 using reads of Sample2 is used as an additional feature in binning. We recommend binning each sample individually and then dereplicating if you want one set of bins for the entire dataset. Dereplicating bins is still an active area of research and we draw your attention to https://www.nature.com/articles/s41564-018-0171-1.
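
For point (a), a per-sample CONCOCT invocation could be assembled roughly as follows. This is a sketch under assumptions: the directory layout and output basenames are hypothetical, and the flags `--composition_file`, `--coverage_file` and `-b` are the ones listed in CONCOCT's usage, so please check them against `concoct --help` for your installed version.

```python
import shlex


def concoct_command(scaffolds_fasta, feature_matrix, basename):
    """Build the argument list for one CONCOCT run on a single sample."""
    return [
        "concoct",
        "--composition_file", scaffolds_fasta,  # sequences to bin
        "--coverage_file", feature_matrix,      # collated coverage table
        "-b", basename,                         # output prefix or directory
    ]


# Hypothetical paths: one run per sample, never mixing the two assemblies.
for sample in ["Sample1", "Sample2"]:
    cmd = concoct_command(
        f"{sample}/Scaffolds.fasta",
        f"{sample}/Feature-Matrix-concoct.txt",
        f"{sample}_concoct/",
    )
    print(shlex.join(cmd))
```

The printed lines can be pasted into a shell, or the list can be passed to `subprocess.run` once the paths are correct.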

Hope this answers your questions. Reach out if you have more.

Answer 2 · 2021-07-02T04:02:15.000Z

Thank you so much!