ekimb/rust-mdbg

Recommended parameters for metagenome assembly and a related question

xfengnefx opened this issue · 5 comments

Hi,

I want to try mdBG on real metagenome samples. I wonder if you could suggest a parameter combo to use (or combos to try out). And should I do the multi-k mode?

For the real samples, I could crudely guess the number of species in the library, and perhaps an exaggerated total genome size from it as well. I'm not sure if these could be useful.

Another question is: could mdBG output contig coverage estimates?

Thank you!

Dear Xiaowen, thanks for your interest!

For mdbg on metagenomes (or in fact isolates too), there are several possible execution modes:

  1. single parameter
  2. automatic parameters (it will autodetect)
  3. multi-k

For 1., the experiments in our paper were run with -k 21 -l 14 --density 0.003, so it seems reasonable to try that. We never tested 2. on metagenomes, but I suspect it will also give reasonable parameters. Regarding 3., rust-mdbg also has a multi-k mode, but we didn't tune it for metagenomes, so I would not recommend running the current multi-k script on metagenomes.

We don't have a way to adjust parameters based on the number of species or the genome size. I suggest you just run with one of the two ways above (1. or 2.) and see if the results look reasonable. For mdbg in metagenomics, a reasonable result is that the per-species coverage is high but contiguity is lower than with hifiasm-meta.

In any case, please make sure to use the https://github.com/ekimb/rust-mdbg/blob/master/utils/magic_simplify_meta script and not the usual magic_simplify because otherwise too many contigs will be discarded.
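As a rough illustration of mode 1 followed by the metagenome simplification script, here is a minimal Python sketch that only builds the two command lines and prints them, without executing anything. The -k, -l and --density values are the ones quoted above. The binary path, the input file name reads.fq.gz, the prefix meta_asm, the --prefix flag and the assumption that magic_simplify_meta takes the prefix as its only argument (as magic_simplify does in the README) are assumptions to check against your checkout.

```python
import shlex


def mdbg_commands(reads, prefix, k=21, l=14, density=0.003,
                  mdbg_bin="target/release/rust-mdbg",      # assumed build location
                  simplify="utils/magic_simplify_meta"):    # script named in this thread
    """Return (assembly, simplification) command lines as argument lists.

    Nothing is executed here; the lists can be passed to subprocess.run().
    """
    assemble = [mdbg_bin, reads,
                "-k", str(k), "-l", str(l),
                "--density", str(density),
                "--prefix", prefix]           # --prefix: assumed from the README
    simplify_cmd = [simplify, prefix]         # assumed: prefix is the only argument
    return assemble, simplify_cmd


if __name__ == "__main__":
    # Hypothetical input file and output prefix, for illustration only.
    for cmd in mdbg_commands("reads.fq.gz", "meta_asm"):
        print(shlex.join(cmd))
```

Keeping the prefix in one place makes it harder to accidentally run the simplification script on a different output than the one the assembler just wrote.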

rust-mdbg does not output contig coverage estimates. The unsimplified output GFA does have the k-min-mer abundance of each node. That information isn't propagated to the simplified GFA, as I'm unsure how accurate it ends up being in terms of actual base coverage.
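If you want to look at the per-node abundance in the unsimplified GFA, something like the sketch below could be a starting point. It assumes the abundance is stored as an integer KC:i: tag on each S line, which is the conventional GFA count tag; please check a few lines of your own file first, since the tag name and the toy lines used here are assumptions and not real rust-mdbg output.

```python
def node_abundances(gfa_lines, tag="KC"):
    """Map segment name -> integer value of `tag` for every S line carrying it.

    `tag` defaults to "KC", which is an assumption about where the
    abundance is stored; adjust it to match the actual file.
    """
    abundances = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue  # skip headers, links and malformed lines
        # Optional GFA fields come after name and sequence, as TAG:TYPE:VALUE.
        for opt in fields[3:]:
            parts = opt.split(":", 2)
            if len(parts) == 3 and parts[0] == tag and parts[1] == "i":
                abundances[fields[1]] = int(parts[2])
                break
    return abundances


# Toy GFA lines made up for illustration.
example = [
    "H\tVN:Z:1.0\n",
    "S\t0\t*\tLN:i:1500\tKC:i:12\n",
    "S\t1\t*\tLN:i:900\tKC:i:3\n",
    "L\t0\t+\t1\t+\t0M\n",
]
print(node_abundances(example))
```

For a real file, pass an open file handle instead of the list. Keep in mind the caveat above: these are minimizer-space counts per node, not base-level coverage.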

Please let us know if you have any issues.

Best,
Rayan

Hi Rayan,

Thank you so much for the suggestions and for mentioning magic_simplify_meta. I will try the first two ways. I wasn't sure how total genome size and --density would interact; it's nice to know that this isn't a concern. I once accidentally set two parameters too low for HiCanu by not reading the docs...

I'll leave the issue open for now in case I need more advice from you. I will come back and close it by next week if I don't run into anything. Thank you!

Best,
Xiaowen

Thanks a lot for the help, the assembly runs were smooth. I have one additional question, though it is not related to the issue's title: have you tried BUSCO (eukaryotes) or CheckM (microbial) for evaluation? Could you offer some advice if so?

I tried CheckM1 and it seems to be confused by insertions. I have not tried CheckM2 yet.

Hi Xiaowen, great to hear.

We haven't run extensive evaluations using CheckM on our rust-mdbg metagenomes, but based on feedback from a collaborator, it makes sense that rough, unpolished metagenome assemblies such as the ones produced by rust-mdbg would get poor CheckM scores: indels cause frameshifts, which hurt the sensitivity of the gene detection method and thus lower the gene completeness score.

The gene is in fact likely present in the assembly but goes undetected, because those assembly assessment methods need high base quality. One possible workaround would be to run a polishing tool such as racon on the assembly, but this is just a hypothesis.
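To make the frameshift effect mentioned above concrete, here is a small Python toy example. The sequence is made up and has nothing to do with real assembly output. It only shows that a single extra base inside a homopolymer run leaves the upstream codons intact but changes every codon downstream, including the stop codon.

```python
def codons(seq):
    """Split a sequence into complete codons, reading from position 0."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Toy coding sequence: ATG GCT GAA ACC TTG TAA (start ... stop).
gene = "ATGGCTGAAACCTTGTAA"

# Simulate one indel error: an extra A inserted into the AAA homopolymer run.
with_insertion = gene[:7] + "A" + gene[7:]

original = codons(gene)
shifted = codons(with_insertion)

print("original:", " ".join(original))
print("shifted: ", " ".join(shifted))
```

The first three codons are identical in both versions, and everything after the insertion differs, so a gene caller sees a different downstream protein and no stop codon where one is expected.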

Rayan

Awesome, thank you for the suggestions.