xfengnefx/hifiasm-meta

hifiasm-meta produces redundant assemblies?

Closed this issue · 6 comments

Hello,

I performed de novo assembly on two human faecal metagenomes sequenced with PacBio Sequel II.
I tested metaFlye (2.9-b1768) and hifiasm-meta (v0.2.1).
As you can see below, hifiasm-meta produces much larger assemblies.

I mapped on the PacBio assemblies Illumina paired-end reads obtained from the same samples.
Even if the assemblies of hifiasm_meta are much larger, the proportion of mapped reads only increases slightly.
In addition, the proportion of reads aligned exactly 1 time is much lower.
This suggests that hifiasm-meta produces redundant assemblies.
What do you think?

Thanks for your help,
Florian

Donor 1

| | metaFlye | hifiasm_meta |
|---|---|---|
| assembly size (bp) | 596 522 308 | 831 187 874 |
| # contigs | 9 253 | 15 586 |
| N50 (bp) | 164 736 | 132 052 |
| % Illumina reads aligned concordantly exactly 1 time | 50.79 | 39.45 |
| % Illumina reads aligned concordantly > 1 time | 23.50 | 38.31 |
| % Illumina reads aligned concordantly | 74.29 | 77.76 |

Donor 2

| | metaFlye | hifiasm_meta |
|---|---|---|
| assembly size (bp) | 264 656 715 | 551 812 461 |
| # contigs | 3 836 | 17 080 |
| N50 (bp) | 243 801 | 44 732 |
| % Illumina reads aligned concordantly exactly 1 time | 55.28 | 20.34 |
| % Illumina reads aligned concordantly > 1 time | 33.15 | 74.26 |
| % Illumina reads aligned concordantly | 88.43 | 94.60 |
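To put the two tables side by side, here is a small sketch that compares how much larger the hifiasm_meta assemblies are with how much the mapping rate changes. The numbers are copied from the tables above; the dictionary layout and the `compare` helper are only illustrative names.

```python
# Values copied from the Donor 1 and Donor 2 tables above.
# Each tuple: (assembly size in bp, % reads aligned concordantly,
#              % reads aligned concordantly exactly 1 time)
stats = {
    "donor1": {
        "metaFlye": (596_522_308, 74.29, 50.79),
        "hifiasm_meta": (831_187_874, 77.76, 39.45),
    },
    "donor2": {
        "metaFlye": (264_656_715, 88.43, 55.28),
        "hifiasm_meta": (551_812_461, 94.60, 20.34),
    },
}


def compare(donor):
    """Return (size ratio, gain in mapped %, change in uniquely mapped %)
    for hifiasm_meta relative to metaFlye."""
    flye = stats[donor]["metaFlye"]
    hifi = stats[donor]["hifiasm_meta"]
    size_ratio = hifi[0] / flye[0]
    mapped_gain = hifi[1] - flye[1]
    unique_change = hifi[2] - flye[2]
    return size_ratio, mapped_gain, unique_change


for donor in stats:
    ratio, gain, unique = compare(donor)
    print(f"{donor}: size x{ratio:.2f}, mapped {gain:+.2f} pts, "
          f"uniquely mapped {unique:+.2f} pts")
```

The hifiasm_meta assemblies come out roughly 1.4 times (Donor 1) and 2.1 times (Donor 2) larger, while the overall mapping rate rises by only a few percentage points and the uniquely mapped fraction drops.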

I think long contigs (>1 Mb) should be non-redundant for both metaFlye and hifiasm-meta. We measured this by pairwise Mash distance. For long circular contigs, the Mash distance is usually no less than 0.01. (Edit: by pairwise, I mean pairwise within an assembly, not comparing metaFlye and hifiasm-meta.)
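For readers unfamiliar with the metric, here is a minimal sketch of how a Mash-style distance between two contigs can be computed. It is a simplification and not what the `mash` tool does internally: it uses the exact Jaccard index over all canonical k-mers rather than a MinHash sketch, so it is only practical for small inputs. The function names and the choice of k = 21 are assumptions made for illustration.

```python
import math


def canonical_kmers(seq, k=21):
    """Set of canonical k-mers: for each k-mer, keep the lexicographically
    smaller of the k-mer and its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N etc.
            rc = kmer.translate(comp)[::-1]
            kmers.add(min(kmer, rc))
    return kmers


def mash_distance(seq_a, seq_b, k=21):
    """Mash distance D = -ln(2j / (1 + j)) / k, where j is the Jaccard
    index of the two k-mer sets (exact here, estimated by MinHash in Mash)."""
    a = canonical_kmers(seq_a, k)
    b = canonical_kmers(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("sequences are shorter than k")
    j = len(a & b) / len(union)
    if j == 0:
        return 1.0  # no shared k-mers: maximal distance
    if j == 1:
        return 0.0  # identical k-mer sets
    return -math.log(2 * j / (1 + j)) / k
```

Under this measure, a within-assembly contig pair with a distance well below 0.01 would be a candidate for redundancy and worth inspecting.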

For shorter contigs, the redundancy depends on phasing. Phased contigs still share sequences, and I guess the shared stretches are often longer than short reads (SR). I think measuring the real duplication rate with SR alignment is difficult.

If the length distribution of the hifi reads is wide, it is possible that hifiasm-meta produces some redundant contigs due to contained read rescue. But this should be rare.

By the way, what is the size of the two libraries? The second hifiasm-meta assembly looks fragmented; do you have the median or N50 read QS?

lh3 commented

Hifiasm-meta sometimes may completely separate strains with a couple of percent divergence. Many Illumina reads would be multiply mapped to such strains.

Below are the sequencing statistics

| | Donor 1 | Donor 2 |
|---|---|---|
| # reads | 1,645,079 | 1,591,198 |
| read length (bp) (Q1; median; Q3) | 5,936; 7,620; 9,635 | 6,515; 8,626; 10,872 |
| cumulative length (bp) | 13,012,430,198 | 14,075,271,142 |

Donor 2 had a much lower alpha diversity. One species is very abundant and overwhelms the others.

Edit: median QV is 40.

Thanks, the read length distribution seems wide and also on the shorter side for HiFi. Donor 2 is harder; hopefully there are still a few long contigs coming out.

I agree with Heng's take about redundancy.

I mapped the PacBio reads back to their corresponding assemblies and computed error rates with samtools stats, keeping primary alignments only.

There is a slight difference between hifiasm_meta and metaFlye for donor1, but the metaFlye error rate is much higher for donor2.

| | hifiasm_meta | metaFlye |
|---|---|---|
| donor1 | 5.349535e-03 | 7.391400e-03 |
| donor2 | 3.940063e-03 | 1.588192e-02 |

This indeed suggests that metaFlye collapsed closely related abundant strains.
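For anyone wanting to reproduce this kind of comparison, here is a rough sketch of pulling the error rate out of `samtools stats` output. It assumes the stats were produced with something like `samtools stats -F 0x900 reads_vs_assembly.bam` (excluding secondary and supplementary alignments), which is one plausible way to keep primary alignments only and not necessarily the command used above. The file name and the `parse_error_rate` helper are hypothetical.

```python
def parse_error_rate(stats_text):
    """Return the 'error rate' value (mismatches / bases mapped) from the
    summary-numbers (SN) section of samtools stats output."""
    for line in stats_text.splitlines():
        if line.startswith("SN\terror rate:"):
            # Fields are tab-separated: "SN", "error rate:", value, comment
            return float(line.split("\t")[2])
    raise ValueError("no 'error rate' line found in samtools stats output")


# Example with a hand-written excerpt mimicking the SN section.
example = (
    "SN\tbases mapped (cigar):\t1000000\t# more accurate\n"
    "SN\terror rate:\t5.349535e-03\t# mismatches / bases mapped (cigar)\n"
)
print(parse_error_rate(example))
```

From the table, the metaFlye error rate is about 1.4 times the hifiasm_meta value for donor1 and about 4 times for donor2.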

About donor2, you might want to check the assembly graph with Bandage. For hifiasm-meta, see prefix.p_ctg.noseq.gfa. For metaFlye, see assembly_graph.gfa (although note that this is the path graph, which is not the same as hifiasm(-meta)'s).

My guess is that you will find a long chain-like visualization, formed by sparsely connected short contigs, for that abundant species. metaFlye might have longer contigs and a less tangled graph (if not disconnected). Which of them is more useful would depend on what you want from the assembly, in my opinion.
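Before opening the graph in Bandage, a quick text-level summary can hint at whether it is made of many short, sparsely connected contigs. The sketch below assumes a GFA1 file in which `S` lines carry an `LN:i:` length tag when the sequence is `*` (as in a "noseq" graph) and `L` lines describe links. The `summarize_gfa` name and the returned fields are illustrative choices.

```python
from collections import defaultdict


def summarize_gfa(lines):
    """Count segments and links in a GFA1 graph and report how many
    segments have no link at all, plus the total segment length."""
    lengths = {}
    degree = defaultdict(int)
    n_links = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            name, seq = fields[1], fields[2]
            length = None
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    length = int(tag[5:])
            if length is None and seq != "*":
                length = len(seq)  # fall back to the sequence itself
            lengths[name] = length
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            n_links += 1
            degree[fields[1]] += 1
            degree[fields[3]] += 1
    isolated = [name for name in lengths if degree.get(name, 0) == 0]
    total = sum(length for length in lengths.values() if length is not None)
    return {
        "segments": len(lengths),
        "links": n_links,
        "isolated_segments": len(isolated),
        "total_length": total,
    }


# Tiny hand-made graph: two linked segments and one isolated segment.
toy = [
    "S\tctg1\t*\tLN:i:5000\n",
    "S\tctg2\t*\tLN:i:3000\n",
    "S\tctg3\tACGTACGT\n",
    "L\tctg1\t+\tctg2\t+\t0M\n",
]
print(summarize_gfa(toy))
```

With a real file, the lines could be passed in with `summarize_gfa(open(path))`. A high link count together with short segment lengths would be consistent with the chain-like structure described above.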

If donor1 and donor2 are very similar, you may also consider trying a co-assembly.