hifiasm-meta produces redundant assemblies?
Hello,
I performed de novo assembly on two human faecal metagenomes sequenced with PacBio Sequel II.
I tested metaFlye (2.9-b1768) and hifiasm-meta (v0.2.1).
As you can see below, hifiasm-meta produces much larger assemblies.
I mapped Illumina paired-end reads obtained from the same samples onto the PacBio assemblies.
Even though the hifiasm_meta assemblies are much larger, the proportion of mapped reads only increases slightly.
In addition, the proportion of reads aligned exactly 1 time is much lower.
This suggests that hifiasm-meta produces redundant assemblies.
What do you think?
Thanks for your help,
Florian
Donor 1
| | metaFlye | hifiasm_meta |
|---|---|---|
| assembly size (bp) | 596 522 308 | 831 187 874 |
| # contigs | 9 253 | 15 586 |
| N50 (bp) | 164 736 | 132 052 |
| % Illumina reads aligned concordantly exactly 1 time | 50.79 | 39.45 |
| % Illumina reads aligned concordantly > 1 time | 23.50 | 38.31 |
| % Illumina reads aligned concordantly | 74.29 | 77.76 |
Donor 2
| | metaFlye | hifiasm_meta |
|---|---|---|
| assembly size (bp) | 264 656 715 | 551 812 461 |
| # contigs | 3 836 | 17 080 |
| N50 (bp) | 243 801 | 44 732 |
| % Illumina reads aligned concordantly exactly 1 time | 55.28 | 20.34 |
| % Illumina reads aligned concordantly > 1 time | 33.15 | 74.26 |
| % Illumina reads aligned concordantly | 88.43 | 94.6 |
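To make the comparison in the two tables explicit, here is a minimal Python sketch that contrasts the growth in assembly size with the gain in mapped reads. The numbers are copied from the tables; the dictionary layout and the function name `compare` are only illustrative.

```python
# Values copied from the Donor 1 and Donor 2 tables; nothing else is assumed.
donors = {
    "donor1": {
        "metaFlye": {"size": 596_522_308, "unique": 50.79, "multi": 23.50},
        "hifiasm_meta": {"size": 831_187_874, "unique": 39.45, "multi": 38.31},
    },
    "donor2": {
        "metaFlye": {"size": 264_656_715, "unique": 55.28, "multi": 33.15},
        "hifiasm_meta": {"size": 551_812_461, "unique": 20.34, "multi": 74.26},
    },
}


def compare(stats):
    """Return (size ratio, gain in mapped reads in percentage points,
    share of hifiasm_meta-mapped reads that are multi-mapped)."""
    flye, hifi = stats["metaFlye"], stats["hifiasm_meta"]
    size_ratio = hifi["size"] / flye["size"]
    mapped_flye = flye["unique"] + flye["multi"]
    mapped_hifi = hifi["unique"] + hifi["multi"]
    multi_share = hifi["multi"] / mapped_hifi
    return size_ratio, mapped_hifi - mapped_flye, multi_share


for name, stats in donors.items():
    ratio, gain, share = compare(stats)
    print(f"{name}: size x{ratio:.2f}, mapped +{gain:.2f} points, "
          f"multi-mapped share {share:.2f}")
```

For Donor 1 the assembly grows about 1.4-fold for a gain of 3.47 percentage points in mapped reads; for Donor 2 it grows about 2.1-fold for a gain of 6.17 points.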
I think long contigs (>1 Mb) should be non-redundant for both metaFlye and hifiasm-meta. We measured this by pairwise Mash distance. For long circular contigs, the Mash distance is usually no less than 0.01. (Edit: by pairwise, I mean pairwise within an assembly, not comparing metaFlye and hifiasm-meta.)
For shorter contigs, the redundancy depends on phasing. Phased contigs still share sequences, and I guess the shared stretches are often longer than short reads (SR). I think measuring the real duplication rate with SR alignment is difficult.
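As a rough illustration of the within-assembly pairwise check described here, the sketch below computes a Mash-style distance, D = -ln(2j / (1 + j)) / k, from the exact Jaccard index j of canonical k-mer sets. This is a simplification: the Mash tool estimates j from MinHash sketches rather than from full k-mer sets, and the function names, k = 21, and the way the 0.01 cutoff is applied are assumptions made for the example, not the procedure actually used.

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k=21):
    """Set of canonical k-mers: the smaller of each k-mer and its reverse complement."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):  # skip k-mers containing N or other symbols
            continue
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, revcomp))
    return kmers


def mash_like_distance(seq_a, seq_b, k=21):
    """Mash-style distance from the exact Jaccard index (no MinHash sketching)."""
    a, b = canonical_kmers(seq_a, k), canonical_kmers(seq_b, k)
    union = a | b
    if not union:
        return 1.0
    j = len(a & b) / len(union)
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def redundant_pairs(contigs, threshold=0.01, k=21):
    """contigs: dict of name -> sequence. Return pairs closer than `threshold`."""
    names = sorted(contigs)
    hits = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            d = mash_like_distance(contigs[x], contigs[y], k)
            if d < threshold:
                hits.append((x, y, d))
    return hits
```

Calling `redundant_pairs` on a dictionary of long contigs from a single assembly would list the pairs that fall below 0.01. Exact k-mer sets get memory-hungry for megabase-scale contigs, which is one reason to use sketches in practice.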
If the length distribution of the hifi reads is wide, it is possible that hifiasm-meta produces some redundant contigs due to contained read rescue. But this should be rare.
By the way, what is the size of the two libraries? The second hifiasm-meta assembly looks fragmented; do you have the median or N50 read QS?
hifiasm-meta may sometimes completely separate strains with a couple of percent divergence. Many Illumina reads would then map to more than one of these strains.
Below are the sequencing statistics:
| | Donor 1 | Donor 2 |
|---|---|---|
| # reads | 1,645,079 | 1,591,198 |
| read length (bp) (Q1; median; Q3) | 5,936; 7,620; 9,635 | 6,515; 8,626; 10,872 |
| cumulative length (bp) | 13,012,430,198 | 14,075,271,142 |
Donor 2 had a much lower alpha diversity. One species is very abundant and overwhelms the others.
Edit: median QV is 40.
Thanks, the read length distribution seems wide and also on the shorter side for HiFi. Donor 2 is harder; hopefully there are still a few long contigs coming out.
I agree with Heng's take about redundancy.
I mapped the PacBio reads back to their corresponding assemblies and computed error rates with samtools stats, keeping primary alignments only.
There is a slight difference between hifiasm_meta and metaFlye for donor1, but the metaFlye error rate is much higher for donor2.
| | hifiasm_meta | metaFlye |
|---|---|---|
| donor1 | 5.349535e-03 | 7.391400e-03 |
| donor2 | 3.940063e-03 | 1.588192e-02 |
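For anyone wanting to reproduce this kind of number, here is a minimal Python sketch that pulls the error rate out of the summary ("SN") section of samtools stats output. The exact command used above is not given; one plausible way to restrict to primary alignments is `samtools stats -F 0x900 reads_vs_assembly.bam > stats.txt`, which drops secondary and supplementary records (the file names are hypothetical). The sample text below is made up for illustration.

```python
# Illustrative excerpt in the layout of the "SN" section of samtools stats;
# the numbers are invented for the example, not taken from the real runs.
SAMPLE = (
    "# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.\n"
    "SN\tbases mapped (cigar):\t1000000\t# more accurate\n"
    "SN\tmismatches:\t5350\t# from NM fields\n"
    "SN\terror rate:\t5.350000e-03\t# mismatches / bases mapped (cigar)\n"
)


def parse_summary_numbers(stats_text):
    """Collect the tab-separated SN lines into a {name: value-string} dict."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue
        fields = line.split("\t")
        summary[fields[1].rstrip(":")] = fields[2]
    return summary


def error_rate(stats_text):
    """Error rate as reported by samtools stats: mismatches / bases mapped (cigar)."""
    return float(parse_summary_numbers(stats_text)["error rate"])


print(error_rate(SAMPLE))
```

Because this rate is mismatches over mapped bases, reads from a strain that was collapsed into a close relative would contribute extra mismatches, which is consistent with the interpretation of the donor2 values.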
This indeed suggests that metaFlye collapsed closely related abundant strains.
About donor2, you might want to check the assembly graph with Bandage. For hifiasm-meta, see prefix.p_ctg.noseq.gfa. For metaFlye, see assembly_graph.gfa (although note that this is the path graph, not the same as hifiasm(-meta)'s).
My guess is that you will find a long chain-like visualization, formed by sparsely connected short contigs, for that abundant species. metaFlye might have longer contigs and a less tangled graph (if not a disconnected one). Which of them is more useful would depend on what you want from the assembly, in my opinion.
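Alongside the visual check in Bandage, a quick script can summarize how the contigs are connected. The sketch below reads GFA segment (S) and link (L) lines and groups segments into connected components, ignoring orientation. It assumes that no-sequence GFA files carry the segment length in an `LN:i:` tag; the function names are illustrative.

```python
from collections import defaultdict


def parse_gfa(lines):
    """Return ({segment: length or None}, {segment: set of linked segments})."""
    lengths = {}
    neighbours = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            name, seq = fields[1], fields[2]
            length = len(seq) if seq != "*" else None
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):  # length tag, used when seq is '*'
                    length = int(tag[5:])
            lengths[name] = length
        elif fields[0] == "L":
            a, b = fields[1], fields[3]  # orientations in fields[2] and [4] are ignored
            neighbours[a].add(b)
            neighbours[b].add(a)
    return lengths, neighbours


def connected_components(lengths, neighbours):
    """Group segments linked directly or indirectly; largest component first."""
    seen, components = set(), []
    for start in lengths:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)
```

A component with many short segments, most of them having only one or two neighbours, would correspond to the chain-like picture described above. Usage would be along the lines of `parse_gfa(open("prefix.p_ctg.noseq.gfa"))`.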
If donor1 and donor2 are very similar, you may also consider trying a co-assembly.