arangrhie/merfin

diploid assembly, haploid reads

Closed this issue · 4 comments

Questions: 1. In analyzing an assembly made from diploid DNA using Illumina reads from haploid DNA (a meiotic product from the diploid individual), should ploidy be set to one or two for Merfin? 2. If uncollapsed haplotypes seem likely to be present, is it better to try to purge duplicated regions before polishing, or after? 3. Is it possible that Genomescope2 might underestimate genome size if the genome contains large amounts of repetitive sequences?
Details: I am using Merfin to assess the quality of a long-read assembly made by wtdbg2/redbean from about 25x coverage of a mix of ONT PromethION and PacBio RSII reads. The long-read libraries and assembly were made from diploid DNA, but the Illumina reads I'm using for kmer comparisons were made from a haploid meiotic product from the diploid parent. I take this to mean that the Illumina reads should lack kmers found in the assembly, but any kmers present in the reads and not in the assembly should be either sequencing errors or assembly errors. The size of the wtdbg2 assembly is about 40% larger than the expected haploid genome size of 22.5 Gb, but the Genomescope2 analysis of the (haploid) reads and (diploid) assembly using the diploid setting (-p 2) estimated genome size around 6.5 Gb. Genomescope2_transformed_linear_plot.png.
Using Merfin -hist with -sequence <diploid.assembly.fa> -readmers <haploid.reads.meryl> -peak 46 -prob <Genomescope2.lookup.table> -output merfin.hist produces output that is skewed toward the negative, and the stderr output confirms that distribution.
K-mers not found in reads (missing) : 13,284,222,107
K-mers overly represented in assembly: 4,018,646,355.17
K-mers found in the assembly: 32,079,612,419
Plotting the merfin.hist table in R, and zooming in on the region from -50 to 50 on the x-axis shows there are three different traces, as though the assembly is a mix of three different fractions with different K* distributions. One hypothesis is that the PromethION and PacBio RSII reads give rise to regions with different properties, or perhaps non-collapsed haplotypes that are redundant. I haven't tried polishing this assembly yet, so I don't know if that might help.
merfin.hist_plot.png

gf777 commented

Hi @rwhetten

What a big genome!

"I take this to mean that the Illumina reads should lack kmers found in the assembly, but any kmers present in the reads and not in the assembly should be either sequencing errors or assembly errors." YES

  1. if your read set represents an haploid genome, I think ideally you'd want to try to separate the two haplotypes as much as possible (i.e. have a haploid representation of the genome). You will still have issues with kmers from SNPs and such as the two haplotypes will be scrambled, but at least the diploid fraction will be represented faithfully.

  2. Even if you were to purge the genome, you'd normally combine primary and alternate assemblies into one when running Merfin. Here the complication is that the reads are haploid. So I guess 1) applies, and you'd want to purge first

  3. Yes, definitely. I suppose that this big genome is due to some massive stuff repeated over and over. Check the gs2 plot in log scale, all that is essentially not modelled by gs2

K-mers not found in reads (missing) --> this is usually evidence of either false duplications or retained haplotypes. Have you tried to generate Merqury spectra plots with the illumina data? Not sure what to expect given the haploid nature of your illumina data but could be helpful

Thanks for the quick reply! I had run the Genomescope2 modeling with -m 100 because the interesting peaks are at lower multiplicity - repeating that analysis with -m 200 million gave a size estimate of 17 Gb - not quite correct, but closer than before. There are a lot of kmers with multiplicity > 1 million.
A Merqury spectra_cn plot of the diploid wtdbg2 assembly compared to the haploid Illumina reads shows kmers at the same multiplicity in the reads with varying copy number in the assembly, from 0 to 4. I take this to mean the assembly is missing some non-error sequences present in the Illumina data, while some sequences present once in the Illumina data are present 2 to 4 times in the assembly. I'm running a minimap2 alignment of the long reads to the assembly for use in purging, but everything takes a long time with this size genome.
specitra-cn_plot_image

gf777 commented

That's cool. Yes, I'd say the results with gs2 are expected, but why are you limiting kmer cov? It sounds like an example where you don't want to do that.

"I take this to mean the assembly is missing some non-error sequences present in the Illumina data, while some sequences present once in the Illumina data are present 2 to 4 times in the assembly. " YES

What is interesting to me is that the different copies are all at the same kmer multiplicity. I'd say that with haploid reads you expect both the haploid and diploid fraction of the genome to be at haploid coverage, however 4 copy and above will be seen 2, 3, 4 etc number of times even in a haploid read set. So basically all green purple orange and some of the blue is false duplications..

@gf777 I included the -m option because I thought it was required - 200 million is larger than the highest multiplicity, so the results from genomescope2 are the same without an upper limit.
Thanks for your guidance in interpreting the output plots. I'll try some different approaches to purging the false duplications.