arangrhie/merfin

Adjusted QV values are very low and not improved by merfin

ASLeonard opened this issue · 30 comments

I was using the --hist feature, and came across 2 questions. The first is that the adjusted QV looks very low, but may be reasonably as the calculation just involves adding the overcopy kmers to the missing ones.

The bigger question is that the adjusted QV doesn't change after using merfin, although the merqury QV does (yay!). I've included a subrange of the hist output, but it looks like merfin pushes everything down, going past 0. In the extreme case, the lowest pre-polish was -8.4 and -25.6 after, so it does this likely mean under-represented variations are being chosen too often?

asm

K-mers not found in reads (missing) : 22751
K-mers overly represented in assembly: 208843260.00
K-mers found in the assembly: 3170334467
Merqury QV: 64.66
Adjusted QV: 24.89

The [-2,2] range is

-2.0    1309089
-1.8    592517
-1.6    1014281
-1.4    1305191
-1.2    5581550
-1.0    30997025
-0.8    19795362
-0.6    85256699
-0.4    268855226
-0.2    350358386
0.0     2372053119
0.2     14067234
0.4     6891564
0.6     1342089
0.8     783304
1.0     6148793
1.2     365881
1.4     245394
1.6     271201
1.8     149874
2.0     312288

I then did the same on the assembly I had polished with deepvariant+merfin. This was done before some of the new changes, so was called with bcftools consensus -H 2 as in the docs at the time.

K-mers not found in reads (missing) : 4124
K-mers overly represented in assembly: 209126554.00
K-mers found in the assembly: 3170649177
Merqury QV: 72.08
Adjusted QV: 24.89

and the same subrange

-2.0    1401373
-1.8    627265
-1.6    1074023
-1.4    1381121
-1.2    5639493
-1.0    30930969
-0.8    19649484
-0.6    85134989
-0.4    268863845
-0.2    353061409
0.0     2369464081
0.2     14059266
0.4     6922802
0.6     1347577
0.8     788057
1.0     6163117
1.2     384571
1.4     250586
1.6     288123
1.8     145490
2.0     320921

Hi @ASLeonard,

It seems like the k-mers overly represented are relatively high.
This means, you have extra copies of stuff in your assembly, possibly false duplications.

Have you ran Merqury? That might visually show if those are in the easy-to-purge out 1cp or 2cp region.
The spectra-cn plot should show the additional copies if these exist.

Although Merfin tries to adjust and correct errors, it can't remove high copy k-mers randomly in the assembly.
Also, it's also possible that there weren't any variants called in those regions.

Asking out of curiosity. Have you turned on the -by-kstar option in your previous run?

I just used the default example style, so I believe -by-kstar was not used. I'm re-running it again now as that is now default to on. The unpolished spectra-cn plot has a slight peak in 2 copy kmers.
asm asm contigs spectra-cn ln

I'll compare the merqury plot for the post-polish assembly and see how that looks.

I reran merfin with the new default of -by-kstar. The docs don't seem to have been updated on the best way to call the consensus, so the first way was similar to the scripts version of using bcftools norm && consensus -Hla

K-mers not found in reads (missing) : 18076
K-mers overly represented in assembly: 207357287.00
K-mers found in the assembly: 3170377356
Merqury QV: 65.66
Adjusted QV: 24.93
-2.0    1280993
-1.8    592376
-1.6    1003399
-1.4    1296594
-1.2    5575365
-1.0    30528744
-0.8    19733274
-0.6    85136000
-0.4    268178192
-0.2    336854096
0.0     2387440960
0.2     13872455
0.4     6755631
0.6     1321815
0.8     785120
1.0     6076214
1.2     361292
1.4     235258
1.6     279971
1.8     142036
2.0     305801

The other was one that is now commented in the code, using consensus -H 1

K-mers not found in reads (missing) : 18074
K-mers overly represented in assembly: 207357279.00
K-mers found in the assembly: 3170377356
Merqury QV: 65.66
Adjusted QV: 24.93
-2.0    1280969
-1.8    592376
-1.6    1003399
-1.4    1296594
-1.2    5575365
-1.0    30528770
-0.8    19733274
-0.6    85136000
-0.4    268178192
-0.2    336854096
0.0     2387440995
0.2     13872385
0.4     6755666
0.6     1321815
0.8     785120
1.0     6076214
1.2     361292
1.4     235258
1.6     279971
1.8     142036
2.0     305801

Overall, they seem very similar, so it doesn't seem like there is much difference between them.

Interestingly, there is minimal Merqury QV improvement, but now there is a a tiny (0.04) improvement in merfin QV which there wasn't before. The 0 bin also improves more, so it definitely seems like there is more positive polishing happening here, but just has a small impact.

Your merqury spectra-cn plot seems already very clean, with little base error or false duplication, so I wouldn't worry too much on the QV* values.

We saw lower QV* values in more complete genomes, which we found so far was from the sequencing biases in high copy kmers. When your assembly was correctly assembled, but ex. the GC biases in Illumina reads had dropouts in those regions, kmer multiplicity found in the read set will significantly lower than what should be found in the true genome.
Also, errors in highly repetitive kmers might also accidentally turns that kmer into a true kmer, that should be found elsewhere in the genome.

In either case, all excessive copy numbers found in the assembly are counted as errors.

We are in discussion regarding how to handle this. I would recommend to use Merqury QV until we finalize the QV*. The only difference is that QV from Merqury does not account for expected copy numbers and uses only what was wasn't found in the read set as 'error' and computes the QV.

gf777 commented

hi @ASLeonard,

my only addition to @arangrhie's last comment is asking whether you have generated and provided the lookup table. Kmer multiplicity estimates can also be off because of heterozygosity, and that should mitigate the issue.

That said, you are right, we have no way of controlling when you are adding too many kmers of the same kind that used to be underrepresented, the calls should be plausible to start with. This would imply 2 quite complicated things (the second one far more complicated than the first one): 1) updating the kmer counts in the assembly every time a variant is applied; 2) evaluating the best combination of variants whenever two or more variants produce the same kmer consequences. Only the latter would guarantee that your polishing is a monotonic function.

Yeah I'm sure it nontrivial to account for why there are over/under copies. I was naively hoping the k* hist would be some sort of nice distribution and then QV could be penalised by its variance, but no luck.
merfin_khist

@gf777 I haven't been able to get the lookup script working. I couldn't find any clear docs on what the hist input is meant to be to lookup.R. Is one of the outputs of mequry able to be used for that?

gf777 commented

Hi @ASLeonard ,

you're are right, plotting this is not trivial :-)
Yet, your histogram is similar to others I have seen before. Obviously, when you log transform the y-axis, you come to appreciate the tails, however this usually misses the point that the big difference is at 0.0. In your case I think 2387440995-2369464081=17976914. These are actually a lot of overrepresented kmers gone. The problem is that having so many overrepresented kmers (207357279) makes 18M a very small number in comparison (and the QV* estimate still being low and not increasing in a really perceptible way). My impression is that these about 207357279/21=9874156~10M bases represent a combination of duplications that you wouldn't be able to see in the spectra plots being comparatively small, plus sequencing bias (plus the bias from the naive estimator used if a lookup table is not provided, see below). Obviously you cannot correct duplications with polishing (and also cannot correct most of the right tail of the distribution, mostly represented by missing repeats).

I have added a more informative description to the inputs of lookup.R here:

https://github.com/arangrhie/merfin/tree/master/scripts/lookup_table

That said the hist is just the regular 2-column tsv used in Genomescope2. You could post the plots generated too. As a curiosity, which genome are we looking at?

Best,

Giulio

This is an assembly for a bovine animal (~2.7gb).

After rerunning with the lookup table generated by the script, something strange seems to have happened. The merqury QV dropped, and many more kmers now show as missing. The only difference here was passing -lookup lookup_table.txt after generating with lookup.R readmer_meryl_hist.tsv 21 . 1. I reran the analysis without it, and got the same results as posted previously, so it definitely wasn't a file corruption error or anything like that.

K-mers not found in reads (missing) : 1997777
K-mers overly represented in assembly: 206650300.06
K-mers found in the assembly: 3170377356
Merqury QV: 45.23
Adjusted QV: 24.90

The summary.txt is

GenomeScope version 2.0
k = 21

property                      min               max               
Heterozygosity                0.185265%         0.185612%         
Genome Haploid Length         NA bp             NA bp             
Genome Repeat Length          1,258,411,017 bp  1,258,469,102 bp  
Genome Unique Length          1,823,192,280 bp  1,823,276,434 bp  
Model Fit                     97.2226%          98.9707%          
Read Error Rate               NA%               NA%    

The generated plots
fitted_hist.pdf

gf777 commented

two quick comments:

  1. did you run the latest version? I udpated it yesterday as there was a bug affecting some cases.
  2. Since I assume this is a diploid species, you should use 2 instead of 1 as final parameter. As you can see in the plot otherwise you fit the diploid peak to 1. Can you try that?

Re: merqury QV, actually this is expected, however I realize that still calling it merqury QV is misleading. Merqury QV is only based on missing kmers, so for instance a 1x kmer in the reads is sufficient not to flag the kmer as error in the assembly. This obviously inflates the QV. When the lookup table kicks in, the qv provided is based on the kmer distribution, and each kmer is weigthed for its frequency so low copy kmers also contribute to the qv estimate (but not overcopy kmers).

gf777 commented

I have added details on how the various QVs are computed in the readme.

I reran the -vmer and -hist using the updated lookup script with the 2 parameter. I was previously hesitant to use the diploid version as some of the summary stats appeared broken, but I haven't used genomescope before so maybe this is okay.

property                      min               max               
Heterozygosity                -33.1441%         66.0161%          
Genome Haploid Length         NA bp             NA bp             
Genome Repeat Length          570,618,150 bp    570,642,614 bp    
Genome Unique Length          937,162,626 bp    937,202,804 bp    
Model Fit                     97.0323%          98.0589%          
Read Error Rate               NA%               NA%     

The new results are below for the hist summary. It looks like after using the lookup table, there are now fewer kmers found in the assembly and more found overrepresented/missing.

K-mers not found in reads (missing) : 46473078
K-mers overly represented in assembly: 208331189.48
K-mers found in the assembly: 3170276472
Merqury QV: 31.53
Adjusted QV: 24.00
gf777 commented

Hi @ASLeonard, thanks. Could you share also the fitted hist, prob and lookup table (and maybe your hist)?

One important point to make here is that so far we have been testing Merfin on pseudohaplotypes and trios, where the K* computation is reasonable as they tend to respect the agreement between the reads and the assembly. This could suggest that it is advisable to -disable-kstar for haploid representations of diploid assemblies.

I have the trio for this assembly, so I'll try again on the haplotype resolved assembles rather than the collapsed primary and see if that is different.

In the mean time here are the fits, table, plots and hist.
lookups.tar.gz

gf777 commented

The trio is a much better alternative!

Re: lookup, this is still fitting the large peak at 1, and therefore now making an even worse job. It's an issue I have already noticed in 1-2 cases (still wondering if it's due to genomescope2 or my modifications to it). I will play a bit around it and let you know.

gf777 commented

So the same quick and dirty fix I used in the past works for your cow as well. In general I think it's due to some weaknesses in Genomescope modelling (Merfin is using an hybrid between v1 and v2) when coverage is kind of low (this is something they mention in Genomescope2 paper). Indeed, Genomescope does not fit even in the online version with your histogram: http://qb.cshl.edu/genomescope/analysis.php?code=xRgTCVSgYx8EI3Aqroou. The fix is simple, just forcing Genomescope to perform only 2 iterations instead of the default 4 to avoid overfitting. Check the results in attached, they look as expected now. I will ping Shatz's team to see if there is an easy fix to this.

cow.lookup.zip

I edited the lookup script to perform 2 iterations, recovering the results you shared.

I ran the same steps on both haplotypes in the trio, using the canu binned reads as readmers and the lookup script using haploid mode.

hap1

K-mers not found in reads (missing) : 92972690 -> 92862901
K-mers overly represented in assembly: 818504749.17 -> 818758437.76
K-mers found in the assembly: 3212470322 ->3212469264
Merqury QV: 28.55 -> 28.55
Adjusted QV: 18.02 -> 18.02

hap2

K-mers not found in reads (missing) : 135050878 -> 135002228
K-mers overly represented in assembly: 804244208.75 -> 804339410.70
K-mers found in the assembly: 3184657693 -> 3184658016
Merqury QV: 26.86 -> 26.86
Adjusted QV: 17.82 -> 17.82

I initially reran for hap1 just using the full set of reads and the full lookup_table in diploid mode. The values compared to the polished version are pretty identical, but are higher. I found that odd, as there fewer kmers missing despite having an entire haplotype of "un"correlated kmers.

K-mers not found in reads (missing) : 3734346
K-mers overly represented in assembly: 254279050.41
K-mers found in the assembly: 3212470322
Merqury QV: 42.57
Adjusted QV: 24.00
gf777 commented

Hello @ASLeonard ,

this is very interesting, thanks for sharing. In general, I have tested so far only non-trio (pseudo-haplotypes) samples, and what I see is that with the lookup table the corrected qvs are always consistent (when you polish you get a better QV, either corrected merqury or adjusted), but obviously the theoretical assumptions are very important here. What I currently evaluated is the two pseudo-haplotypes combined.

We are actually testing on a cattle trio right now as well. I hope we will be able to report and compare results soon.

gf777 commented

Hi @ASLeonard ,

If you have time are you willing to try the new version of merfin? It should have fixed some bugs relevant to this discussion. We also have a -completeness option too now.

Thanks in advance for the feedback!

The parellelisation is great, much faster than before, so huge thumbs up. A few initial comments

I am using your fork of gs2. The parameters need to be explicit when called, i.e. -i {input} -k {kmer} -o {output} etc, which isn't in the readme. I am also using ploidy of 2 for the trio-binned assemblies, as using ploidy of 1 had poorly fitted hists.

The completeness op currently just prints to stdout, the method for reportCompleteness is still unimplemented. The results again seem oddly low for both the polished and unpolished assemblies. Merqury estimates 93% completeness for the pre-merfin assembly.

TOTAL readK:     3564116228.00 -> 3564116228.00
TOTAL undrcpy:    1310196571.00000 -> 1309631327.00000
COMPLETENESS:    0.63239 -> 0.63255

For the QV, it appears semi-consistent with previous results, although more of a drop for Missing QV compared to merfin QV.

K-mers not found in reads (missing) : 163851630 -> 163388569
K-mers overly represented in assembly: 792760479.88 -> 792638058.73
K-mers found in the assembly: 3212470322 -> 3212454354
Missing QV: 26.04 -> 26.05
Merfin QV*: 17.77 -> 17.78

I just ran it again with a resolved haplotype from a different trio, and similar results.

Completeness of around 58% (c.f. Merqury of 91%), with Missing QV 37.3, Merfin QV 23.5 (c.f. Merqury QV 45), with essentially no change after polishing.

gf777 commented

Hi @ASLeonard thanks for the prompt feedback!

So the completeness implemented in Merfin differs from Merqury for what I think is a good reason: while Merqury will consider a kmer as present if it's present at least once in the assembly, regardless of the number of copies that would be expected from the raw reads, Merfin takes into account for all kmers the number of copies that should be present given their frequency in the read set. In this sense it is in principle much more accurate as real estimate of completeness. Imagine a highly collapsed repeat: for Merqury it would look complete, Merfin will understand that many copies are missing.

While the situation with trios is still confusing, I can definitely say that you should provide both assemblies for this analysis (or expect no more than 50% completeness in a perfect scenario). What is still suspicious to me is why your total readK is 3.5G. Since this is a cow the total readK should be about 6G (any chance you have used the wrong peak, i.e. kcov*2?). Can you share the plot from gs2 (btw thanks I updated the readme)?

Here is the fitted hist for the other haplotype in the new trio. I think it looks okay, but I've rarely used gs2.
fitted_hist

Regarding the wrong peak, that is quite likely. I didn't update it for the new trio with different coverage, and probably was using the wrong value anyway. For clarity, if we have 30x coverage, the estimated peaks for a trio be around 15 each, and this is the supplied value for -peak? It may not be straightforward, but is there any chance part of the output of your custom gs2 would be a single peak value that could be fed into merfin? Would make pipelines much easier.

I'll try re-running this trio with better -peak value and update then.

gf777 commented

cool, yes the plot looks good, and yes you should use the haploid peak and combine the assemblies. The haploid peak value is actually already provided by genomescope, it's called kcov in the regular plot, and it's also present in the summary.txt of the model. Let's see how it goes!

Currently kcov is in model.txt as a line like kmercov 1.102e+01 4.689e-03 2350.708 < 2e-16 ***. Since merfin takes double values for -peak, it is pretty easy to scrape it automatically in a pipeline,

It is looking like this may be more down to user error (sorry!), results look much better with peak value taken from gs2. I didn't fully understand your comment on combing the assmblies though. The readmers I am using are from the trio-binned data. Are you suggesting combining the two assemblies into one fasta file, and then using the total sequencing data?

For hap1

K-mers not found in reads (missing) : 2612387 -> 2565487
K-mers overly represented in assembly: 2251736.52 -> 2226712.42
K-mers found in the assembly: 2952031211 -> 2952027890
Missing QV: 43.75 -> 43.83
Merfin QV*: 41.05 -> 41.11
TOTAL readK:     5908431579.00 
TOTAL undrcpy:    2961305728.00000 -> 2961237327.00000
COMPLETENESS:             0.49880 -> 0.49881

The results are similar for hap2, but slightly lower QVs, 37.33 Missing and 37.03 Merfin. Interesting, these two QVs are very close compared to hap1. Also interestingly, the Merqury QVs for these haplotypes are 46.2 and 45.6, so relatively close, while the Missing/Merfin values are much larger in difference.

gf777 commented

Hi @ASLeonard,

that's good. Yes since the raw data contains both haplotypes, if you combine the two fasta into one for the completeness evaluation you harness the full potentially of the K* as it will be asking how many of the total kmers expected from the raw reads (TOTAL readK) you see in the full diploid assembly. On the other hand, consider that combining the haplotypes may somewhat impact negatively the QV* because if you include only one haplotype you have more degree of freedom. Imagine a kmer present only once in the genome (twice considering both haplotypes): if one of your assemblies erroneously containes two copies of it, and the other contains correctly one copy, you have 3 copies instead of 2. However, when you run Merfin with each assembly indipendently they will both be allowed up to 2 copies.

To be clear, I was previously using binned data (via triocanu), so the readmers were generated from the specific reads for each haplotype. To check the other way, I just ran the merged.fasta with the total raw reads, and got the following results.

K-mers not found in reads (missing) : 4138025
K-mers overly represented in assembly: 146399838.50
K-mers found in the assembly: 6022148488
Missing QV: 44.85
Merfin QV*: 29.19

TOTAL readK:     6137932937.00
TOTAL undrcpy:    280346024.00000
COMPLETENESS:             0.95433

The completeness appears to have behaved quite well, as well as the Missing QV. However, it looks like there are many more over-represented kmers in the merged.fasta compared with adding the over-rep kmers from each haplotype, by about a factor of 50. This looks to penalise Merfin QV substantially. Not sure if this is related to the diversity of kmers between the two assemblies causing issues with expected copy number, but the fitted plot looks okay.
fitted_hist

gf777 commented

hi @ASLeonard,

as I am sure you understood by now it all comes down to correctly estimating the expected copy number given the frequencies in the raw reads. Honestly we also see similar values of QV*. We suspect these are due to the cumulative effect of sequencing bias on high-copy kmers. Any chance you have looked into the -dump output? This will give you have very good indication as to where the supposedly extra copy stuff is

Just got around to trying the -dump option. Most regions are fairly centred around 1 with small fluctuations, but chromosome ends tend to spike. Here is a slice from one chromosome, where the the first 4mb are identified are centromeric repeats with k* = 2. There is a section within this structure with lower kstar, and repeatmasker found predicted a lower score/higher divergence for the centromeric repeat.
image

There is an inverse example on a different chromosome, where a more rare centromeric sequence has a 500kb region of k* = .25 at the start. The unplaced contigs also had far more extreme variation.

I think overall this indicates merfin is doing its job, but previous discuss we had about repetitive sequence applies again. Maybe it would be useful to use merfin on repeat-masked sequence. Alternatively, this is a much faster way of identifying possible regions of repeat sequence, which might be useful to preprocess repeatmasker etc.

gf777 commented

Hi @ASLeonard ,

your considerations are very meaningful. We had also imagined to exclude repeats for some K* related computations, particularly the QV. It is clear now that specific regions/sequences have very specific read generation rates. For instance, while your K* of 2 for the centromere may indicate somewhat collapsed repeats, it appears from the T2T experience that it is rather dependent on these different read generation rates (at least it could be a combination of both). It turns out that the general distribution of kmer frequencies (e.g. the one we use for genomescope) is the pileup of these very specific rates, rather than a random average. This is true for both illumina and hifi kmers, but particularly evident in the latter, most likely as a result of it being a continuous recording. We therefore have the impression (though not formally tested), that many false positive errors identified by the k* when computing QV averages are the result of these specific rates piling up in high-frequency kmers. Should this be the case, we would not need to mask the genome but simply outvote them internally. One obvious, yet untested, way would be to simply reduce the probability of something being an error at increasing readK.

This is pretty outdated now after changes to merfin, and the latest version seems to resolve most of the issues anyway.
Thanks for all the discussion.