Using Megalodon files as input
dipannita-g opened this issue · 36 comments
Hi again!
I was trying to use Methplotlib to see if I can plot my Megalodon data for 6mA and 5mC. Simply converting my sorted BAM output using samtools to CRAM format and using that as an input is giving me the following error:
$ methplotlib -m barcode1.cram -n barcode01 -w 7e24d142adc2427d_1 -f reference.fasta -o outtrial1
Traceback (most recent call last):
File "/usr/local/bin/methplotlib", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/methplotlib/methplotlib.py", line 35, in main
binary=args.binary,
File "/usr/local/lib/python3.7/dist-packages/methplotlib/methplotlib.py", line 52, in meth_browser
meth_traces = plots.methylation(meth_data, dotsize=dotsize, binary=binary)
File "/usr/local/lib/python3.7/dist-packages/methplotlib/plots.py", line 121, in methylation
dotsize=dotsize)
File "/usr/local/lib/python3.7/dist-packages/methplotlib/plots.py", line 148, in make_per_read_meth_traces_phred
table = table.join(df_heights, on="read_name")
File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 8108, in join
other, on=on, how=how, lsuffix=lsuffix, rsuffix=rsuffix, sort=sort
File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 8140, in _join_compat
sort=sort,
File "/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/merge.py", line 87, in merge
validate=validate,
File "/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/merge.py", line 672, in __init__
self._maybe_coerce_merge_keys()
File "/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/merge.py", line 1193, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
I am very new to data analysis and I am unsure if this error is due to some issue with my input files/command syntax or if Megalodon outputs are not compatible with Methplotlib.
This looks like a bug in my code... And if it isn't a bug I would be happy to make methplotlib compatible. Do you think it would be possible to share the cram file?
Thanks,
Wouter
Hi Wouter!
Yes, I can share the CRAM file. I will share a Google drive link to the file in an email.
Thanks a lot for your time!
Hello Wouter,
Is there any update on this? I am also interested.
Thanks a lot in advance.
Regards,
Paula
Hi Paula,
Do you mean you got the same error? I hope to find time this week...
Wouter
Hello Wouter,
Maybe we can continue the thread on: all context methylation visualization #29 :)
Thank you,
Paula
Hi @dipannita-g, I am very confused. Maybe @marcus1487 will have any insight here?
The cram file you shared with me (and maybe could share with Marcus too?) is behaving oddly and it hurts my brain. It could just be Friday afternoon, but I have no idea what's going on here.
samtools idxstats tells me that there are no aligned or unaligned reads in your file, but that disagrees with samtools flagstat and simple observation using samtools view.
Converting the file to bam and back to cram puts 32890416 reads as unaligned in idxstats, but the output of samtools view looks equivalent.
Converting the sorted_mod_mappings.bam you also shared to cram (with -T for the reference.fasta) puts 26885680 reads as unaligned in idxstats.
For all three crams flagstat reports 17656 reads mapped - which looks okay to me.
Hmm tagging @marcus1487 again. I noticed that the bam file uses Mm
and Ml
tags, with a small second letter. I thought the specifications were using all caps for these tags?
samtools/hts-specs#362 (but maybe there are other threads relevant here). Methplotlib currently expects MM and MP tags. Of course, changing that in methplotlib is not too crazy :-)
I see Mm and Ml pop up in samtools/hts-specs#418...
Hi Wouter,
Thanks a lot for taking a look at this! I have no clue why the discrepancy is happening when using Samtools idxstats. Flagstat also gives me the same number of mapped reads.
I have also uploaded the original bam file which Megalodon produced, mod_mappings.bam, in the folder, if that helps somehow.
I am sharing the files with @marcus1487 as well.
Thanks again!
Dipannita
Hi @wdecoster,
It is a bit misleading. Originally MM/MP were proposed but it drifted away to Mm/Ml as implemented in Megalodon (and Guppy).
Would Methplotlib be able to cope with multiple modifications encoded on the Mm field ?
Soon! How do the Ml map to the Mm in case of multiple modifications?
Small correction on the Mm/Ml. Once it is accepted in the spec it will become MM/ML, So better make the tag parsing case insensitive.
Good news for multiple mods support. The proposed spec actually describes how several mods can be stored in the Mm tag but it is arguably a bit hard to follow. Essentially, you can concatenate multiple modifications in the Mm tag like that
Mm:Z:C+m,5,12;C+h,5,12;
With likelihood scores in the same order in the Ml tag.
Right, but without a ;
separator in the likelihood scores?
I noticed in the bam file I'm currently looking at that it's just one array of integers, with as many likelihoods as the positions for all modifications together.
The Mm tag still stores the distances between positions for that specific base with the modification, right?
Yes, there is no ;
separator in the Ml/ML scores. I think this is a requirement of the B
type in the BAM spec to allow more efficient storage as a single array. If you read the field directly from the bam/cram it would just be an array with no separators.
Yes the fields for different mods are all structured the same way and can treated independently, but the order matters in order to access the scores in the Ml/ML tag.
I forgot to mention before, but the documentation for the new proposed modification tags lives here https://github.com/jkbonfield/hts-specs/blob/7fafbdf65291da9377d9e17736c4e4bb06fea9a6/SAMtags.tex#L477
And here is a PDF version : SAMtags.pdf
Question for @marcus1487 or @a-slide: do you have a suggestion of minimal cut-off of Ml for megalodon? I see the output includes a probability of C+m for all Cs in the sequence, but most have just a very low probability, I guess, and it doesn't make much sense to show them all does it?
I have just pushed a bunch of code that should enable megalodon cram/bam format in methplotlib v0.19.0, and I am looking forward to your feedback.
I would very much appreciate it if @marcus1487 or @a-slide would have the time to take a look at my code for handling the Mm/Ml tags, the relevant bits are in this function. I have commented quite heavily, as it is (at least to me) not the most trivial task to map the number-of-bases-to-skip to genomic coordinates.
Hi Wouter,
I had a look at your code and it seems to be parsing the delta encoding mods correctly. It is indeed not trivial, but this encoding was chosen for compactness reasons (I personally think a simple offset relative to the start would have been better).
I still need to test it and compare it to the recently implemented equivalent functionality in IGV. If you want to have a look at real data, I put together a small human chr20 5mC toy dataset there https://nanoporetech.box.com/s/82pnw3lhusfs93s0vj7azxiz4x7kxabo
For the cut-off, @marcus1487 will know better, but I suspect it might be different depending on the model used, so maybe leave it to the users to decide where to put the bar. Or just use a transparent to solid color scale
Aha thanks, I'll have a look at the testdata.
Yes, the cutoff is an option the user can specify, but I want to set a sensible default. A problem is that plotting a dot for every C and every A gets quite dense and a heavy html. The current implementation omits points below a cutoff and uses a colorscale for the rest
I agree that a cutoff would be very useful. In megalodon the cutoff is 1%, but this is a command line parameter with the default to output as much of the data as is reasonable. This is likely too low a cutoff for visualization. In megalodon a read is only included in the per-site aggregated "modified" base count when the per-read probability of a modification is greater than 0.8. So to give a bit of wiggle room I think a default of 0.6-0.7 would be a reasonable default.
Another relevant issue here is calibration of these scores. Megalodon calibrates the modified base scores to match emperical data with a ground truth. Guppy produced calls on the other hand are not calibrated. So the shading/color scale may need to be adjusted for different callers.
For the code, a couple minor comments:
- on line 273
offset = len(deltas)
should beoffset += len(deltas)
(won't cause bug now as we're not calling three mods at once... yet) - Not sure on this one, but should
basemod
bemod
on line 280? Or are you using the canonical base later in processing?
Otherwise this looks great!
P.S. I agree that get_forward_sequence
result is a bit odd. And +1 for using all the pysam commands!
Megalodon input works great for me, using the mod_mappings.bam. However I don't like the color. It's just a light beige to dark red palette. Is there a way to specify it to go from blue (unmethylated) to red (methylated) as I see when using nanopolish output?
Hi @dithiii thanks for the feedback! I'm not sure about your request.
The difference with nanopolish is that it symmetrically has a log-likelihood ratio to indicate if a position is more likely modified or more likely unmodified. Megalodon in contrast has 'just' the probability that a position is modified.
This problem also relates to setting a cut-off for the likelihood score that was discussed above, and I could (instead of hiding those) mark the positions below the cut-off with a single shade of blue.
@marcus1487 Thanks for the correction on line 273 - that would have been a nasty bug in the future.
Line 280 just ends up as annotation in the graph and better spells out the modification the user is seeing in that panel, so basemod (C+m) makes a bit more sense than just mod (m).
You mention the cut-off in percent above - and what is in the cram/bam is then Phred scaled?
Ah, it's worth noting that the Ml
tag is not required according to the spec -- it's possible to only have Mm
. That shouldn't happen from either megalodon or guppy output as far as I know, but it is a possibility.
Aha, great, I'll make some changes to avoid crashing on that.
Hello,
Maybe the colours are light beige to dark red, and not "down" to blue, because the megalodon probabilities are shown on a natural log, whether the log-likehood from nanopolish are log 10, therefore the spread in range probabilities (and so colours) is wider?
Thank you all for the work,
I am going to try it out!
Best,
Paula
I'm a little confused on how to show different modifications. In my case, I am using a mod_mappings.bam that I sorted and loaded into methplotlib. It has calls for both 5mC and 5hmC. But the colors on the output are limited to simply "likely modified". Ideas on how to deal with this? I'd love different colors ranges for mods.
In the meantime though, I would just like to know how to filter a bam file for a specific mod so I can make separate plots for each of the two mods. Help?
The separation of the 5mC and 5hmC calls is a bit complicated. For example given probabilities of 5mC=0.2 and 5hmC=0.3 and C=0.5 in a read at a particular site, I see three "logical" ways to output this call into a file with only C and 5mC. a) 5mC=0.5 C=0.5 b) 5mC=0.2 C=0.8 and c) 5mC=0.35 C=0.65. The problem is that given probabilities of all three the conversion to the other 2 is ambiguous and any of the above 3 options could be argued for with valid points.
For this reason I think if only one modified base is required that should be modeled explicitly or the modified bases should be considered together. To that point the 5mC model is likely to represent option a
above since 5hmC is collapsed in alternative bisulfite data used to train that model (see megalodon training data prep here).
One additional point not really related to methplotlib, but which may be presenting as an issue here is that the 5mC+5hmC nanopore model currently generally produces weak 5hmC calls. This is due to issues with our current training data where 5hmC training data is likely not completely 5hmC where we expect it. We are working to improve this in a later version of this model. In this context though this would present as the 5hmC mods showing as very light in almost all cases, so it may be that the 5hmC calls are simply hard to see given the lower probabilities.
Thanks Marcus, I see the issue. I called bases using the rerio 5mC+5hmC model and honestly I found it to work very well. The genome is mitochondria and I saw unrealistically high 5mC when calling with the 5mC-allcontext model. It called way more 5mC than is really there. Nanopolish called much lower, but still had a respectable amount. Then I called again with the 5mC+5hmC model and found essentially zero 5mC and much more 5hmC. Those results actually make much more biological sense to me (since I believe that most mtDNA 5mC is not real anyway, and any 5hmC would be misread as 5mC using bisulfite methods).
So I give your model more credit than you do. I think it's doing a great job on calling 5hmC in my case! My remaining problem is how to visualize zero 5mC and low but consistent levels of 5hmC. That's why I wanted to split the bam file. I see that it's not easy. I'm thinking of alternatives.
@dithiii That is great to hear! Glad you are seeing good results! I guess I should turn up my optimism for that model in its current state.
If there is a valid use case and reasonable expected behavior for a splitting method, I would be open to adding a command for this to megalodon. I think option c
is the most reasonable. It would just produce very weird results where the other modified base was high probability. For example with C=0.002 5hmC=0.008 5mC=0.99, it seems a bit weird to report C=0.2 and 5hmC=0.8 probabilities, but maybe this is the most logical thing to do (it feels a bit hacky).
"For example with C=0.002 5hmC=0.008 5mC=0.99, it seems a bit weird to report C=0.2 and 5hmC=0.8 probabilities, but maybe this is the most logical thing to do (it feels a bit hacky)."
Right tricky question, I don't think that's a good way to report it, assuming someone asks "is it C or 5hmC" at that position, I don't think it's a logical thing to report "well, it's 0.8 likely to be 5hmC over the 0.2 probability of C" when the model really thinks it's 99% likely to be 5mC. It's not a reliable report. That being said, I don't know how to resolve that issue. Personally I would rather it just say, "it's 0.002 likely to be C and 0.008 likely to be 5hmC" and leave the user scratching their head and asking "I wonder what the other 99% call is" and hope they ask the right question.
Yes. That would be more ideal though still problematic, but this is unfortunately a limitation of the file format. It is assumed that the probabilities would sum to 1. So the probability not found in modified bases is assumed to be the unmodified form.
Hello,
I have tried to use methplotlib from megalodon BAMs.
-
for 5mC
-
for 5mC and 5hmC
I think the format is not right...
Please find below the command I used and the plot I am getting.
methplotlib -m hipox_sub_25.cram -n calls -w chr1:3746460-3771645 -f GRCh38.primary_assembly.genome.fa -g Homo_sapiens.GRCh38.90.chr.gtf
methplotlib -m 5hmc_hip_sub.sorted.cram -n calls -w chr1:3746460-3771645 -f GRCh38.primary_assembly.genome.fa -g Homo_sapiens.GRCh38.90.chr.gtf
Can you spot what is the problem?
You feedback would be very much appreciated,
Thank you in advance,
Paula
Ehm I don't directly have an idea, do you think it would be possible to share the dataset that causes this?
Thanks,
Wouter
Good morning Wouter,
Please find the dataset in the email I just sent you.
Thank you so much for having a look into this.
Very much appreciated,
Kind regards,
Paula