kdkorthauer/dmrseq

Using dmrseq with other organisms and in a different methylation context.

tmms1 commented

Hi @kdkorthauer

First of all, congratulations on the development of this impressive method and the clear documentation.

The vignette states that dmrseq can be used on data from organisms other than human and mouse, and that it can also handle methylation contexts besides CpG. However, one should check the scalability and whether the assumptions regarding spatial correlation hold. Could you elaborate on this statement? What are the assumptions regarding spatial correlation in human and mouse? How can you check them? Which correlation should be computed (Pearson, Spearman, ...)?

Thanks.
Tim

Dear Tim,

Great question. Happy to provide more pointers about applying dmrseq to other organisms and cytosine contexts.

Regarding the assumptions of spatial correlation, I am referring to the autocorrelation we typically observe in human and mouse. Specifically, we see that neighboring cytosines tend to have similar methylation levels (as opposed to being random or anti-correlated). You can check this by either (1) visualizing the methylation tracks in a genome browser or plotting region-level methylation values (looking for smoothly varying levels), and/or (2) using something like the acf function in R (looking for positive correlation values that decrease with increasing lag).
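As a rough illustration of check (2), the sample autocorrelation can be sketched in a few lines. This is written in plain Python for self-containment; it uses the standard estimator that R's acf function reports, but it is not a substitute for it. The methylation proportions below are made-up values for loci ordered by genomic position, and the lag is counted in loci, not base pairs.

```python
def autocorrelation(values, max_lag):
    """Sample autocorrelation at lags 0..max_lag (lag counted in loci)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


# Hypothetical methylation proportions at consecutive cytosines (illustrative only).
methylation = [0.9, 0.85, 0.8, 0.7, 0.5, 0.3, 0.2, 0.15,
               0.1, 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.9]

acf_values = autocorrelation(methylation, max_lag=3)
for lag, value in enumerate(acf_values):
    print(f"lag {lag}: {value:.3f}")
```

For smoothly varying data like this, the values are positive and shrink as the lag grows, which is the pattern to look for. Values near zero or negative at lag 1 would suggest the spatial correlation assumption does not hold.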

Regarding scalability, I am referring to the number of sites you are considering. With larger genomes, or when expanding to non-CpG cytosines, the number of sites under consideration can become very large. This doesn't violate assumptions of the model, but may cause you to run into computational limits in terms of memory/time. If you find that to be the case, there are things you can do to mitigate the issue, such as working on small subsets (e.g. one chromosome) at a time.

Hope this helps.

Best,
Keegan

tmms1 commented

Dear Keegan

Thanks for your reply.

I will try using the acf function in R and generate some graphs.
I will use that function on my own data and also on the data you used for benchmarking. Could you give some pointers beforehand about how to pick smoothing parameters based on these results, or can I post the graphs here?

Thanks in advance.

--
With kind regards
Tim

Hi Tim,

I'm afraid I haven't found a precise way to choose smoothing parameters based on acf results. However, I would recommend that if you do not see the pattern I described above (positive correlation values that decrease with increasing lag), it may not make sense to smooth.

For choosing smoothing parameters, it can be helpful to assess the results after running with a chosen set of smoothing parameters. For example, if you find many regions that are very close together, this could indicate they are actually from one larger region, and smoothing more might improve power to detect those larger regions.
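One way to put a number on "many regions that are very close together" is to measure the gaps between adjacent detected regions. The sketch below is only an illustration: the function name, the (start, end) tuples, and the 100 bp threshold are all hypothetical and not part of dmrseq, and it assumes non-overlapping regions on a single chromosome.

```python
def close_neighbor_fraction(regions, max_gap):
    """Fraction of adjacent region pairs separated by at most max_gap bp.

    regions: list of (start, end) tuples on one chromosome, non-overlapping.
    """
    ordered = sorted(regions)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    return sum(g <= max_gap for g in gaps) / len(gaps)


# Hypothetical regions: two tight pairs separated by a long stretch.
example_regions = [(100, 300), (350, 600), (5000, 5400), (5450, 5900)]
fraction = close_neighbor_fraction(example_regions, max_gap=100)
print(f"{fraction:.2f} of adjacent pairs are within 100 bp")
```

A high fraction would hint that neighbouring calls belong to one larger region and that more smoothing may be worth trying. What counts as "high" or "close" depends on the organism and context.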

You are welcome to post graphs here, and I can try to provide as much advice as I can.

Best,
Keegan

tmms1 commented

Dear Keegan

My best wishes for the new year!

I applied the pointers you gave to my data (Arabidopsis).

Attached are the graphs from running the acf function on one sample for each context. I think this shows that spatial correlation is also present in Arabidopsis regardless of the context of the cytosine. Is this correct?
(I made the same graphs for two additional samples and the results were similar).

I tried running dmrseq with the default parameters on one chromosome, comparing three groups. This gave the following results:

  • CpG: 49 regions (none significant; significant was defined as q-value less than or equal to 0.05)
  • CHG: 34 regions (none significant)
  • CHH: 6 regions (none significant)

I also tried it with bpSpan set to 500. I opted to reduce bpSpan because neighbouring regions were well spaced. This gave the following results:

  • CpG: 217 regions (none significant)
  • CHG: 223 regions (none significant)
  • CHH: 28 regions (none significant)

Is it normal that tweaking one parameter has such a large effect on the number of detected regions?
Do the results "prove" that dmrseq with default parameters is a good way to analyze the data?

Thanks for all your effort.

--
With kind regards
Tim

[Attached image: spatial_correlation_acf]

Dear Tim,

Please see my responses below.

> I think this shows that spatial correlation is also present in Arabidopsis regardless of the context of the cytosine. Is this correct?

Yes, the plots show that autocorrelation is present.

> Is it normal that tweaking one parameter has such a large effect on the number of detected regions?

It is expected that changing the smoothing parameters will alter the number of detected regions. In particular, reducing the bpSpan will result in less smoothing. Less smoothing will generally result in breaking up regions into a larger number of smaller regions.
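A toy example can make this concrete. The sketch below is not dmrseq's algorithm: it uses a simple moving average as a stand-in for the smoother, a hypothetical cutoff, and made-up per-locus statistics. It only shows how a narrower smoothing window can split one candidate region into two.

```python
def smooth(positions, values, span):
    """Moving average over all loci within span/2 bp of each locus."""
    half = span / 2
    smoothed = []
    for p in positions:
        window = [v for q, v in zip(positions, values) if abs(q - p) <= half]
        smoothed.append(sum(window) / len(window))
    return smoothed


def count_regions(smoothed, cutoff):
    """Count runs of consecutive loci whose smoothed value exceeds cutoff."""
    count = 0
    inside = False
    for s in smoothed:
        if s > cutoff and not inside:
            count += 1
        inside = s > cutoff
    return count


# Hypothetical per-locus statistics: two bumps separated by a single low locus.
positions = [i * 100 for i in range(11)]
values = [0, 0, 3, 3, 3, 0, 3, 3, 3, 0, 0]

wide = count_regions(smooth(positions, values, span=300), cutoff=1.5)
narrow = count_regions(smooth(positions, values, span=100), cutoff=1.5)
print(f"wide span: {wide} region(s), narrow span: {narrow} region(s)")
```

With the wider window the dip between the two bumps is averaged away and a single region is called. With the narrower window the dip survives and two regions are called.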

> Do the results "prove" that dmrseq with default parameters is a good way to analyze the data?

I wouldn't say that these results "prove" that default parameters are good, as there is no single 'correct' way to analyze data. But from what you have presented I don't see anything that suggests you aren't taking a sensible approach. If neighbouring regions were still well-spaced with the reduced smoothing parameters, that setting could work just as well. Otherwise, the default setting might be more desirable.

Hope that helps!

Best,
Keegan

tmms1 commented

Dear Keegan

Thank you very much for answering my questions and giving those pointers.

--
With kind regards
Tim