Smoothing methylation data can produce extreme values that distort visualization
rauldiul opened this issue · 4 comments
Hi,
thank you for your work in general, and this amazing package in particular.
I have a question/comment regarding the use of smooth = TRUE within the normalizeToMatrix() function. I have seen that you usually recommend it for methylation data (such as your example in the package vignette), and indeed it improves visualization quite a bit. However, I have noticed that in some cases the smoothing can produce very large values which fall well outside the biological range of [0, 1] for methylation.
While this can sometimes be ignored, it can also bias the resulting plot. In the image below I show an example of methylation data for some gene clusters at their TSSs. The left plot shows the "raw" (not smoothed) data and the middle plot shows the smoothed data. You can see in the middle plot how smoothing pushes the line plots upwards, generating an artificially large effect size. This effect can be prevented by capping the smoothed values to the [0, 1] interval (in this case, setting all values > 1 in the matrix to 1). The resulting plot (right plot) looks very much like the raw plot (left plot).
(To give you a sense of how many points are affected: in this example around 1.3% of the points in the smoothed matrix go >1, with a maximum value of 110.)
Do you think that the capping strategy is a reasonable approach to be used routinely for any dataset? That is, to restrict the output of the smoothing to the min and max of the original raw values (maybe even for non-methylation data).
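A minimal sketch of this capping, assuming mat_smoothed is the matrix returned by normalizeToMatrix() with smooth = TRUE (the object name is just illustrative):

## Clamp smoothed values back into the biological range of the raw data
mat_capped <- mat_smoothed
mat_capped[mat_capped > 1] <- 1
mat_capped[mat_capped < 0] <- 0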
thanks for your help!
PS: the code used for the matrix normalization was this one (changing smooth to TRUE or FALSE):
normalizeToMatrix(signal = x,
                  target = tss.gn.goi,
                  value_column = "meth",
                  extend = 1000,
                  mean_mode = "absolute",
                  background = NA,
                  smooth = FALSE,
                  w = 50)
Thanks for your exploration!
I think the main reason is that, as shown in the upstream 1 kb and downstream 1 kb regions, the values are more sparsely distributed, which provides less information for the row-wise smoothing.
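One way to see this sparsity directly (a sketch, assuming mat_raw is the un-smoothed matrix from the PS code above, where background = NA marks windows with no covered CpGs):

## Fraction of rows with no signal in each 50 bp window across the extended region;
## the flanking windows would be expected to show a higher NA fraction than
## the windows at the TSS, i.e. fewer data points for the smoothing to use.
na_frac <- colMeans(is.na(mat_raw))
round(na_frac, 2)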
Setting a limit for smoothing is a great idea! I will implement it soon.
Thank you for the quick feedback! Yes, it seems related to the sparsity of values. In this case the data has good coverage (the input contains around 18 million CpGs with information), so the issue will probably arise with any WGBS data, given the genomic sparsity of CpGs in general.
Great, thanks for your work!
Now I have added a new argument, limit, for this purpose. By default, if all signal values are within [0, 1], it automatically applies the limit [0, 1] to the smoothed values.
mat <- normalizeToMatrix(
    signal = signal,
    target = target,
    extend = 1000,
    value_column = "meth",
    include_target = TRUE,
    mean_mode = "absolute",
    smooth = TRUE,
    limit = c(min(signal$meth), max(signal$meth))
)
Would this cause any damage?
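As a quick sanity check, assuming mat and signal are the objects from the call above, the smoothed values should now stay within the range of the raw methylation values:

## With `limit` applied, smoothing should no longer overshoot the raw range
range(mat, na.rm = TRUE)
range(signal$meth, na.rm = TRUE)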