Smoothing methylation data can produce extreme values that distort visualization
rauldiul opened this issue · 4 comments
Hi,
thank you for your work in general, and this amazing package in particular.
I have a question/comment regarding the use of smooth = TRUE within the normalizeToMatrix() function. I have seen that you usually recommend it for methylation data (such as your example in the package vignette), and indeed it improves visualization quite a bit. However, I have noticed that in some cases the smoothing can produce very large values which fall well outside the biological range of [0, 1] for methylation.
While this can sometimes be ignored, it can also bias the resulting plot. In the image below I show an example of methylation data for some gene clusters at their TSSs. The left plot shows the "raw" (not smoothed) data and the middle plot shows the smoothed data. You can see in the middle plot how smoothing pushes the line plots upwards, generating an artificially large effect size. This effect can be prevented by capping the smoothed values to the [0, 1] interval (in this case, setting all values > 1 in the matrix to 1). The resulting plot (right plot) looks very much like the raw plot (left plot).
(To give you a sense of how many points are affected: in this example around 1.3% of the points in the smoothed matrix go >1, with a maximum value of 110.)
Do you think that the capping strategy is a reasonable approach to be used routinely for any dataset? That is, to restrict the output of the smoothing to the min and max of the original raw values (maybe even for non-methylation data).
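A minimal sketch of this capping, assuming mat_smoothed is the matrix returned by normalizeToMatrix() with smooth = TRUE (the object name is just illustrative):

## Clamp smoothed values back into the biological range of the raw data
mat_capped <- mat_smoothed
mat_capped[mat_capped > 1] <- 1
mat_capped[mat_capped < 0] <- 0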
thanks for your help!
PS: the code used for the matrix normalization was this one (changing smooth to TRUE or FALSE):
normalizeToMatrix(signal = x,
                  target = tss.gn.goi,
                  value_column = "meth",
                  extend = 1000,
                  mean_mode = "absolute",
                  background = NA,
                  smooth = FALSE,
                  w = 50)
Thanks for your exploration!
I think the main reason is that, as shown in the upstream 1 kb and downstream 1 kb regions, the values are more sparsely distributed, which provides less information for the row-wise smoothing.
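One way to see this sparsity directly (a sketch, assuming mat_raw is the un-smoothed matrix from the PS code above, where background = NA marks windows with no covered CpGs):

## Fraction of rows with no signal in each 50 bp window across the extended region;
## the flanking windows would be expected to show a higher NA fraction than
## the windows at the TSS, i.e. fewer data points for the smoothing to use.
na_frac <- colMeans(is.na(mat_raw))
round(na_frac, 2)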
Setting a limit for smoothing is a great idea! I will implement it soon.
Thank you for the quick feedback! Yes, it seems related to the sparsity of values. In this case the data has good coverage (the input contains around 18 million CpGs with information), so the issue will probably arise with any WGBS data, given the genomic sparsity of CpGs in general.
Great, thanks for your work!
Now I have added a new argument, limit, for this purpose. By default, if all signal values are within [0, 1], it automatically applies the limit [0, 1] to the smoothed values.
mat <- normalizeToMatrix(
    signal = signal,
    target = target,
    extend = 1000,
    value_column = "meth",
    include_target = TRUE,
    mean_mode = "absolute",
    smooth = TRUE,
    limit = c(min(signal$meth), max(signal$meth))
)
Would this cause any damage?
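As a quick sanity check, assuming mat and signal are the objects from the call above, the smoothed values should now stay within the range of the raw methylation values:

## With `limit` applied, smoothing should no longer overshoot the raw range
range(mat, na.rm = TRUE)
range(signal$meth, na.rm = TRUE)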