Filtering: Possible bug and a question
Opened this issue · 0 comments
Hi,
(1) I think the paper Methods and the actual code disagree in calculating the Ms
score. In the paper the text reads
... we calculate a score, Ms
, for each putative microexon as Ms =1 − (1 − PsPU2)/n
, where PU2
is the probability that the observed U2 score came from the Gaussian with the higher mean and n
is the number of matches for a given intron.
The corresponding calculation in the code seems to be (from https://github.com/hemberg-lab/MicroExonator/blob/master/src/final_filters3.R, lines 85-95)
fit_U2_score <- normalmixEM(ME_matches_filter$U2_score, maxit = 10000, epsilon = 1e-05)
#ggplot_mix_comps(fit_U2_score, "Mixture model Micro-exon >=3 after coverge filter")
post.df <- as.data.frame(cbind(x = fit_U2_score$x, fit_U2_score$posterior))
#ME_final <- ME_centric_raw[ME %in% uniq_seq_filter & len_micro_exon_seq_found>=3, ]
if(fit_U2_score$mu[1]<=fit_U2_score$mu[2]){
ME_final$ME_P_value <- 1 - (1 - approx(post.df$x, post.df$comp.1, ME_final$U2_scores)$y * ME_final$P_MEs) / ME_final$total_number_of_micro_exons_matches
} else {
ME_final$ME_P_value <- 1 - (1 - approx(post.df$x, post.df$comp.2, ME_final$U2_scores)$y * ME_final$P_MEs)/ ME_final$total_number_of_micro_exons_matches
}
So if ME_final$ME_P_value
should be identified with Ms
the code calculates 1 − (1 − Ps(1-PU2))/n
and not 1 − (1 − PsPU2)/n
defined in the paper (because it takes the probability of the LOWER mean component, and not HIGHER mean.)
Or should Ms
actually be 1 - ME_final$ME_P_value
? In that case it will not match the paper either, I think ...
(2) In general, I am not sure I can understand the logic behind the expression for the Ms
score in the paper. High scores are likely to indicate true microexons, and by definition Ms = 1 − (1 − PsPu2)/n = 1 - 1/n + PsPu2/n
. Wouldn't you want Ms
decrease with increasing Ps
? The formula is the opposite. Also, if I understood correctly, n
is the actual number of microexon+splice sites exact matches in the given intron sequence. I would expect Ms
to grow for lower n
(n>0), i.e., decrease with n
increasing, But the formula is the opposite again. Shouldn't it be something like (1-Ps)Pu2/n
? Or if you take it as 1 - ME_final$ME_P_value
from the code it will be [1-(1-Pu2)Ps]/n
, which also would make sense to me. Am I missing the point completely?
Thank you