Filtering: Possible bug and a question

Question

Filtering: Possible bug and a question

Opened this issue 3 years ago · 0 comments

Hi,

(1) I think the paper Methods and the actual code disagree in calculating the Ms score. In the paper the text reads

... we calculate a score, Ms, for each putative microexon as Ms =1 − (1 − PsPU2)/n, where PU2 is the probability that the observed U2 score came from the Gaussian with the higher mean and n is the number of matches for a given intron.

The corresponding calculation in the code seems to be (from https://github.com/hemberg-lab/MicroExonator/blob/master/src/final_filters3.R, lines 85-95)

fit_U2_score <- normalmixEM(ME_matches_filter$U2_score, maxit = 10000, epsilon = 1e-05)
#ggplot_mix_comps(fit_U2_score, "Mixture model Micro-exon >=3 after coverge filter")
post.df <- as.data.frame(cbind(x = fit_U2_score$x, fit_U2_score$posterior))

#ME_final <- ME_centric_raw[ME %in% uniq_seq_filter & len_micro_exon_seq_found>=3, ]
if(fit_U2_score$mu[1]<=fit_U2_score$mu[2]){
  ME_final$ME_P_value <-  1 - (1 - approx(post.df$x, post.df$comp.1, ME_final$U2_scores)$y * ME_final$P_MEs) / ME_final$total_number_of_micro_exons_matches
} else {
  ME_final$ME_P_value <-  1 - (1 - approx(post.df$x, post.df$comp.2, ME_final$U2_scores)$y * ME_final$P_MEs)/ ME_final$total_number_of_micro_exons_matches
}

So if ME_final$ME_P_value should be identified with Ms the code calculates 1 − (1 − Ps(1-PU2))/n and not 1 − (1 − PsPU2)/n defined in the paper (because it takes the probability of the LOWER mean component, and not HIGHER mean.)

Or should Ms actually be 1 - ME_final$ME_P_value? In that case it will not match the paper either, I think ...

(2) In general, I am not sure I can understand the logic behind the expression for the Ms score in the paper. High scores are likely to indicate true microexons, and by definition Ms = 1 − (1 − PsPu2)/n = 1 - 1/n + PsPu2/n. Wouldn't you want Ms decrease with increasing Ps? The formula is the opposite. Also, if I understood correctly, n is the actual number of microexon+splice sites exact matches in the given intron sequence. I would expect Ms to grow for lower n (n>0), i.e., decrease with n increasing, But the formula is the opposite again. Shouldn't it be something like (1-Ps)Pu2/n? Or if you take it as 1 - ME_final$ME_P_value from the code it will be [1-(1-Pu2)Ps]/n, which also would make sense to me. Am I missing the point completely?

Thank you