I digged into the code to see why I cannot reproduce stats::wilcox.test() result
mvfki opened this issue · 0 comments
Hi team,
Thanks for the incredibly fast implementation. I have been trying to clone the repo and learn from the code. When I was trying to extend it from a default two-sided hypothesis to a one-sided hypothesis, I found something potentially mistaken.
For the hypothesis extension, it's rather simple by switching a correction value and pnorm()
call in compute_pval()
, referring to R's implementation. However, when I'm testing it with a sparse matrix where one of the features/genes is all zero, I found the result pretty odd -- It got a <1 p-value! So I digged into the code and suspected the ties
returned from Rcpp side lacks the counting on zeros, which I believe is intentional for sparse data. I can see that the ties on zeros are easily addressed when calculating the ustats
, but I never found it even mentioned when you calculate p-values. With suspecting that this is the cause, I modified my the Rcpp code in my copy:
src/fast_wilcox.cpp function cpp_rank_matrix_dgc
std::vector<std::list<float> > cpp_rank_matrix_dgc(
arma::vec& x, const arma::vec& p, int nrow, int ncol) {
vector<list<float> > ties(ncol);
int n_zero;
for (int i = 0; i < ncol; i++) {
n_zero = nrow - (p[i+1] - p[i]);
if (p[i+1] == p[i]) {
ties[i].push_back(n_zero);
continue;
}
ties[i] = cpp_in_place_rank_mean(x, p[i], p[i + 1] - 1);
ties[i].push_back(n_zero);
x.rows(p[i], p[i + 1] - 1) += n_zero;
}
return ties;
}
and also in function cpp_in_place_rank_mean
, it seems the lastly counted ties are also not pushed back as well.
// code originally between line 125-143
for (i = 1U; i < v_sort.size(); i++) {
if (v_sort[i].first != v_sort[i - 1].first) {
// .....
} else {
// .....
}
}
// Things look good until now but I guess we need to add the following line
// right after the loop ends
if (n > 1) ties.push_back(n);
It's very easy to test with something like
vec <- rep(0:2, 5)
y <- factor(rep(c(1,2), c(7,8)))
sparse <- as(t(vec), "CsparseMatrix")
presto::wilcoxauc(sparse, y)
feature group avgExpr logFC statistic auc pval padj pct_in pct_out
1 Feature1 1 0.8571429 -0.2678571 23 0.4107143 0.5888988 0.5888988 57.14286 75.00000
2 Feature1 2 1.1250000 0.2678571 33 0.5892857 0.5888988 0.5888988 75.00000 57.14286
wilcox.test(vec[1:7], vec[8:15])$p.value
[1] 0.581541
And if I apply the modification I described above, we can then have
wilcoxauc(sparse, y)
feature group avgExpr logFC statistic auc pval padj pct_in pct_out
1 Feature1 1 0.8571429 -0.2678571 23 0.4107143 0.581541 0.581541 57.14286 75.00000
2 Feature1 2 1.1250000 0.2678571 33 0.5892857 0.581541 0.581541 75.00000 57.14286
OK, all of these are based on that I want the function to be able to replicate what stats::wilcox.test()
can produce. If it was on purpose to ignore the ties on zeros and the last happening ties when calculating p-values, I would really love to learn about the reason. But overall I appreciate it so much that you have made this package!
-Yichen