stephaniehicks/qsmooth

Question: how is Qhat calculated?

Closed this issue · 3 comments

I want to use qsmooth for a really large dataframe (think 300.000.000 values, 1.000 samples). But it seems that qsmooth does unfortunately not scale well for large dataframes. So I thought I will implement my "own" version in Python 🤓 . The code is very clear for the most part, but I don't understand what Qhat is supposed to represent and how it is calculated.

# Compute SSB
if(is.factor(group_factor)){
  X = model.matrix(~ 0 + group_factor)
} else {
  X = model.matrix(~ group_factor)
}

QBETAS = t(solve(t(X) %*% X) %*% t(X) %*% t(Q))
Qhat = QBETAS %*% t(X)

When I print Qhat for a variety of different dataframes it seems that it is "just" the rankmeans of the group factor a sample belongs to. Am I correct to understand it as such? What is QBETAS? I do not follow what that is or how it is calculated.

Hi @Maarten-vd-Sande,

First, thank you for your interest in qsmooth! Second, I'm incredibly sorry for taking so long to respond to this issue (pandemic parenting has been hard 😓). Third, I hope you were able to make progress on your implementation of qsmooth and not wait for me, but for completeness I'll gladly respond to your question.

In the qsmooth paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5862355/pdf/kxx028.pdf) (sorry for switching the notation too), we are interested in fitting a model (see the paragraph above Equation 2.1 in linked the PDF)

Q = X * \beta + \epsilon

where Q (or F^{-1)(u) in the paper) are the observed quantiles, X (or Z in the paper) is the discrete covariate representing biological group or condition, \beta is the regression coefficients representing the quantile reference distributions within each biological group at each quantile.

To estimate the regression coefficients (\beta), we use the least squares estimate (e.g. https://en.wikipedia.org/wiki/Linear_least_squares):

\hat{\beta} = (X^T X)^{-1} * X^T * Q

\hat{\beta} is the QBETAS you asked about above (i.e. estimated regression coefficients).

Next, if we want the predicted Q values, we would just multiple the X * \hat{\beta} or

\hat{Q} = X * \hat{\beta}

\hat{Q} is the Qhat you asked about above (i.e. the quantile normalized values within each biological group).

These quantities are used in Equation 2.1 (see PDF linked above) to calculate the within and between sum of squares.

I hope this helps!

Hi @Maarten-vd-Sande -- I'm going to close this issue, but please do comment if you have any other questions and I'll respond.

Thanks for your answer. Currently I am on holidays 🌴so I haven't had the time to properly look at it. I'll let you know if something remains unclear!