Using coefficients to calculate pseudotime values for new data

Question

Using coefficients to calculate pseudotime values for new data

Closed this issue 5 years ago · 8 comments

I don't think there's a function baked into psupertime that automatically does it, but I'm trying to use the coefficients from the psupertime model to calculate pseudotime values for new data. The goal is to build the model from one set of data, but be able to give pseudotime values to data that wasn't included originally (eg. cells + some perturbation)

Based on the description in the paper (and what I can tease apart in the source code), it seems as if it should be as straight forward as doing matrix multiplication on the beta coefficients with the log-transformed expression matrix.

As a test, I've run psupertime on data, pulled the beta coefficients out of the object, and am trying to manually calculate pseudotime values from those. I get a decent correlation between the two, but it's not perfect. Am I missing something?

Run psupertime normally

# Note that "exp" is a matrix containing only HVGs
>psuper_obj <- psupertime(exp, timepoints, sel_genes="all", smooth=T,
                           penalization="best", min_expression=0)
>psuper_obj
psupertime object using 3133 cells * 3000 genes as input
    label ordering used for training: 0d, 8h, 1d, 3d, 7d
    genes selected for input: all
    # genes taken forward for training: 3000
    # genes identified as relevant: 443 (= 15% of training genes)
    mean training accuracy: 84%
    mean test accuracy: 87%

Grab coefficients ordered the same as HVGs
Because psupertime doesn't like hyphens, some gene symbols don't match up. Have to take a roundabout way to get the list

#This matches the order of VariableFeatures(), but has their renamed genes
input_genes <- rownames(psuper_obj$glmnet_best$beta)
#Remove four extra values at the end of the list that don't correspond to genes
#Don't know what cp1-4 are
input_genes <- input_genes[1:3000]
coef_values <- psuper_obj$beta_dt$beta
names(coef_values) <- psuper_obj$beta_dt$symbol
coef_values <- coef_values[input_genes]

pseudotime_values <- coef_values %*% exp

Then if I plot these values against the original pseudotime values from psuper_obj$proj_dt$psuper, I get a plot like this

Any thoughts?? I'm probably missing something obvious here.

David

Answer 1 · 2019-06-11T19:37:24.000Z

Hey @dpcook!

Cool that you're trying to use it for that purpose - that's one potential use which I had considered, where I think it could be really interesting.

As you've discovered, there are a couple of technical points involved when applying the coefficients learned from one dataset to another. The main one is scaling. I recommend running psupertime with scale=TRUE, so that each gene 'pays the same price' for coefficients (otherwise genes with larger variance would have outsize influence on the ordering, potentially drowning out interesting less variable genes). The default is to also do some denoising of the data, with the aim of making it easier to identify time-related genes.

The result of both of these is that the internal processed data used by psupertime to learn the coefficients has a slightly different distribution to the raw data, in particular it has different ranges.

This internal processed data is stored as psupertime$x_data. Here's a quick example doing what you were wanting to do:

# run standard simple example
library('psupertime')
data('acinar_hvg_sce')
psuper_obj = psupertime(acinar_hvg_sce, y=acinar_hvg_sce$donor_age, sel_genes='all')

# extract processed data
x_data = psuper_obj$x_data

# extract coeffs, put them in the same order as in x_data
beta = psuper_obj$beta_dt$beta
names(beta) = psuper_obj$beta_dt$symbol
beta = beta[colnames(x_data)]

# multiply, plot comparison
proj_new = x_data %*% beta
proj_old = psuper_obj$proj_dt$psuper
(qplot(x=proj_new, y=proj_old))

Answer 2 · 2019-06-11T19:40:24.000Z

The previous comment addressed the technical issue of why the coefficients weren't behaving as expected, but didn't actually help with your main goal...

To project new data onto a previously learned psupertime, I would recommend pre-processing all the data together before learning the first psupertime. You can then use a function in the package for doing the projecting, project_onto_psupertime. (However be warned I can't remember how thoroughly I've tested it!) And as you can see below, at the moment this is a bit hacky. Perhaps once you've had a bit of time to play around with it, you might have some suggestions for making this smoother.

So, assuming you have all data in exp, and some indices corresponding to the control and perturbed columns (idx_ctrl and idx_perturb), it might look like this:

# Pre-process all data together
sel_genes = rownames(exp)
temp_params = list(smooth=TRUE, scale=TRUE)
exp_processed = psupertime:::make_x_data(exp, sel_genes, temp_params)
exp_processed = t(exp_processed)

# Run psupertime on desired cells
psuper_obj = psupertime(exp_processed[, idx_ctrl], timepoints, sel_genes="all", smooth=FALSE,
                        scale=FALSE, penalization="best", min_expression=0)
# Project new data with learned coeffs
proj_new = project_onto_psupertime(psuper_obj, new_x=exp_processed[, idx_perturb],
                        new_y=perturb_ys, process=FALSE)

Hope that helps - happy to discuss further.
Will

Answer 3 · 2019-06-11T19:41:51.000Z

(One last minor comment re #Don't know what cp1-4 are: these are the 4 cutpoints learned by psupertime to separate your 5 timepoints)

Answer 4 · 2019-06-11T20:55:37.000Z

Hey Will--thank you for all the details! They definitely helped.

Running make_x_data first on the whole dataset and then psupertime only on the treated cells seems to solve everything. I can then take the coefficients from the psuper_obj and simply multiply it with exp_processed to get pseudotime values for all cells.

I tried project_onto_psupertime but got an error and didn't really know the cause:

proj_new <- project_onto_psupertime(psuper_obj, new_x=exp_processed[,perturb_idx],
                                    new_y=perturbed_labels, process=F)
Error in coeff_genes[both_genes, ] : invalid or not-yet-implemented 'Matrix' subsetting

But at least I can just take calculate pseudotime values using the coefficients end exp_processed

Answer 5 · 2019-06-12T10:25:54.000Z

Ok, glad that the workaround seems to work.

Re the error in project_onto_psupertime, I suspect there's a stray or missing transpose operation in there... Try this and let me know whether it works:

proj_new <- project_onto_psupertime(psuper_obj, new_x=t(exp_processed[,perturb_idx]),
                                    new_y=perturbed_labels, process=F)

Answer 6 · 2019-06-21T21:07:23.000Z

Sorry for the delay confirming this! Got caught up with some other analyses. Just revisited this and can confirm that it was a simple transposition issue, like you had mentioned.

Thanks for the help with this--it works perfectly!

Answer 7 · 2023-02-27T09:51:05.000Z

Hi, I was looking at this thread and thought I don't need to re-open a new issue. @dpcook it seems like you have found a workaround to project pseudotime onto a new data. Will you please share the codes/steps. It will be a valuable help as I am trying to do the same but can't figure out, how I should approach this.

Bests.
Nurun

Answer 8 · 2023-02-27T19:18:46.000Z

Hi @nfancy--been a couple years since I was messing around with this. This was the script I ended up running, but haven't checked if anything has changed since then:

# Projecting new data from seurat object onto previous model
# seurat_kinase$Drug is a metadata column with the specific perturbation--including controls
 exp <- as.matrix(seurat_kinase[["RNA"]]@data)
 temp_params <- list(smooth=TRUE, scale=TRUE)
 exp_processed <- psupertime:::make_x_data(exp, VariableFeatures(seurat_kinase),
                                            temp_params)
  
 new_psuper <- project_onto_psupertime(psuper_obj, new_x=exp_processed,
                                        new_y=seurat_kinase$Drug, process=F)
  rownames(new_psuper) <- colnames(seurat_kinase)