slowkow/harmonypy

Question re variance explained from harmony-corrected components

Closed this issue · 4 comments

Hi there,
thanks for implementing harmony for python - worked fine for me! I was wondering if there was any more information about how to get the variance explained for each harmony components (similar to sd in Seurat objects)? This issue was solved for the R version here, but I was wondering if there is a way to get this information from the python output object? Sorry if this is obvious but I couldn't find details about the output object or how to get this from the output.
Thanks a lot,
Kathrin

Hi Kathrin, thanks for the question.

It sounds like you are giving the PCA coordinates of cells as input to Harmony. Is that right? If so, please see the documentation for your PCA function to get the variance explained for your PCs.

For example, you might be using the PCA function from the sklearn Python package:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Here is a detailed tutorial that shows how to use that function and find the variance explained:

https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

In the issue you linked to, Ilya mentions that he added standard deviations of Harmony output to the Seurat V3 object. Here's the code for that:

https://github.com/immunogenomics/harmony/blob/88b1e2a1e95306cd4531eaf89ad98712af05ca99/R/RunHarmony.R#L149-L154

    harmonydata <- Seurat::CreateDimReducObject(
      embeddings = harmonyEmbed,
      stdev = as.numeric(apply(harmonyEmbed, 2, stats::sd)),
      assay = assay.use,
      key = reduction.save
    )

You can see that it's applying the sd() function to each column of the Harmony output matrix.

Maybe that's what you're looking for?

In Python, you can use numpy to get the standard deviation of each column of a matrix:

https://numpy.org/doc/1.18/reference/generated/numpy.std.html

Thanks for the quick and detailed answer! I will calculate it using some of your suggestions.
May I ask your opinion on something related to this? Would you usually select the number of PCAs to use before running harmony (e.g. do an elbow plot) or is it acceptable to do a component selection post running harmony?

I like to run multiple analyses with different parameters to get some sense of how they can affect the final results. The methods section of published papers can occasionally be helpful to see what strategies people use to choose parameters.

For general questions, you might consider asking for analysis advice on Biostars.

Thanks for the answer, sorry if I hadn't phrase the question very clearly: it was specifically about harmony component selection. No worries though, you've answered my original question. I'll close this now, thanks again for the great tool & your help!