Get genes associated with pathway terms
ayeTown opened this issue · 12 comments
Hi there,
Two questions.
- Is there a way to obtain the genes for each of the pathway terms after running the analysis?
- After running the analysis I researched a couple of the terms and noticed they were associated with humans. However, I had run the analysis using mouse terms. Is this expected? Maybe the real issue is I don't understand the database well enough.
My pathways were obtained as follows:
pathways <- bind_rows(
  msigdbr(species = "mouse", category = "H"),
  msigdbr(species = "mouse", category = "C2", subcategory = "CP:REACTOME"),
  msigdbr(species = "mouse", category = "C3", subcategory = "TFT:GTRD"),
  msigdbr(species = "mouse", category = "C5")
)
As I said previously some of the resulting terms after running the analysis are actually human terms.
Hi,
The genes and pathway names are all stored in your pathways object after you run format_pathways(), so you can filter that to get your genes. Beyond that, there are a few other ways to get the genes. A few small examples:
library(msigdbr)
library(dplyr)
library(magrittr)
library(SCPA)
# You can get them from the pathway list you created to do the analysis e.g.
pathways <- msigdbr("Mus musculus", "H") %>%
  format_pathways()
names(pathways) <- sapply(pathways, function(x) x$Pathway[1]) # just to name the list, so easier to visualise
pathways$HALLMARK_GLYCOLYSIS$Genes
# You can get specific pathways from msigdbr e.g.
gly_genes <- msigdbr("Mus musculus", "H") %>%
  filter(gs_name == "HALLMARK_GLYCOLYSIS") %>%
  pull(gene_symbol)
# Or create a list of specific pathways from msigdbr e.g.
# (named pathway_names so it doesn't overwrite the pathways list created above)
pathway_names <- c("HALLMARK_GLYCOLYSIS", "HALLMARK_COMPLEMENT")
path_genes <- msigdbr("Mus musculus", "H") %>%
  filter(gs_name %in% pathway_names) %>%
  select(gs_name, gene_symbol) %>% # select whatever columns you want here
  group_split(gs_name) # this creates one list element per pathway, so you can get the genes easily
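If a named list is more convenient than the unnamed one that group_split() returns, base R's split() is one alternative. This is only a minimal sketch on a toy data frame that mimics the gs_name and gene_symbol columns of msigdbr output; the gene symbols are placeholders for illustration, not real set membership.

```r
# Toy stand-in for msigdbr output; only the two relevant columns are included
toy_sets <- data.frame(
  gs_name = c("HALLMARK_GLYCOLYSIS", "HALLMARK_GLYCOLYSIS", "HALLMARK_COMPLEMENT"),
  gene_symbol = c("GeneA", "GeneB", "GeneC"),  # placeholder symbols
  stringsAsFactors = FALSE
)

# split() returns a list named by pathway, each element a character vector of genes
genes_by_pathway <- split(toy_sets$gene_symbol, toy_sets$gs_name)

genes_by_pathway$HALLMARK_GLYCOLYSIS
```

With real data you would pass the msigdbr data frame in place of toy_sets.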
For the second part, I'm not completely sure what you mean, but this is definitely coming from the MSigDB side -- SCPA just takes whatever pathway names and genes you feed into it. Do you have an example of the pathways you're referring to? MSigDB, which is the basis for msigdbr, has always been built around human gene sets, and they recently added functionality to derive mouse-equivalent gene sets (you can read about the specific mouse pathway implementations here). I assume what you're seeing are gene sets whose gene symbols and names were converted directly from the human equivalents.
Jack
Thanks for responding so fast. So for the second question, below is a screenshot of the head of the results. The first pathway term is "ADNP_TARGET_GENES". If I look this up on the MSigDB side, the gene set is described as a human gene set: https://www.gsea-msigdb.org/gsea/msigdb/cards/ADNP_TARGET_GENES.html
So it looks like the msigdbr package was built on MSigDB v7.5.1 (see here), just before the addition of the mouse release -- I wasn't fully aware of this. msigdbr uses a few different sources to convert human genes to the mouse homologues (see here), which means that when you extract "mouse gene sets", it will be effectively the human gene sets with converted gene names, which is why you see human gene sets. If you wanted to use the mouse specific gene sets with SCPA you can always download the actual mouse gene sets from MSigDB mouse collections -- selecting the "gene symbols" version of the gmt file -- and use the .gmt filepath directly in the compare_pathways() function.
scpa_out <- compare_pathways(samples = list(pop1, pop2),
                             pathways = "gene_sets/mh.all.v2022.1.Mm.symbols.gmt")
Thinking about this... to make it easier, I'll add some more documentation to the SCPA gene set webpage about mouse gene sets and using .gmt files. I'll also likely add some functionality so you can specify the filepaths of multiple gmt files at the same time, which you can't currently do.
Okay, this sounds great! Thanks so much.
(just FYI... if you go this way, you currently need to delete the URL column from the .gmt file before you use it in SCPA)... not ideal, but I'm working on a reasonable way to avoid this
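For versions where this manual step is still needed, here is a rough base R sketch of dropping the URL column. It assumes the usual tab-separated gmt layout (set name, then URL/description, then genes). strip_gmt_url is a hypothetical helper written for this example, not part of SCPA, and the file contents are made up.

```r
# Write a tiny gmt-style file to a temp location so the sketch is self-contained
# (tab-separated: set name, URL/description, then genes; contents are made up)
gmt_in <- tempfile(fileext = ".gmt")
writeLines(c(
  "SET_A\thttp://example.org/SET_A\tGeneA\tGeneB\tGeneC",
  "SET_B\thttp://example.org/SET_B\tGeneD\tGeneE"
), gmt_in)

# Hypothetical helper: drop the second tab-separated field (the URL column) from every line
strip_gmt_url <- function(path_in, path_out) {
  fields <- strsplit(readLines(path_in), "\t", fixed = TRUE)
  stripped <- vapply(fields, function(x) paste(x[-2], collapse = "\t"), character(1))
  writeLines(stripped, path_out)
  invisible(path_out)
}

gmt_out <- tempfile(fileext = ".gmt")
strip_gmt_url(gmt_in, gmt_out)
readLines(gmt_out)
```

Writing to a new file rather than overwriting keeps the original download intact in case you need the URLs later.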
An update:
SCPA is now able to handle raw gmt files downloaded straight from MSigDB -- you just need to reinstall the latest version (v1.5.2) of SCPA. You can specify one or more filepaths of the gmt files that you want to use, and SCPA will format and merge all the gmt files in your analysis. There's an example of how to do this in the gene set tutorials here. Hopefully that should make it easier if people want to use the updated human or mouse gmt versions.
The gmt file list specification for the pathways works well for me. I was wondering if there is an easy way to add a column to the results which identifies which gmt file the gene set originated from.
I just used a modified version of the get_paths function to obtain that information. Would be cool if you could add that in with the gmt file list option.
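The modified get_paths function isn't shown in the thread, so the following is not that function. It is a minimal base R sketch of one way to build a gene set to source file lookup and attach it to a results table. It assumes standard tab-separated gmt files and a results data frame with a Pathway column; gmt_sources, the toy files, and the toy results are all hypothetical.

```r
# Two toy gmt files (made-up contents) so the sketch runs on its own
gmt_dir <- tempfile("gmt_")
dir.create(gmt_dir)
file_a <- file.path(gmt_dir, "hallmark.gmt")
file_b <- file.path(gmt_dir, "reactome.gmt")
writeLines("SET_A\tna\tGeneA\tGeneB", file_a)
writeLines(c("SET_B\tna\tGeneC", "SET_C\tna\tGeneD\tGeneE"), file_b)

# Hypothetical helper: map every gene set name (first field of each line) to its file
gmt_sources <- function(paths) {
  per_file <- lapply(paths, function(p) {
    set_names <- vapply(strsplit(readLines(p), "\t", fixed = TRUE), `[`, character(1), 1)
    data.frame(Pathway = set_names, gmt_file = basename(p), stringsAsFactors = FALSE)
  })
  do.call(rbind, per_file)
}

lookup <- gmt_sources(c(file_a, file_b))

# Hypothetical results table; only the Pathway column matters for the lookup
results <- data.frame(Pathway = c("SET_C", "SET_A"), qval = c(5.1, 3.2))
results$gmt_file <- lookup$gmt_file[match(results$Pathway, lookup$Pathway)]
results
```

If the same gene set name appears in more than one file, match() keeps only the first hit, so duplicated names would need separate handling.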
Yup, I can work on including that information as an optional output. Happy to look at including your function if you want to submit a PR in the meantime
Hey there @jackbibby1, I have a similar request! I appreciate you looking into this. SCPA is definitely the easiest pathway analysis I've done.
It's missing one particular feature from some other packages/programs, though: once you run the analysis, you can't look any deeper into what specific features impact the enrichment of a certain point.
Is there a way to include "impactful genes", or the specifically enriched genes/differentially expressed genes within a pathway? It's definitely more complicated for SCPA than other subsets, because you include percentages of expression and the like, but without it, it's harder to pin down the direct targets, for example, of a specific gene.
Is that something that might be doable?
Hi,
Yup, this is something that we've discussed for a while, but we haven't found a reasonable feature selection method that would be appropriate, given the complex analysis that SCPA performs. Like you mention, highlighting differentially expressed genes would be too simplistic, given that SCPA encompasses more than this.
This is part of a broader discussion on how biological pathways should be interpreted, but my general advice for approaching this is to filter out any genes of your pathway that aren't expressed, visualise the rest of the pathway (e.g. summarised heatmap), and then use your biological knowledge to guide the interpretation of which sets of genes are likely important e.g. apex transcription factors, critical enzymes, or just unexpected genes. In the end, given that pathways encompass the coordinate expression of many genes, which SCPA is trying to capture, I think this nuanced approach may give you a more complete picture of the pathway activity (even if it seems like more work upfront).
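As a rough illustration of the first two steps (dropping unexpressed genes, then summarising the rest per population), here is a minimal base R sketch on a toy matrix. The expression values, the population labels, and the "detected in at least one cell" cutoff are all assumptions for the example, not something SCPA prescribes.

```r
# Toy expression matrix: genes in rows, cells in columns (values are made up)
expr <- matrix(
  c(0, 0, 0, 0,
    2, 3, 0, 1,
    5, 4, 6, 7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), paste0("cell", 1:4))
)
population <- c("pop1", "pop1", "pop2", "pop2")  # population label per cell
pathway_genes <- c("GeneA", "GeneB", "GeneC", "GeneD")  # GeneD is absent from the data

# Keep pathway genes that are present in the data and detected in at least one cell
present <- intersect(pathway_genes, rownames(expr))
expressed <- present[rowSums(expr[present, , drop = FALSE] > 0) > 0]

# Mean expression per population: a genes x populations matrix for a summarised heatmap
pop_means <- sapply(unique(population), function(p) {
  rowMeans(expr[expressed, population == p, drop = FALSE])
})
pop_means
# heatmap(pop_means, Rowv = NA, Colv = NA, scale = "none")  # optional base R plot
```

Which genes in the resulting matrix actually matter is still the biological judgement call described above; the code only narrows down what you look at.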
This is something that is always being discussed though, so I can let you know if there's anything implemented in the future.
Jack