RNA-seq: Pathway Analysis - GSVA
cansavvy opened this issue · 9 comments
What are the goals of this new example analysis?
We have a GSVA example for microarray, but should create one for RNA-seq.
What kind of dataset will this need?
We need an RNA-seq dataset that we can normalize before running GSVA
What steps should be included in this analysis?
I think most of the GSVA steps from the microarray example can stay the same, with these exceptions I can tell as of now:
- I think we'll need to use `DESeq2` and `vst()` normalized data.
- The "Handling duplicate gene identifiers" strategy may need to be different (currently it picks the max value for each sample).
- It's unclear to me if/which parameters for the `gsva()` run in "Perform GSVA" should be changed -- something to look into.
Between data types, the main parameter you might want to change is the `kcdf` argument, but if using transformed RNA-seq data it should be the same, if I recall correctly.
When this issue is addressed, note that the intro paragraph from #349 will need to be added here, and the table will need to be made to reflect the RNA-seq versions of the analyses.
For handling the duplicate identifiers for GSVA, I'm not sure mixing values from different Ensembl IDs in RNA-seq makes as much sense as it did for microarray.
Should I instead switch to something where we pick one Ensembl ID's values over the other(s) -- take them as a set? (Based on bigger average or bigger variance?)
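To make the two options concrete, here is a minimal sketch of both strategies on toy data. It is plain Python for illustration only, not the notebook's R code; the IDs, values, and function names are all hypothetical.

```python
from statistics import mean

# Hypothetical toy data: two Ensembl IDs map to the same Entrez ID ("1234"),
# plus one uniquely mapped gene. Values stand in for normalized expression
# across three samples.
expression = {
    "ENSG_A": {"entrez": "1234", "values": [5.0, 9.0, 4.0]},
    "ENSG_B": {"entrez": "1234", "values": [8.0, 6.0, 7.0]},
    "ENSG_C": {"entrez": "5678", "values": [2.0, 3.0, 1.0]},
}


def max_per_sample(rows):
    """Current strategy: per Entrez ID, take the max value in each sample.

    The result can mix values that come from different Ensembl IDs.
    """
    grouped = {}
    for row in rows.values():
        grouped.setdefault(row["entrez"], []).append(row["values"])
    return {
        entrez: [max(column) for column in zip(*value_lists)]
        for entrez, value_lists in grouped.items()
    }


def max_average_row(rows):
    """Alternative: keep one whole Ensembl ID per Entrez ID (highest mean)."""
    best = {}
    for ensembl_id, row in rows.items():
        entrez = row["entrez"]
        current = best.get(entrez)
        if current is None or mean(row["values"]) > mean(rows[current]["values"]):
            best[entrez] = ensembl_id
    return {entrez: (ens, rows[ens]["values"]) for entrez, ens in best.items()}


mixed = max_per_sample(expression)
picked = max_average_row(expression)
```

In this toy case `mixed["1234"]` combines values from both Ensembl IDs, while `picked["1234"]` keeps the values of a single Ensembl ID as a set. Swapping `mean` for `statistics.variance` would give the "bigger variance" variant.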
@cansavvy I would take a look at what was originally in #352 and what is in this comment #352 (comment)
My guess, since you referenced that comment, is yes: we still think this makes sense since it's still on a per-sample basis?
A few things we discussed over video chat that I'm going to change in the draft. These things should help shorten the notebook (which is currently ~800 lines).
I've taken the general outline from my draft PR and made the following edits to it:
EDITED:
- Set up data
- Filter out low counts
- DESeq2 normalize and transform (use vst)
- Still Hallmark pathways only
- Do gene ID convert to Entrez IDs
- Resolve multi mapped Entrez IDs ~~same way as before~~ select one Ensembl ID by max average
- Use normalized data for GSVA
- ~~Differential expression with limma~~
- Move metadata label cleaning to down here!
- ~~Make sina plot of most DE pathway~~ Heatmap of pathways (without any DE)
- Save plot, session info
> ~~Hallmark pathways only~~ Use all of them!
I disagree with this - there are only 50 hallmark gene sets so you can put them all in a heatmap.
I'm going to close this issue. Any changes to this example will come about as part of #371.