Public datasets for testing, benchmarking and comparison
Note: The PDF supplemental tables provide an annotated list of GEO datasets from disease-specific and knockout (KO) experiments. These can be processed with the protocol below to generate differentially expressed (DE) gene sets (see the TSV files in subfolders).
- Go to https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE22873, replacing the GEO ID in the URL (GSE22873) with that of the dataset to process.
- Expand Samples.
- Define two groups, e.g., cancer/normal or ko/wt.
- Assign samples to groups.
- Click on Top 250.
- Confirm directionality, i.e., a gene with increased expression in cancer vs. normal should have a positive log2FC.
- Click on Save all results.
- Save as TSV into the appropriate subfolder.
- Download the gene-attribute-matrix-standardized and attribute-list txt.gz files from Harmonizome.
- Unzip the files.
- Source `build-gene-sets.R`.
- Run `buildGeoSets()` to generate a top-level TSV file compiled from each subfolder of collected datasets.
- Run `buildHarmonizomeSets()` to generate a top-level TSV file compiled from downloaded files.
- Source `query-gene-sets.R` to access functions to extract gene sets and associated data values per dataset.
- `listGeoSets(T)` will summarize the unique DE genes with FDR < 0.5 per dataset per disease.
- `genesByGeoSet("GSE6357", T, T)` will return a dataframe of DE genes with FDR < 0.05 and log2FC > 0.
- `listHarmonizomeSets()` will summarize unique DE genes with P-values < 0.5 per dataset per disease, providing a standardizedValue, "sv", from Harmonizome equal to -log10(p-value) * sign(log2FC).
- `genesByHarmonizomeSet("GSE3467", T)` will return a dataframe of DE genes with sv > 0.
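
The filters described above (FDR < 0.05 with log2FC > 0, and sv > 0) can be illustrated with a minimal R sketch on a toy table. This is not the code in `build-gene-sets.R` or `query-gene-sets.R`. The helper names `filterUp` and `standardizedValue` are hypothetical, the gene symbols and values are made up for illustration, and the column names (`Gene.symbol`, `logFC`, `P.Value`, `adj.P.Val`) are an assumption about what a GEO2R export contains, so check them against the actual TSV header.

```r
# Toy table standing in for a GEO2R "Save all results" export.
# Column names are assumed; values are invented for illustration.
geo2r <- data.frame(
  Gene.symbol = c("GENE1", "GENE2", "GENE3", "GENE4"),
  logFC       = c(2.1, -1.4, 0.8, 1.2),
  P.Value     = c(0.0001, 0.001, 0.2, 0.01),
  adj.P.Val   = c(0.001, 0.01, 0.4, 0.04),
  stringsAsFactors = FALSE
)

# Round-trip through a TSV file to mimic reading a saved export.
tsv <- tempfile(fileext = ".tsv")
write.table(geo2r, tsv, sep = "\t", quote = FALSE, row.names = FALSE)
de <- read.delim(tsv, stringsAsFactors = FALSE)

# Hypothetical helper: keep significant, up-regulated genes
# (FDR below the cutoff and positive log2FC).
filterUp <- function(df, fdr = 0.05) {
  df[df$adj.P.Val < fdr & df$logFC > 0, , drop = FALSE]
}

# Hypothetical helper: signed standardized value,
# -log10(p-value) * sign(log2FC).
standardizedValue <- function(p, log2fc) {
  -log10(p) * sign(log2fc)
}

up <- filterUp(de)
de$sv <- standardizedValue(de$P.Value, de$logFC)

print(up$Gene.symbol)
print(de[de$sv > 0, c("Gene.symbol", "sv")])
```

Because sv carries the sign of log2FC, selecting sv > 0 keeps up-regulated genes regardless of significance, so it is typically combined with a p-value cutoff.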