bioinformatics-datasets

Public datasets for testing, benchmarking, and comparison.

Note: The PDF Supplemental tables provide an annotated list of GEO datasets for disease-specific and knockout (KO) experiments. These can be processed with the GEO protocol below to generate differentially expressed (DE) gene sets (see the TSV files in the subfolders).

Collect Datasets

GEO Protocol

Go to https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE22873, replacing GSE22873 in the URL with the GEO accession ID of the dataset you want.
Expand Samples.
Define two groups, e.g., cancer/normal or ko/wt.
Assign samples to groups.
Click on Top 250.
Confirm directionality, i.e., a gene with increased expression in cancer vs. normal should have a positive log2FC.
Click on Save all results.
Save as TSV into appropriate subfolder.
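
The directionality check in the protocol above can also be repeated on the saved file. The following is a minimal sketch, not part of this repository's scripts. It assumes the saved table has the columns logFC, adj.P.Val and Gene.symbol; the function name check_geo2r_direction and the demo gene names are hypothetical.

```r
# Sketch: confirm that a known marker gene has the expected sign of logFC
# in a saved GEO2R table. Column names are an assumption about the export.
check_geo2r_direction <- function(tsv_path, marker_gene, expect_up = TRUE) {
  results <- read.delim(tsv_path, stringsAsFactors = FALSE)
  hits <- results[which(results$Gene.symbol == marker_gene), , drop = FALSE]
  if (nrow(hits) == 0) {
    stop("Marker gene not found: ", marker_gene)
  }
  # When a gene has several rows (probes), use the most significant one.
  best <- hits[which.min(hits$adj.P.Val), ]
  (best$logFC > 0) == expect_up
}

# Demo on a tiny made-up table written to a temporary file.
demo <- data.frame(
  ID = c("p1", "p2", "p3"),
  adj.P.Val = c(0.001, 0.2, 0.03),
  logFC = c(2.1, -0.4, -1.5),
  Gene.symbol = c("GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)
demo_path <- tempfile(fileext = ".tsv")
write.table(demo, demo_path, sep = "\t", quote = FALSE, row.names = FALSE)
direction_ok <- check_geo2r_direction(demo_path, "GENE_A", expect_up = TRUE)
```

If the check returns FALSE for a gene known to be up in the first group, the two groups were probably defined in the opposite order in GEO2R.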

Harmonizome Protocol

Download gene-attribute-matrix-standardized and attribute-list txt.gz files.
Unzip files.
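
If you prefer to unzip from within R, the step above could be done with base R alone. This is a minimal sketch; the helper name gunzip_text is hypothetical, and it reads the whole file into memory, which may be slow for a large matrix file.

```r
# Sketch: decompress a gzip-compressed text file using base R connections.
gunzip_text <- function(gz_path, out_path = sub("\\.gz$", "", gz_path)) {
  con <- gzfile(gz_path, open = "rt")
  on.exit(close(con))
  writeLines(readLines(con), out_path)
  invisible(out_path)
}

# Demo: create a small .txt.gz file, then decompress it.
demo_gz <- tempfile(fileext = ".txt.gz")
con <- gzfile(demo_gz, open = "wt")
writeLines(c("gene\tattr1", "GENE_A\t1.5"), con)
close(con)
demo_txt <- gunzip_text(demo_gz)
```

The decompressed file is written next to the original, with the .gz suffix dropped.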

Compile Datasets

Source build-gene-sets.R.
Run buildGeoSets() to generate a top-level TSV file compiled from each subfolder of collected datasets.
Run buildHarmonizomeSets() to generate a top-level TSV file compiled from downloaded files.
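
To illustrate what compiling from subfolders means, here is a minimal sketch. It is not the implementation of buildGeoSets(). It assumes one subfolder per disease, one TSV per dataset named by its accession, and identical columns in every TSV; the function name compile_geo_tsvs and the demo accessions are hypothetical.

```r
# Sketch: stack every TSV found one level below root_dir into one data frame,
# tagging each row with its dataset (file name) and disease (folder name).
compile_geo_tsvs <- function(root_dir) {
  subfolders <- list.dirs(root_dir, recursive = FALSE)
  tables <- list()
  for (folder in subfolders) {
    for (path in list.files(folder, pattern = "\\.tsv$", full.names = TRUE)) {
      tbl <- read.delim(path, stringsAsFactors = FALSE)
      tbl$dataset <- sub("\\.tsv$", "", basename(path))
      tbl$disease <- basename(folder)
      tables[[length(tables) + 1]] <- tbl
    }
  }
  do.call(rbind, tables)
}

# Demo with two made-up datasets in a temporary folder tree.
demo_root <- tempfile("geo_demo")
dir.create(file.path(demo_root, "cancer"), recursive = TRUE)
dir.create(file.path(demo_root, "diabetes"))
write_demo <- function(path, genes) {
  write.table(data.frame(Gene.symbol = genes, logFC = seq_along(genes)),
              path, sep = "\t", quote = FALSE, row.names = FALSE)
}
write_demo(file.path(demo_root, "cancer", "GSE0001.tsv"), c("GENE_A", "GENE_B"))
write_demo(file.path(demo_root, "diabetes", "GSE0002.tsv"), "GENE_C")
compiled <- compile_geo_tsvs(demo_root)
```

The result could then be saved as a single top-level file with write.table(compiled, sep = "\t", ...).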

Query Compiled Datasets for DE Genes

Source query-gene-sets.R to access functions that extract gene sets and their associated data values per dataset.
- listGeoSets(T) will summarize the unique DE genes with FDR < 0.5 per dataset per disease
- genesByGeoSet("GSE6357", T, T) will return a dataframe of DE genes with FDR < 0.05 and log2FC > 0
- listHarmonizomeSets() will summarize unique DE genes with P-values < 0.5 per dataset per disease, providing a standardizedValue, "sv", from Harmonizome equal to -log10(p-value) * sign(log2FC).
- genesByHarmonizomeSet("GSE3467", T) will return a dataframe of DE genes with sv > 0.
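
As a worked example of the "sv" formula described above, -log10(p-value) * sign(log2FC), the following sketch computes it on made-up numbers and keeps the rows with sv > 0. The function name standardized_value and the demo table are hypothetical and are not part of query-gene-sets.R.

```r
# Sketch: signed significance score as described in the text.
# Larger magnitude means a smaller p-value; the sign follows log2FC.
standardized_value <- function(p_value, log2fc) {
  -log10(p_value) * sign(log2fc)
}

demo <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C"),
  p_value = c(0.001, 0.01, 0.5),
  log2FC = c(2, -1, 0.3),
  stringsAsFactors = FALSE
)
demo$sv <- standardized_value(demo$p_value, demo$log2FC)

# Keep genes with a positive score, i.e. increased expression.
up_genes <- demo[demo$sv > 0, ]
```

Here GENE_A gets sv = 3 and GENE_B gets sv = -2, so only GENE_A and GENE_C remain in up_genes.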

gladstone-institutes/bioinformatics-datasets
