bioinformatics-datasets

public datasets for testing, benchmarking and comparison

Note: The PDF supplemental tables provide an annotated list of GEO datasets for disease-specific and knockout (KO) experiments. These datasets can be processed with the GEO protocol below to generate differentially expressed (DE) gene sets (see the TSV files in the subfolders).

Collect Datasets

GEO Protocol

  1. Go to https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE22873, replacing GSE22873 with the GEO accession ID of your dataset.
  2. Expand Samples.
  3. Define two groups, e.g., cancer/normal or ko/wt.
  4. Assign samples to groups.
  5. Click on Top 250.
  6. Confirm directionality, i.e., a gene with increased expression in cancer vs. normal should have a positive log2FC.
  7. Click on Save all results.
  8. Save the results as a TSV file in the appropriate subfolder.
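
Steps 1 and 6 of the protocol above can be partly scripted. The sketch below builds the GEO2R URL for an accession ID and counts significant genes by direction in a saved results file. It assumes the saved file is tab-delimited with `adj.P.Val` and `logFC` columns; the helper names `geo2rUrl` and `checkDirection`, the column names, and the example rows are illustrative assumptions, not part of this repository, so check them against your actual export.

```r
# Hypothetical helper: build the GEO2R URL for an accession ID (step 1).
geo2rUrl <- function(acc) {
  paste0("https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=", acc)
}

# Hypothetical helper: read a saved results TSV and count significant genes
# by direction. The column names (adj.P.Val, logFC) are assumed.
checkDirection <- function(path, fdr = 0.05) {
  res <- read.delim(path, stringsAsFactors = FALSE)
  sig <- res[!is.na(res$adj.P.Val) & res$adj.P.Val < fdr, ]
  c(up = sum(sig$logFC > 0), down = sum(sig$logFC < 0))
}

# Made-up rows standing in for a real export, written to a temporary file.
demo <- data.frame(
  ID = c("p1", "p2", "p3"),
  adj.P.Val = c(0.001, 0.02, 0.8),
  logFC = c(2.1, -1.4, 0.3),
  Gene.symbol = c("GENE1", "GENE2", "GENE3"),
  stringsAsFactors = FALSE
)
demoPath <- tempfile(fileext = ".tsv")
write.table(demo, demoPath, sep = "\t", quote = FALSE, row.names = FALSE)

url <- geo2rUrl("GSE22873")
counts <- checkDirection(demoPath)
```

If a gene expected to be up in cancer shows a negative value here, the group assignment from steps 3 and 4 is worth rechecking before saving the file.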

Harmonizome Protocol

  1. Download gene-attribute-matrix-standardized and attribute-list txt.gz files.
  2. Unzip files.
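
The unzip step can also be done from within R. The following is a minimal sketch using only base R; the helper name `gunzipFile`, the file name, and the file contents are placeholders made up for illustration, not the actual Harmonizome files.

```r
# Hypothetical helper: decompress a .gz text file next to the original,
# using only base R connections.
gunzipFile <- function(gzPath) {
  outPath <- sub("\\.gz$", "", gzPath)
  con <- gzfile(gzPath, open = "rt")
  on.exit(close(con))
  writeLines(readLines(con), outPath)
  outPath
}

# Made-up stand-in for a downloaded file, written to a temporary directory.
demoDir <- tempfile("harmonizome")
dir.create(demoDir)
gzPath <- file.path(demoDir, "attribute-list.txt.gz")
gzCon <- gzfile(gzPath, open = "wt")
writeLines(c("col1\tcol2", "a\tb"), gzCon)
close(gzCon)

txtPath <- gunzipFile(gzPath)
unzipped <- readLines(txtPath)
```

This reads the whole file into memory, which is fine for small files; for very large matrices a command-line tool such as gunzip may be more practical.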

Compile Datasets

  1. Source build-gene-sets.R.
  2. Run buildGeoSets() to generate a top-level TSV file compiled from each subfolder of collected datasets.
  3. Run buildHarmonizomeSets() to generate a top-level TSV file compiled from downloaded files.
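
To illustrate what compiling per-subfolder TSV files into one table can look like, here is a rough sketch. It is not the implementation in build-gene-sets.R: the function name `compileTsvs`, the added `folder` and `dataset` columns, and the example files are all assumptions, and it only works when every TSV has the same columns.

```r
# Hypothetical sketch: stack every TSV found under `root` (searched
# recursively), tagging each row with its parent folder and file name.
compileTsvs <- function(root) {
  files <- list.files(root, pattern = "\\.tsv$", recursive = TRUE,
                      full.names = TRUE)
  tables <- lapply(files, function(f) {
    tab <- read.delim(f, stringsAsFactors = FALSE)
    tab$folder <- basename(dirname(f))
    tab$dataset <- sub("\\.tsv$", "", basename(f))
    tab
  })
  do.call(rbind, tables)
}

# Made-up folder layout and rows, created in a temporary directory.
root <- tempfile("datasets")
dir.create(file.path(root, "diseaseA"), recursive = TRUE)
dir.create(file.path(root, "diseaseB"))
row <- data.frame(Gene.symbol = "GENE1", logFC = 1.5, adj.P.Val = 0.01,
                  stringsAsFactors = FALSE)
write.table(row, file.path(root, "diseaseA", "GSE_A.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(row, file.path(root, "diseaseB", "GSE_B.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)

compiled <- compileTsvs(root)
```

Keeping the folder and file name as columns makes it possible to filter the combined table per disease or per dataset later on.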

Query Compiled Datasets for DE Genes

  • Source query-gene-sets.R to access functions that extract gene sets and their associated data values per dataset.
    • listGeoSets(T) summarizes the unique DE genes with FDR < 0.05 per dataset per disease.
    • genesByGeoSet("GSE6357", T, T) returns a data frame of DE genes with FDR < 0.05 and log2FC > 0.
    • listHarmonizomeSets() summarizes the unique DE genes with P-values < 0.5 per dataset per disease. Each gene carries a standardized value, "sv", from Harmonizome, equal to -log10(p-value) * sign(log2FC).
    • genesByHarmonizomeSet("GSE3467", T) returns a data frame of DE genes with sv > 0.
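
The standardized value described above can be reproduced with a few lines of base R. This is a small sketch of the formula -log10(p-value) * sign(log2FC) and of the sv > 0 filter; the function name `standardizedValue` and the example values are made up for illustration and are not taken from query-gene-sets.R.

```r
# Standardized value as described above: -log10(p-value) * sign(log2FC).
standardizedValue <- function(p, log2fc) {
  -log10(p) * sign(log2fc)
}

# Made-up example values, for illustration only.
demo <- data.frame(
  gene = c("GENE1", "GENE2", "GENE3"),
  p = c(0.001, 0.01, 0.5),
  log2fc = c(2, -1, 0.2),
  stringsAsFactors = FALSE
)
demo$sv <- standardizedValue(demo$p, demo$log2fc)

# Keep genes with sv > 0, i.e. those with a positive log2FC.
up <- demo[demo$sv > 0, ]
```

For example, a p-value of 0.001 with a positive log2FC gives sv = 3, and the same p-value with a negative log2FC gives sv = -3, so the sign carries the direction and the magnitude carries the significance.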