Assessing the quality of experiments, both directly and through replication attempts, is becoming a grave concern in biomedicine. This is especially true of omics experiments, where thousands of independent measurements are made in parallel, with the understanding that only a small minority will probe scientifically meaningful effects. Such a state of affairs is conducive to mistaking false positives for scientific discoveries.
Here, we assessed the quality of the high-throughput sequencing experiments submitted to the NCBI Entrez GEO database up to 2020-12-31.
NCBI GEO DataSets queries were performed using the Bio.Entrez Python package and by sending requests to the NCBI Entrez public API. The query was 'expression profiling by high throughput sequencing[DataSet Type] AND ("2000-01-01"[PDAT] : "2020-12-31"[PDAT])'. FTP links from GEO DataSets document summaries were used to download lists of supplementary files. Supplementary files were filtered before downloading, based on file extension, to keep file names with the "tab", "xlsx", "diff", "tsv", "xls", "csv", "txt", "rtf", and "tar" extensions. We dropped files whose names matched the regular expression "filelist.txt|raw.tar$|readme|csfasta|(big)?wig|bed(graph)?|(broad_)?lincs", because we did not expect to find P values in these files. Downloaded files were imported using the Python pandas package and searched for unadjusted P value sets. Unadjusted P value sets and the summarised expression levels of the associated genomic features were identified using column names. P value columns in imported tables were identified using the regular expression "p[^a-zA-Z]{0,4}val"; adjusted P value sets were identified using the regular expression "adj|fdr|corr|thresh" and omitted from further analysis. Expression levels of genomic features were identified using the following regular expressions: "basemean", "value", "fpkm", "logcpm", "rpkm", "aveexpr". When multiple expression level columns were present in a table, the average expression level was calculated for each feature.
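The file name and column name rules above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow's actual code: the regular expressions are quoted from the text, while the function names (`keep_file`, `classify_columns`, `mean_expression`), the case-insensitive matching, the tolerance for a trailing ".gz", and the exclusion of P value columns from expression columns (needed because "value" also matches "pvalue") are all assumptions made for the example.

```python
import re

# Patterns quoted from the text; case-insensitive matching is an assumption.
# Allowing a trailing ".gz" is also an assumption (compressed supplementary files).
KEEP_EXT = re.compile(r"\.(tab|xlsx|diff|tsv|xls|csv|txt|rtf|tar)(\.gz)?$", re.I)
DROP = re.compile(
    r"filelist.txt|raw.tar$|readme|csfasta|(big)?wig|bed(graph)?|(broad_)?lincs",
    re.I,
)
PVALUE = re.compile(r"p[^a-zA-Z]{0,4}val", re.I)
ADJUSTED = re.compile(r"adj|fdr|corr|thresh", re.I)
EXPRESSION = re.compile(r"basemean|value|fpkm|logcpm|rpkm|aveexpr", re.I)


def keep_file(name):
    """Return True if a supplementary file name passes both filters."""
    return bool(KEEP_EXT.search(name)) and not DROP.search(name)


def classify_columns(columns):
    """Split column names into unadjusted P value and expression level columns."""
    raw_p = [c for c in columns if PVALUE.search(c) and not ADJUSTED.search(c)]
    # Assumption: a column already recognised as a P value or adjusted
    # P value column is never treated as an expression level column.
    expr = [
        c
        for c in columns
        if EXPRESSION.search(c) and not PVALUE.search(c) and not ADJUSTED.search(c)
    ]
    return raw_p, expr


def mean_expression(row, expr_cols):
    """Average the expression level columns of one feature (one table row)."""
    return sum(row[c] for c in expr_cols) / len(expr_cols)
```

For example, a limma-style header `["logFC", "AveExpr", "t", "P.Value", "adj.P.Val", "B"]` yields `P.Value` as the only unadjusted P value column and `AveExpr` as the only expression level column, because `adj.P.Val` matches both the P value and the adjusted pattern and is therefore omitted.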
Identified raw P value sets were classified based on their histogram shape.
Histogram shape was determined based on the presence and location of peaks.
P value histogram peaks (bins) were detected using a quality control threshold described in [1], a Bonferroni-corrected
- To get started, download and install miniconda3 and create a conda environment with snakemake
- Go to https://docs.conda.io/en/latest/miniconda.html for miniconda3 download and installation instructions
- Create a conda environment with snakemake, essentially as described in https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
conda create -c bioconda -c conda-forge -n snakemake snakemake
- (Fork and) clone this repository
git clone https://github.com/rstats-tartu/geo-htseq.git
- cd to the working directory and activate the conda environment
cd geo-htseq
conda activate snakemake
- Dry-run the workflow
snakemake -n
- To run the workflow you need to set up:
- NCBI API key as NCBI_APIKEY environment variable
- Elsevier API key as ELSEVIER_GEOSEQ environment variable
- Run the workflow
snakemake --use-conda -j
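Before starting a long run, it can be useful to confirm that both API keys listed above are visible in the current shell. The following is a small, optional sketch of such a check; the helper name `missing_keys` is hypothetical and is not part of the workflow, and only the two variable names come from this README.

```python
import os

# Environment variable names as listed in this README.
REQUIRED_KEYS = ("NCBI_APIKEY", "ELSEVIER_GEOSEQ")


def missing_keys(env=os.environ):
    """Return the required variables that are unset or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All API keys are set.")
```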
- Taavi Päll taavi.pall@ut.ee
[1] Breheny, P., Stromberg, A., & Lambert, J. (2018). p-Value Histograms: Inference and Diagnostics. High-throughput, 7(3), 23. https://doi.org/10.3390/ht7030023
[2] Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genome-wide experiments. Proceedings of the National Academy of Sciences, 100(16), 9440–9445. https://www.pnas.org/content/100/16/9440