lifebit-ai/dry-bench-skills-for-researchers

How do we pull data from NCBI GEO?

Opened this issue · 4 comments

Is it similar to using the wget function from Zenodo?

cgpu commented

@jjbivona You can use wget if you know exactly which dataset you are interested in. Here is an example:

Let's say you want to access data from this project, that you found by navigating to NCBI.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68849

On this page you can find the FTP links (they work much like HTTP links).
Here's the wget command I use to retrieve a file from this NCBI GEO dataset:

# in the terminal window type:
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE68nnn/GSE68849/suppl/GSE68849_non-normalized.txt.gz
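The directory in that link seems to follow a pattern: the folder under series/ is the accession with its last three digits replaced by "nnn". Here is a minimal sketch that builds the link from an accession and a file name, assuming that pattern holds for the series you want and that the accession has at least three digits. geo_suppl_url is a hypothetical helper name, not part of wget or any NCBI tool, and you still need to look up the exact file name on the GEO page.

```shell
# Hypothetical helper: build the GEO FTP link for a supplementary file.
# Assumes the layout seen in the link above:
#   series/<stub>nnn/<accession>/suppl/<file>
geo_suppl_url() {
  acc="$1"    # e.g. GSE68849
  file="$2"   # e.g. GSE68849_non-normalized.txt.gz
  stub="${acc%???}nnn"   # drop the last three characters, append "nnn"
  printf 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/suppl/%s\n' "$stub" "$acc" "$file"
}

# Print the link for the example dataset
geo_suppl_url GSE68849 GSE68849_non-normalized.txt.gz

# To download it:
# wget "$(geo_suppl_url GSE68849 GSE68849_non-normalized.txt.gz)"
```

For the example above this prints the same link as the wget command.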

We can also do this from R; take a look at this forum answer:
https://www.biostars.org/p/335682/

I will make a note to add this to the wiki of the repository here, https://github.com/lifebit-ai/dry-bench-skills-for-researchers/wiki. Thanks for pinging 👍, this will be valuable for more people.

Thank you!!

I was able to do it using the wget function. I also tried using R, but the GEOquery package isn't updated to work with R 3.6. Are there ways around this? It seems like a useful function to quickly pull the data and keep everything within R.

I also noticed that the file from GEO was a .gz file. I tried to get the first couple of lines using
head GSE68849_non-normalized.txt.gz
But nothing happens. I'm guessing it needs to be converted to .csv

cgpu commented

The .gz denotes that the file is compressed @jjbivona.

To decompress the retrieved .gz file:

# In the command line
gunzip GSE68849_non-normalized.txt.gz

After that you can read it as is into R. There is no need to convert it to csv, since the function data.table::fread() handles most delimited text formats (tsv, txt).

# In R
results <- data.table::fread(file = "GSE68849_non-normalized.txt")
head(results)
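If you only want to peek at the first few lines, as you tried with head, you can also stream the compressed file instead of unpacking it on disk. A minimal sketch, assuming gunzip is available in your terminal; peek_gz is a hypothetical helper name used here for illustration only.

```shell
# Hypothetical helper: print the first lines of a gzip-compressed text file
# without unpacking it on disk.
# gunzip -c writes the decompressed data to stdout and leaves the .gz file as is.
peek_gz() {
  gunzip -c "$1" | head -n "${2:-5}"   # second argument: number of lines, default 5
}

# Usage, once the download has finished:
# peek_gz GSE68849_non-normalized.txt.gz 3
```

Running head directly on the .gz file reads the compressed bytes, which is why it does not show readable text.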

GEOquery installation

To install this Bioconductor library, follow the instructions on the package page and copy the installation command:

https://bioconductor.org/packages/release/bioc/html/GEOquery.html

BiocManager::install("GEOquery", update = FALSE)

This worked for me in Lifebit CloudOS.

Works now! Thank you.

In the future, if I get an out-of-date package from BiocManager, will update = FALSE solve the problem?