How do we pull data from NCBI GEO?
Is it similar to using the wget function from Zenodo?
@jjbivona Yes, you can use wget here too if you know exactly which dataset you are interested in. Here is an example:
Let's say you want to access data from this project, which you found by browsing NCBI GEO:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68849
On this page you can find the FTP links (they work much like HTTP links).
Here is the wget command I use to retrieve a file from this NCBI GEO dataset:
# in the terminal window type:
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE68nnn/GSE68849/suppl/GSE68849_non-normalized.txt.gz
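The series directory in that URL (GSE68nnn) appears to be the accession with its last three digits replaced by "nnn". Here is a minimal sketch that builds the supplementary-files folder URL from an accession, assuming that convention holds for other series. The name geo_series_url is a hypothetical helper, not part of any GEO tool, and the file names inside the folder still have to be looked up on the dataset page.

```shell
# Hypothetical helper: print the FTP "suppl" folder URL for a GEO series accession.
# Assumption: the parent directory is the accession with its last three digits
# replaced by "nnn" (GSE68849 -> GSE68nnn; accessions below 1000 -> GSEnnn).
geo_series_url() {
  acc="$1"
  num="${acc#GSE}"            # numeric part, e.g. 68849
  if [ "${#num}" -gt 3 ]; then
    stem="${num%???}"         # drop the last three digits, e.g. 68
  else
    stem=""                   # short accessions have no leading digits left
  fi
  printf 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE%snnn/%s/suppl/\n' "$stem" "$acc"
}

geo_series_url GSE68849
```

For GSE68849 this prints the folder that contains the GSE68849_non-normalized.txt.gz file used above.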
We can also do this from R; take a look at this forum answer:
https://www.biostars.org/p/335682/
I will make a note to add this to the wiki of the repository, https://github.com/lifebit-ai/dry-bench-skills-for-researchers/wiki. Thanks for pinging 👍, this will be valuable for more people.
Thank you!!
I was able to do it using wget. I also tried using R, but the GEOquery package isn't updated to work with R 3.6. Are there ways around this? It seems like a useful function for quickly pulling the data and keeping everything within R.
I also noticed that the file from GEO was a .gz file. I tried to get the first couple of lines using
head GSE68849_non-normalized.txt.gz
But nothing happens. I'm guessing it needs to be converted to .csv
@jjbivona The .gz extension denotes that the file is gzip-compressed.
To decompress the retrieved .gz file:
# In the command line
gunzip GSE68849_non-normalized.txt.gz
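If you only want a quick look at the first lines, as with the head attempt above, you do not have to decompress the file on disk first. A minimal sketch, where peek_gz is a hypothetical helper name: it streams the decompressed content to head and leaves the .gz file untouched.

```shell
# Hypothetical helper: show the first lines of a gzip-compressed text file
# without writing a decompressed copy to disk.
# $1 = path to the .gz file, $2 = number of lines (defaults to 5)
peek_gz() {
  gzip -cd -- "$1" | head -n "${2:-5}"
}

# Example call, assuming the file from the wget step is in the current directory:
# peek_gz GSE68849_non-normalized.txt.gz 3
```

Running plain head on the .gz file reads the compressed bytes, which is why it does not show readable text.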
After that you can read it into R as is. There is no need to convert it to csv, because data.table::fread() handles most delimited text formats (tsv, txt, csv).
# In R
results <- data.table::fread(file = "GSE68849_non-normalized.txt")
head(results)
GEOquery installation
To install this Bioconductor package, follow the instructions on its page and copy the installation command:
https://bioconductor.org/packages/release/bioc/html/GEOquery.html
# In R (update = FALSE leaves already-installed packages as they are)
BiocManager::install("GEOquery", update = FALSE)
This worked for me in Lifebit CloudOS.
Works now! Thank you.
In the future, if I get an out-of-date package from BiocManager, will update = FALSE solve the problem?