This repo contains all the transformation scripts and data for the Genome Nexus mongo database.
A mongo docker container with all the data already imported is available. To start both the web app and the database, use the docker compose file in the Genome Nexus repo itself.
Run the script scripts/import_mongo.sh. It will import files from export/:
./scripts/import_mongo.sh mongodb://127.0.0.1:27017/annotator # change accordingly
The Ensembl Biomart file is required by the PFAM endpoint. To download this file, follow these steps:
- Go to the Biomart page on the Ensembl website.
- Select Ensembl Genes from the Database dropdown menu.
- Select Human Genes from the Dataset dropdown menu.
- Click on Attributes, and select these ones: Gene stable ID, Transcript stable ID, Protein stable ID, Gene name, Pfam domain ID, Pfam domain start, Pfam domain end.
- Click on Results, and export all results to a TSV file.
- Copy over the downloaded file to replace pfamA.txt.
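Before replacing pfamA.txt, it may be worth confirming that the export contains the attributes selected above. The following is a minimal sketch, not part of this repo: `check_tsv_header` is a hypothetical helper, and it assumes that the column names in the exported header row match the attribute labels shown in the Biomart UI, which may differ between Ensembl releases.

```shell
# Hypothetical helper (not part of this repo): verify that the header row of a
# tab-separated file contains every expected column name.
# Usage: check_tsv_header FILE COLUMN [COLUMN...]
check_tsv_header() {
  cth_file="$1"
  shift
  cth_missing=0
  for cth_col in "$@"; do
    # Split the header row into one field per line, then look for an exact,
    # whole-line match of the expected column name.
    if ! head -n 1 "$cth_file" | tr -d '\r' | tr '\t' '\n' \
        | grep -Fx -- "$cth_col" >/dev/null; then
      echo "missing column: $cth_col" >&2
      cth_missing=1
    fi
  done
  # Return 0 if all columns were found, 1 otherwise.
  return "$cth_missing"
}
```

For example, `check_tsv_header pfamA.txt "Gene stable ID" "Transcript stable ID" "Pfam domain ID"` prints any column that is absent to stderr and returns a non-zero status if at least one is missing.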
- Go to the Biomart page (grch37.ensembl.org/biomart/martview) on the Ensembl website.
- Select Ensembl Genes from the Database dropdown menu.
- Select Human Genes from the Dataset dropdown menu.
- Click on Attributes, and select these ones: Gene stable ID, Transcript stable ID, HGNC Symbol, HGNC ID.
- Click on Results, and export all results to a TSV file.
- Copy over the downloaded file to replace ensembl_biomart_geneids_grch37.p13.txt.
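As a quick sanity check on the gene ID export, one could count how many rows have no value in a given column, for instance the HGNC ID. This is only a sketch under assumptions: `count_empty_in_column` is a hypothetical helper that is not part of this repo, and the column is looked up by its header name, so the name passed in must match whatever header Biomart actually wrote.

```shell
# Hypothetical helper (not part of this repo): count data rows whose value in
# the named column is empty. The column is located by its header name.
# Usage: count_empty_in_column FILE COLUMN_NAME
count_empty_in_column() {
  awk -F '\t' -v name="$2" '
    { sub(/\r$/, "") }                 # tolerate Windows line endings
    NR == 1 {
      # Find the index of the requested column in the header row.
      for (i = 1; i <= NF; i++) if ($i == name) idx = i
      if (!idx) { print "column not found: " name > "/dev/stderr"; exit 2 }
      next
    }
    $idx == "" { n++ }                 # empty or missing field in this row
    END { if (idx) print n + 0 }
  ' "$1"
}
```

Calling `count_empty_in_column ensembl_biomart_geneids_grch37.p13.txt "HGNC ID"` prints the number of rows without an HGNC ID, or exits with status 2 if no column with that name exists.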
Some data is downloaded automatically through curl commands; see data/Makefile. Change those commands if you want to change the data.
cd data
make all # takes about 30m from scratch
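To see which downloads the Makefile performs before editing it, the curl invocations can be listed with their line numbers. The sketch below uses a hypothetical helper, `list_curl_lines`, which is not part of this repo and simply matches the text "curl" on each line.

```shell
# Hypothetical helper (not part of this repo): print every line of a Makefile
# that mentions curl, prefixed with its line number.
# Usage: list_curl_lines MAKEFILE
list_curl_lines() {
  grep -n 'curl' "$1"
}
```

From the repo root this would be `list_curl_lines data/Makefile`. Running `make -n all` inside data/ is another option: it prints the commands make would run without executing them.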