Detailed steps needed to have a local development version of BioGPS, for dataset loading
Make sure you use git for version control (biogps_dataset was migrated to GitHub in May 2016):
git clone https://github.com/JTFouquier/biogps_dataset.git
virtualenv biogps
source biogps/bin/activate
pip install -r requirements.txt
The dataset database is much too large to install on your computer for local development, so you need to request a connection to our dev db server.
The settings_dev file is a "secret file." Please ask Chunlei or the BioGPS project manager for it.
python3 manage.py runserver_plus --settings=biogps_dataset.settings_dev
Install Elasticsearch using these directions:
Elasticsearch is a search server based on Lucene.
It is a full-text search engine with an HTTP web interface and schema-free JSON documents.
Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
From within the elasticsearch folder that you set up, run:
./bin/elasticsearch
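Once Elasticsearch has been started, a quick check like the one below can confirm that it is answering over HTTP. This is only a sketch: it assumes Elasticsearch is listening on its default address, http://localhost:9200, and the function name elasticsearch_is_up is made up for this example.

```python
import json
from urllib.request import urlopen


def elasticsearch_is_up(base_url="http://localhost:9200", timeout=2):
    """Return True if an Elasticsearch node answers at base_url.

    Assumption: the node uses the default HTTP port 9200. The root
    endpoint of a running node returns a JSON document that includes
    a "version" entry.
    """
    try:
        with urlopen(base_url, timeout=timeout) as resp:
            info = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return False
    return isinstance(info, dict) and "version" in info


print(elasticsearch_is_up())
```

If this prints False, check the terminal where ./bin/elasticsearch is running before loading or indexing anything.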
You will need to get an info sheet, a factors sheet and an RNAseq data/matrix file from a scientist.
If the data file identifies genes by gene symbol, then you must run reporter_to_entrezgene.py, which will use mygene.info to replace the gene symbols with Entrezgene IDs.
Entrezgene IDs are absolutely necessary for data display on Biogps.org. For microarray datasets, however, keep the probe set reporters.
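The replacement step can be pictured roughly as follows. This is not the code of reporter_to_entrezgene.py, only a minimal sketch of the idea. It assumes a tab-delimited matrix with the gene symbol in the first column and a ready-made symbol-to-ID dictionary; the function name and the tiny example matrix are hypothetical.

```python
import csv
import io


def replace_symbols(rows, symbol_to_entrez):
    """Swap the gene symbol in column 0 of each row for its Entrezgene ID.

    Rows whose symbol has no known ID are reported separately so they
    can be reviewed instead of disappearing silently.
    """
    converted, unmapped = [], []
    for row in rows:
        symbol, values = row[0], row[1:]
        entrez_id = symbol_to_entrez.get(symbol)
        if entrez_id is None:
            unmapped.append(symbol)
        else:
            converted.append([str(entrez_id)] + values)
    return converted, unmapped


# Hypothetical two-sample matrix; real files come from the scientist.
matrix_text = "Cdk2\t10.5\t11.2\nNotAGene\t0.0\t0.1\n"
rows = list(csv.reader(io.StringIO(matrix_text), delimiter="\t"))

converted, unmapped = replace_symbols(rows, {"Cdk2": 12566})
print(converted)
print(unmapped)
```

Keeping the unmapped symbols in their own list makes it easy to see which genes will be missing from the BioGPS display.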
- load_ds command will load remote datasets (microarray data from ArrayExpress) to the remote server for dev or prod.
- load_ds_local command will load local datasets to the remote server for dev or prod (written for RNAseq).
Run a command like this using Django's manage.py, where "load_ds_local" can be replaced by another command:
python3 manage.py load_ds_local --settings=biogps_dataset.settings_dev
Then you must use the es_index command to "index the data"; after that, the newly loaded dataset should appear in the chart file:
python3 manage.py es_index --settings=biogps_dataset.settings_dev
Use the "-c" argument if you want to clear previous indexing:
python3 manage.py es_index -c --settings=biogps_dataset.settings_dev
Output looks something like this:
added 16 platform, added 5914 dataset
You must sometimes restart the local server and the server that hosts the database, as well as Elasticsearch.
For help:
python3 manage.py load_ds --help --settings=biogps_dataset.settings_dev
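All of the management commands above share the same shape: a command name, optional arguments, and the --settings flag. If you run them often, a small wrapper like this can save typing. It is a convenience sketch only; the helper names manage_args and run_manage are invented here and are not part of the repository.

```python
import subprocess

SETTINGS = "biogps_dataset.settings_dev"


def manage_args(command, *extra):
    """Build the argument list for one manage.py command."""
    return ["python3", "manage.py", command, *extra, "--settings=" + SETTINGS]


def run_manage(command, *extra):
    """Run the command from the project folder and stop on failure."""
    subprocess.run(manage_args(command, *extra), check=True)


# Show what would be executed, without running anything.
print(" ".join(manage_args("load_ds_local")))
print(" ".join(manage_args("es_index", "-c")))
```

Calling run_manage("es_index", "-c") would then be equivalent to typing the full es_index line by hand.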
If you don't know what a model is, then read about Django!
dataset:
- Model with information about a certain dataset, including metadata.
dataset_matrix:
- The dataset matrix that contains the entire dataset from the RNA seq run. This means you likely do not want to display an instance of this model all at once!
dataset_data:
- One reporter gene and all of its expression information for all samples.
Dataset Platform:
- We created a new platform since we are now loading a sequencing (not microarray) dataset. This is a sequencing platform, so it does not have to be recreated every time.
Example input information:
- RNA seq
- reporters empty list
- name = "generic RNA seq platform for mouse"
- species = mouse
URLs from mygene.info used to get the Entrezgene IDs from gene symbols (from reporter_to_entrezgene):
http://mygene.info/v2/query?q=symbol:CDK2
http://mygene.info/v2/query?q=symbol:0610005C13Rik
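The lookup behind these URLs can be sketched like this. It is an illustration, not the code in reporter_to_entrezgene.py. It assumes the query response is a JSON object with a "hits" list whose entries may carry an "entrezgene" field; the function names are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MYGENE_QUERY_URL = "http://mygene.info/v2/query"


def symbol_query_url(symbol):
    """Build the query URL for one gene symbol (':' is left unescaped)."""
    return MYGENE_QUERY_URL + "?" + urlencode({"q": "symbol:" + symbol}, safe=":")


def first_entrezgene(response):
    """Return the Entrezgene ID of the first hit that has one, else None."""
    for hit in response.get("hits", []):
        if "entrezgene" in hit:
            return int(hit["entrezgene"])
    return None


def lookup(symbol):
    """Query mygene.info over the network (not called in this example)."""
    with urlopen(symbol_query_url(symbol), timeout=10) as resp:
        return first_entrezgene(json.load(resp))


print(symbol_query_url("CDK2"))
# Hand-written response in the assumed shape, so no network is needed.
print(first_entrezgene({"hits": [{"symbol": "CDK2", "entrezgene": 1017}]}))
```

A symbol can match genes in more than one species, so taking the first hit is a simplification; check the species of each hit when it matters.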
python3 manage.py shell_plus --settings=biogps_dataset.settings_dev
This returns the dataset object which is the foreign key for dataset data and dataset matrix:
ds = BiogpsDataset.objects.get(geo_gse_id="BDS_00015")
This returns all the metadata (from info sheet and factors):
ds.__dict__
On BioGPS, the entry selected in the "probeset" dropdown menu is also considered the reporter gene.
Go to the URL for the specific gene and dataset (the dataset is identified by its primary key or by its geo_gse_id). The geo_gse_id is also important: for a new dataset it will be BDS_XXXXX, using the next number in the sequence.
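Picking the next BDS number can be sketched as below. This assumes the IDs are "BDS_" followed by five zero-padded digits, as in BDS_00015, and that you already have the list of existing geo_gse_id values (for example from shell_plus). The helper name next_bds_id is hypothetical.

```python
import re

# Assumed ID format: "BDS_" plus exactly five digits.
BDS_PATTERN = re.compile(r"^BDS_(\d{5})$")


def next_bds_id(existing_ids):
    """Return the next free BDS_XXXXX ID given the IDs already in use."""
    matches = (BDS_PATTERN.match(value) for value in existing_ids)
    numbers = [int(m.group(1)) for m in matches if m]
    # IDs that are not BDS IDs (e.g. ArrayExpress accessions) are ignored.
    return "BDS_{:05d}".format(max(numbers, default=0) + 1)


print(next_bds_id(["BDS_00001", "BDS_00015", "E-GEOD-16054"]))
```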
Example dataset viewing URLs:
http://localhost:8000/static/data_chart.html?gene=67669&dataset=10044
http://localhost:8000/static/data_chart.html?gene=12566&dataset=BDS_00015
http://localhost:8000/static/data_chart.html?gene=100152011&dataset=10078
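The viewing URLs above all follow one pattern, so they can be generated instead of typed. A minimal sketch, assuming the local server runs on http://localhost:8000 as in the examples; chart_url is a name invented for this snippet.

```python
from urllib.parse import urlencode


def chart_url(gene, dataset, base="http://localhost:8000"):
    """Build a data_chart.html URL for an Entrezgene ID and a dataset.

    dataset may be the primary key (e.g. 10044) or the geo_gse_id
    (e.g. "BDS_00015").
    """
    query = urlencode({"gene": gene, "dataset": dataset})
    return "{}/static/data_chart.html?{}".format(base, query)


print(chart_url(12566, "BDS_00015"))
```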
Example admin:
http://localhost:8000/admin/dataset/biogpsdataset/2427/
The standard test gene is 1017, which is a human gene! So if you are using a mouse dataset, it will understandably be missing:
CDK2 cyclin-dependent kinase 2, Homo sapiens (human) Gene ID: 1017, updated on 6-Mar-2016
Cdk2 cyclin-dependent kinase 2, Mus musculus (house mouse) Gene ID: 12566, updated on 6-Mar-2016
You can also check the "fixed reporters" data file to see which Entrezgene IDs are actually in your dataset for viewing.
http://localhost:8000/dataset/full-data/geo_gse_id%20test/gene/12566/
http://localhost:8000/dataset/full-data/E-GEOD-16054/gene/1017/
http://localhost:8000/dataset/full-data/BDS_00001/gene/1017/
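Note that a dataset ID containing a space has to be percent-encoded in these URLs ("geo_gse_id test" becomes "geo_gse_id%20test"). A small sketch of building the full-data URL with that encoding applied, again assuming the local server at http://localhost:8000; the function name is hypothetical.

```python
from urllib.parse import quote


def full_data_url(dataset_id, gene, base="http://localhost:8000"):
    """Build the full-data URL for one dataset ID and Entrezgene ID."""
    # quote() percent-encodes spaces and other characters that are unsafe
    # in a URL path; safe="" makes it encode "/" as well.
    encoded_id = quote(str(dataset_id), safe="")
    return "{}/dataset/full-data/{}/gene/{}/".format(base, encoded_id, gene)


print(full_data_url("geo_gse_id test", 12566))
print(full_data_url("E-GEOD-16054", 1017))
```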
To group samples into meaningful groups, change the color_idx values in the JSON metadata (e.g., admin/dataset/biogpsdataset/2509/) accordingly. This is done manually because of the many possible variations in sample grouping.
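Although the grouping is entered by hand, working out which index each sample should get can be scripted first. The sketch below only computes a sample-to-index mapping. It does not read or write the real metadata, whose exact JSON layout is not described here. Starting the indices at 0 is an assumption, and the sample names and groups are made up.

```python
def assign_color_idx(sample_groups):
    """Give every sample in the same group the same color index.

    sample_groups is a list of (sample_name, group_label) pairs in
    display order. Indices are handed out in order of first appearance,
    starting at 0 (an assumption; adjust to what the chart expects).
    """
    group_to_idx = {}
    result = {}
    for sample, group in sample_groups:
        # A group seen for the first time gets the next unused index.
        idx = group_to_idx.setdefault(group, len(group_to_idx))
        result[sample] = idx
    return result


# Hypothetical samples grouped by tissue.
samples = [("s1", "liver"), ("s2", "liver"), ("s3", "brain")]
print(assign_color_idx(samples))
```

The resulting values can then be typed into the color_idx fields in the admin page.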