Genome search backend microservice for Ensembl 2020
This repo manages the following endpoints:
- /api/genome_search/
- /api/alternative_assemblies/
- /api/popular_genomes/
- /api/genome/info/
- /api/genome/track_categories/
- /api/ensembl_object/
- /api/ensembl_object/track_list/
Python 3.7
Flask==1.0.2
Flask-Cors==3.0.7
Flask-RESTful==0.3.7
gunicorn==19.9.0
Jinja2==2.10
json-tools==0.4.1
jsonify==0.5
PyYAML==3.13
requests==2.21.0
urllib3==1.24.1
See requirements.txt for the full list. Run the following command to install all requirements:
pip3 install --no-cache-dir -r requirements.txt
Data generation has two steps:
- Create Genome store file
- Index Genome store file
The locations of the genome store file and the index file are set in the configuration file.
The genome store file holds all the required data for the genomes used on the Ensembl website. It uses the Ensembl Metadata Registry as its primary data source. The genome store can be populated by fetching data by division, by fetching individual genomes, or by loading from a file. If a genome store file already exists when data is fetched, the genome ids in the existing store are checked and entries are created or updated accordingly.
Examples:
Fetch individual genomes:
python dump_species.py --fetch_by_genome Homo_sapiens Triticum_aestivum Caenorhabditis_elegans
Fetch by division:
python dump_species.py --fetch_by_division EnsemblVertebrates EnsemblMetazoa EnsemblPlants EnsemblFungi EnsemblProtists EnsemblBacteria
Load from file (useful when data is not present in the metadata registry, for example GRCh37 data):
python dump_species.py --create_from_file /usr/src/genome-search/configs/grch37.json
If you created or updated only a few genomes in an existing genome store, you may want to know the genome store keys of those genomes. Use --return_genome_store_ids in such cases. This is useful when you want to index only the newly created or updated genomes.
Example:
python dump_species.py --fetch_by_genome Homo_sapiens Triticum_aestivum Caenorhabditis_elegans --return_genome_store_ids
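The create-or-update behaviour described above can be pictured with a small Python sketch. This is not the code in dump_species.py: the function name upsert_genomes, the "genome_id" field, and the use of stringified integers as genome store keys are all assumptions made for illustration.

```python
def upsert_genomes(genome_store, fetched_genomes):
    """Merge fetched genomes into the store and return the keys that were touched.

    Assumed layout (hypothetical): genome_store maps a string key such as "1"
    to a genome dict that carries a "genome_id" field.
    """
    # Look up existing entries by genome id so repeated fetches update in place.
    key_by_genome_id = {g["genome_id"]: k for k, g in genome_store.items()}
    next_key = max((int(k) for k in genome_store), default=0) + 1

    touched_keys = []
    for genome in fetched_genomes:
        key = key_by_genome_id.get(genome["genome_id"])
        if key is None:
            # Genome not seen before: create a new store entry.
            key = str(next_key)
            next_key += 1
            key_by_genome_id[genome["genome_id"]] = key
        # Create or overwrite the entry under the chosen key.
        genome_store[key] = genome
        touched_keys.append(key)
    return touched_keys
```

In this sketch the returned list plays the role of the keys reported by --return_genome_store_ids: it names only the entries that were created or updated in this run.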
You can index either the whole genome store at once or only a few genomes from it.
At the moment, only the common name, scientific name and assembly name of a genome are indexed. This can be changed in the configuration YAML file.
Indexing is done by using edge n-gram tokens with a minimum token length of 3.
Run the indexer as follows. To index the whole genome store:
python index_species.py
To index only a few genome store entries:
python index_species.py --index_genome_store_ids <ID_LIST>
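To illustrate what edge n-gram indexing with a minimum token length of 3 means, here is a minimal Python sketch. It is not the implementation in index_species.py: the function names, the field names ("common_name", "scientific_name", "assembly_name"), the whitespace tokenisation, and the dict-of-sets index layout are assumptions chosen for the example.

```python
MIN_TOKEN_LENGTH = 3  # minimum edge n-gram length stated in this README


def edge_ngrams(text, min_length=MIN_TOKEN_LENGTH):
    """Return the lowercase prefixes of each word, from min_length up to the full word."""
    tokens = []
    for word in text.lower().split():
        # Words shorter than min_length produce no tokens.
        for end in range(min_length, len(word) + 1):
            tokens.append(word[:end])
    return tokens


def build_index(genome_store,
                fields=("common_name", "scientific_name", "assembly_name")):
    """Map each edge n-gram token to the set of genome store keys that contain it.

    The field names are hypothetical; the real ones come from the configuration file.
    """
    index = {}
    for key, genome in genome_store.items():
        for field in fields:
            value = genome.get(field)
            if not value:
                continue
            for token in edge_ngrams(value):
                index.setdefault(token, set()).add(key)
    return index
```

With this layout, a query string of three or more characters matches every genome whose indexed fields contain a word starting with that string, which is what makes edge n-grams suitable for search-as-you-type.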
The genome store file and the index file must be generated before starting the server, as they serve as the primary data source for the application.
python app.py
gunicorn app:app --workers 5 --preload
All tests are located in the tests directory and can be run with the following command:
python -m unittest discover -s tests