Database Management for the pipeline.
grabear opened this issue · 8 comments
- Database Management will first be developed in `Manager/database_management.py`.
- Another module for creating template BioSQL databases will be developed in `Manager/BioSQL/biosql.py`.
- It will help keep the following databases updated:
  - ETE3's NCBI-taxonomy database
  - Local NCBI databases
    - blast database (`/blast/db`) NEW (May 2019)
    - GenBank flat files from NCBI's RefSeq release (BioSQL) (`/refseq/release/<collection_subset>`)
- [ ] gi lists, OR should we convert this to accession.version (e.g. for vertebrate_mammalian)?
@grabear status? Closeable? lol
Not yet lol. @sdhutchins
Update on this issue.
Scope
- Manager/BioSQL/biosql.py
- Manager/database_management.py
- Manager/database_dispatcher.py
- Manager/utils.py
- Manager/config/yml/database_config.yml
- Manager/db_mana_test.py
Tested Functionality
The following checked items have been tested by changing the parameters in the config file.
- Logging is looking amazing.
- YAML config file format (database_config.yml)
Configuration
- Dispatching database management tasks (database_dispatcher.py via database_management.py)
- BioSQL creation (biosql.py)
  - SQLite
    - BioSQL template creation with schema and NCBI taxonomy
  - MySQL
    - BioSQL template creation with schema and NCBI taxonomy
- NCBI tasks
  - blast downloading
    - db downloading
    - windowmasker files downloading
  - pub taxonomy downloading
  - refseq release
    - downloading
    - uploading to BioSQL
- ITIS downloads
- Configuration on a per project basis
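As a rough illustration of the SQLite template creation listed above, here is a minimal sketch. This is not the real `biosql.py`: the function name and the tiny schema stand-in are assumptions (the actual BioSQL schema ships as `biosqldb-sqlite.sql` in the BioSQL distribution and would be read from disk instead).

```python
import sqlite3
from pathlib import Path

# Stand-in for the real BioSQL schema script (biosqldb-sqlite.sql);
# only one table is shown here to keep the sketch self-contained.
SCHEMA_SQL = """
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
"""


def create_template_db(db_path, schema_sql=SCHEMA_SQL):
    """Create a template SQLite database by executing a schema script."""
    conn = sqlite3.connect(str(db_path))
    try:
        conn.executescript(schema_sql)
        conn.commit()
    finally:
        conn.close()
    return Path(db_path)
```

In the real workflow the template would then be populated with the NCBI taxonomy before being copied per project.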
Archiving
Bugs still need to be fixed in the file movement and deletion steps that run after archiving.

Deletion
Not tested.
Config File Explanation and Preview
The config file is loaded into Python as a nested dictionary. Top-level key-value pairs such as:

```yaml
email: "rgilmore@umc.edu"
driver: "sqlite3"
```

are used to set parameters on the BaseDatabaseManagement class.
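For concreteness, loading such a file (assuming PyYAML, which the project's YAML configs imply) yields the nested dictionary described above. The snippet below only mirrors the top two keys of the preview:

```python
import yaml  # PyYAML; assumed to be the loader used here

# Minimal snippet mirroring the top of database_config.yml.
snippet = """
Database_config:
  email: "rgilmore@umc.edu"
  driver: "sqlite3"
"""

config = yaml.safe_load(snippet)["Database_config"]
# config is now a plain nested dict, e.g. config["driver"] == "sqlite3"
```

Note that the quoted `"!!python/object/apply:pathlib.Path [...]"` values in the preview load as plain strings under `safe_load`; turning them into `pathlib.Path` objects requires extra handling.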
The various strategies for dispatching tasks are dictionary keys, and include the following:
['Full', 'Projects', 'NCBI', 'NCBI_blast', 'NCBI_blast_db', 'NCBI_blast_windowmasker_files', 'NCBI_pub_taxonomy', 'NCBI_refseq_release', 'ITIS', 'ITIS_taxonomy']
Some keys are nested in the config file. The concept to note is that top-level keys (strategies) carry flags that control their sub-level strategies. So if the configure_flag for 'Full' is True, then the configure_flag for 'Projects', 'NCBI', 'NCBI_blast', 'NCBI_blast_db', 'NCBI_blast_windowmasker_files', 'NCBI_pub_taxonomy', 'NCBI_refseq_release', 'ITIS', and 'ITIS_taxonomy' will also be interpreted as True when the database functions are dispatched.
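The flag-cascade rule just described can be sketched as follows. The strategy names come from the config preview below; the helper function itself is an assumption, not the actual dispatcher code:

```python
# Strategy keys as listed in database_config.yml.
STRATEGIES = ['Projects', 'NCBI', 'NCBI_blast', 'NCBI_blast_db',
              'NCBI_blast_windowmasker_files', 'NCBI_pub_taxonomy',
              'NCBI_refseq_release', 'ITIS', 'ITIS_taxonomy']


def effective_flag(config, strategy, flag='configure_flag'):
    """A strategy's flag counts as True if it is set directly,
    or if the top-level 'Full' strategy sets it (the cascade rule)."""
    if config.get('Full', {}).get(flag, False):
        return True
    return config.get(strategy, {}).get(flag, False)
```

For example, with `{'Full': {'configure_flag': True}}`, every strategy in `STRATEGIES` is treated as configured even if its own flag is False.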
Below is a preview of the entire database_config.yml file to illustrate the statements above.
```yaml
Database_config:
  email: "rgilmore@umc.edu"
  driver: "sqlite3"
  Full:
    configure_flag: False
    archive_flag: False
    delete_flag: False
    project_flag: False
    _path: "!!python/object/apply:pathlib.Path ['']"
  Projects:
    Project_Name_1:
      configure_flag: True
      archive_flag: False
      delete_flag: False
      _path: "!!python/object/apply:pathlib.Path ['Project_Name_1']"
    Project_Name_2:
      configure_flag: True
      archive_flag: False
      delete_flag: False
      _path: "!!python/object/apply:pathlib.Path ['Project_Name_2']"
    Project_Name_3:
      configure_flag: True
      archive_flag: False
      delete_flag: False
      _path: "!!python/object/apply:pathlib.Path ['Project_Name_3']"
  NCBI:
    configure_flag: False
    archive_flag: False
    delete_flag: False
    _path: "!!python/object/apply:pathlib.Path ['NCBI']"
  NCBI_blast:
    configure_flag: False
    archive_flag: False
    delete_flag: False
    _path: "!!python/object/apply:pathlib.Path ['NCBI', 'blast']"
  NCBI_blast_db:
    configure_flag: False
    archive_flag: False
    delete_flag: False
    _path: "!!python/object/apply:pathlib.Path ['NCBI', 'blast', 'db']"
  NCBI_blast_windowmasker_files:
    configure_flag: False
    archive_flag: False
    delete_flag: False
    _path: "!!python/object/apply:pathlib.Path ['NCBI', 'blast', 'windowmasker_files']"
    taxonomy_ids: ""
  NCBI_pub_taxonomy:
    configure_flag: True
    archive_flag: False
    delete_flag: False
    _path: "!!python/object/apply:pathlib.Path ['NCBI', 'pub', 'taxonomy']"
  NCBI_refseq_release:
    seqtype: "rna"  # Other seqtypes are protein and genomic
    seqformat: "gbff"
    collection_subset: "vertebrate_mammalian"
    configure_flag: False
    archive_flag: False
    delete_flag: False
    upload_flag: False
    _path: "!!python/object/apply:pathlib.Path ['NCBI', 'refseq', 'release']"
    upload_list: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  ITIS:
    configure_flag: True
    archive_flag: False
    delete_flag: False
    _path: "!!python/object/apply:pathlib.Path ['ITIS']"
  ITIS_taxonomy:
    configure_flag: True
    archive_flag: False
    delete_flag: False
    _path: "!!python/object/apply:pathlib.Path ['ITIS', 'taxonomy']"
```
Current ToDo List:
- Resolve dependency issues (sqlalchemy, tqdm, luigi, sciluigi, others?)
- Make sure ete3 taxdump files go to the proper database folder
  - comparative_genetics, around line 240: `# Load taxon ids from a local NCBI taxon database via ete3` / `ncbi = NCBITaxa()`
- Create a class for managing the sqlite3 database, including duplicates and missing values
- Resolve None/NaN values in accession data
- Make sure the database_dispatcher creates the proper sub-directory for NCBI_refseq_release (e.g. vertebrate_mammalian)
- Fix the NCBI_refseq_release functionality in OrthoEvol/Manager/database_management.py (lines 505 to 543 at a487971)
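The sub-directory creation item above can be sketched with pathlib. The directory layout follows the `_path` entries in the config preview plus the `collection_subset`; the helper name is hypothetical:

```python
from pathlib import Path


def refseq_release_path(base, collection_subset):
    """Build (and create if missing) the refseq release sub-directory,
    e.g. <base>/NCBI/refseq/release/vertebrate_mammalian.

    Hypothetical helper; the layout mirrors the _path entries in
    database_config.yml combined with collection_subset.
    """
    path = Path(base) / "NCBI" / "refseq" / "release" / collection_subset
    path.mkdir(parents=True, exist_ok=True)
    return path
```

`mkdir(parents=True, exist_ok=True)` makes the call idempotent, so the dispatcher can run it safely on every invocation.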
One question as I test this out, @grabear:
- How would I use an existing database (refseq)? Or do we need an if statement that compares the size of the current refseq path (if it exists) to the FTP file path?
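One way to sketch the size comparison the question proposes, assuming a coarse size-based freshness check against the FTP server (the helper name and the placeholder remote path are assumptions, not project code):

```python
import ftplib
from pathlib import Path


def needs_download(local_file, remote_size):
    """Return True if the local copy is missing or its size differs
    from the size reported by the FTP server (a coarse freshness check)."""
    local = Path(local_file)
    return (not local.exists()) or local.stat().st_size != remote_size


if __name__ == "__main__":
    # Hypothetical usage; "<some_file>" is a placeholder for a real file
    # under the refseq release directory on NCBI's FTP site.
    REMOTE_FILE = "/refseq/release/vertebrate_mammalian/<some_file>.gbff.gz"
    with ftplib.FTP("ftp.ncbi.nlm.nih.gov") as ftp:
        ftp.login()
        ftp.voidcmd("TYPE I")  # SIZE is only defined in binary mode
        remote_size = ftp.size(REMOTE_FILE)
        print(needs_download(Path(REMOTE_FILE).name, remote_size))
```

A size match does not guarantee identical content, so a stronger check would compare checksums when NCBI publishes them.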
@sdhutchins
Sorry I missed this...
I don't quite understand your question though. Are you asking how do we know if our data is up to date?
Do you still need help with this?
Things to do:
- New blast database
- Make sure that ete3's `NCBITaxa()` call is using the file we manually download via `DatabaseManagement`:
  `NCBITaxa(taxdump_file="out_path/taxdump.tar.gz")`
- Move the `Template-BioSQL-SQLite.db` to the top level of repositories
- Consider moving the `refseq_release` databases as well. If implemented, we could copy/paste them for SQLite, and then delete them after the pipeline uses them.
- Try to work on MySQL or PG
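The copy/paste-then-delete idea for the SQLite template could look like this minimal sketch (the function name is an assumption; the caller removes the copy once the pipeline run finishes):

```python
import shutil
from pathlib import Path


def checkout_template(template_path, workdir):
    """Copy the SQLite BioSQL template into a working directory.

    Hypothetical helper illustrating the copy/delete workflow above;
    the caller unlinks the returned path after the pipeline run.
    """
    dest = Path(workdir) / Path(template_path).name
    shutil.copy2(template_path, dest)  # preserves metadata, unlike copyfile
    return dest
```

Copying a pre-built template avoids re-running the schema and taxonomy loads for every project, at the cost of some disk space during the run.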