Configure Google Cloud Storage
callahantiff opened this issue · 3 comments
TASK
Task Type: PKT DATA DELIVERY
Use Google Cloud Storage in build to store each build's downloaded data and output knowledge graphs
- Data used for each build
- Built KGs
- Built Embeddings
TODO
- Create GCS Bucket for builds in GCP PheKnowLator bucket (details here)
- Apply Object Lifecycle Management to GCS bucket
- Create a script similar to google_cloud_storage_downloader.py that provides API access to Google Cloud Storage
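As a rough illustration of the last TODO item, here is a minimal sketch of what such an upload helper might look like. It is not the project's script: the function names `upload_directory` and `upload_build` are hypothetical, and the sketch assumes the `google-cloud-storage` client library with application default credentials already configured.

```python
from pathlib import Path


def upload_directory(bucket, local_dir, prefix):
    """Upload every file under local_dir to `bucket` beneath `prefix`.

    `bucket` is assumed to behave like google.cloud.storage.Bucket, i.e. it
    exposes blob(name), whose result has upload_from_filename(path).
    Returns the object names that were written, in upload order.
    """
    local_dir = Path(local_dir)
    uploaded = []
    for path in sorted(local_dir.rglob("*")):
        if not path.is_file():
            continue  # directories are implied by object names in GCS
        # Object names always use forward slashes, regardless of the local OS.
        relative = path.relative_to(local_dir).as_posix()
        name = f"{prefix.rstrip('/')}/{relative}"
        bucket.blob(name).upload_from_filename(str(path))
        uploaded.append(name)
    return uploaded


def upload_build(bucket_name, local_dir, prefix):
    """Hypothetical entry point: connect to GCS and upload one local directory."""
    # Imported lazily so the helper above can be exercised without the library.
    from google.cloud import storage

    client = storage.Client()  # uses application default credentials
    return upload_directory(client.bucket(bucket_name), local_dir, prefix)
```

Passing the bucket object in, rather than creating the client inside the loop, keeps the path logic easy to check with a stand-in bucket before any real upload is attempted.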
Resources:
This is what I am proposing for organizing the GCS bucket:
GCS bucket root/
|---- pheknowlator/
| |---- release_v.1.0/ ...
| |---- release_v.2.0/
| | |---- *build_<<date>>/
| | | |---- data/
| | | | |---- original_data/
| | | | |---- processed_data/
| | | |---- knowledge_graphs/
| | | | |---- subclass_builds/
| | | | | |---- relations_only/
| | | | | | |---- owl/
| | | | | | |---- owlnets/
| | | | | |---- inverse_relations/
| | | | | | |---- owl/
| | | | | | |---- owlnets/
| | | | |---- instance_builds/
| | | | | |---- relations_only/
| | | | | | |---- owl/
| | | | | | |---- owlnets/
| | | | | |---- inverse_relations/
| | | | | | |---- owl/
| | | | | | |---- owlnets/
| | |---- *build_<<date>>/ ...
I will update the release_v.1.0 data once this work is complete and add files from past builds, so that I no longer have to maintain them via my Dropbox.
*meant to symbolize each monthly build
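To make the proposed layout concrete, the sketch below generates the object prefixes for a single build. The function name `build_prefixes` and the format of the date string are assumptions for illustration only; the directory names are taken from the tree above.

```python
from itertools import product

# Directory names copied from the proposed bucket layout.
BUILD_TYPES = ("subclass_builds", "instance_builds")
RELATION_TYPES = ("relations_only", "inverse_relations")
OWL_TYPES = ("owl", "owlnets")


def build_prefixes(release, build_date, root="pheknowlator"):
    """Return every leaf prefix for one build.

    `build_date` is inserted verbatim after "build_"; the actual date
    format used for the monthly builds is not specified here.
    """
    base = f"{root}/{release}/build_{build_date}"
    prefixes = [
        f"{base}/data/original_data/",
        f"{base}/data/processed_data/",
    ]
    for build, relation, owl in product(BUILD_TYPES, RELATION_TYPES, OWL_TYPES):
        prefixes.append(f"{base}/knowledge_graphs/{build}/{relation}/{owl}/")
    return prefixes


for prefix in build_prefixes("release_v.2.0", "EXAMPLE_DATE"):
    print(prefix)
```

GCS has no real directories, so these prefixes only come into existence once an object is written beneath them.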
GCS Permissions Setting: I was thinking of setting the bucket's pheknowlator directory and all subsequent directories to Nearline to start. Once we know what the usage pattern is, we can adjust it.
@bill-baumgartner - What do you think about this plan?
I like the directory structure. Do you expect to host other data aside from KG builds here? If not, then the knowledge_graph_builds/ directory is probably not required.
It looks as though we can specify the storage class on a per-object level, and we can use Object Lifecycle Management rules to change the storage class over time. So, a newly built KG could use the Standard Storage class initially, and then be downgraded to Nearline Storage after a period of time, e.g. 30 or 60 days.
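A minimal sketch of the lifecycle rule described here, written as the JSON configuration that GCS lifecycle management accepts. The helper name `nearline_after` is hypothetical, and the 30-day threshold is just one of the two values floated above.

```python
import json


def nearline_after(days):
    """Build a lifecycle config that moves Standard objects to Nearline.

    The rule fires once an object is at least `days` days old and is
    still in the STANDARD storage class.
    """
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": days, "matchesStorageClass": ["STANDARD"]},
            }
        ]
    }


lifecycle_json = json.dumps(nearline_after(30), indent=2)
print(lifecycle_json)
```

The printed JSON could be saved to a file and applied to the bucket, for example with `gsutil lifecycle set <file> gs://<bucket>`. Switching to 60 days later only means changing the argument and reapplying the configuration.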
> I like the directory structure. Do you expect to host other data aside from KG builds here? If not, then the knowledge_graph_builds/ directory is probably not required.
I was thinking about that too. I'm guessing not. If we wanted to store a primary Docker container or something like that, it would likely be at the release level. I will modify the figure to remove the knowledge_graph_builds/ directory.
> It looks as though we can specify the storage class on a per-object level, and we can use Object Lifecycle Management rules to change the storage class over time. So, a newly built KG could use the Standard Storage class initially, and then be downgraded to Nearline Storage after a period of time, e.g. 30 or 60 days.
Awesome, this is perfect.
✔️ I will go ahead and get this set up now. It will allow me to start modifying and creating the code we will need to support the three-task build plan we discussed yesterday.