E3SM-Project/esgf_metrics

Add containerization, an object-relational mapper (ORM), and a database service for persistent data storage


Currently, logs are parsed using a Python generator and stored in a pandas DataFrame.

Why we should store parsed logs in a SQL database:

  • avoid unnecessarily re-running the entire log-parsing process, which is time-consuming
  • output is columnar and structured
  • data is persistent
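For context, the current flow (a generator yielding one record per log line, collected into a DataFrame) might look roughly like the sketch below; the function name, log format, and field names are illustrative, not the actual `LogParser` implementation:

```python
from typing import Dict, Iterator, List

import pandas as pd


def parse_logs(lines: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per access-log line (hypothetical format)."""
    for line in lines:
        ip, path, status = line.split()
        yield {"ip": ip, "path": path, "status": status}


# Collect the generator's output into a columnar, structured DataFrame.
raw = ["10.0.0.1 /data/file1.nc 200", "10.0.0.2 /data/file2.nc 404"]
df = pd.DataFrame(list(parse_logs(raw)))
```

The drawback is that this DataFrame lives only in memory, so the full parse must be repeated each run, which is what persisting to SQL avoids.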

Maybe we should use docker-compose instead of having to install a SQL database natively.

Checklist

Containerization

  • Add Docker related files
    • Update structure of .env using /.envs directory
    • Add /docker directory to store docker service related files (e.g., Dockerfile)
    • Add docker-compose.yml
  • Install Docker and docker-compose on acme1
  • Start up docker-compose services
    • docker-compose up -d
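A minimal docker-compose.yml for the database service might look like the following; the service name, image tag, env file path, and volume name are placeholders, and credentials would live under the proposed /.envs directory:

```yaml
version: "3.8"

services:
  postgres:
    image: postgres:14
    env_file:
      - ./.envs/.postgres   # credentials kept out of the compose file
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

volumes:
  postgres_data:
```

With this in place, `docker-compose up -d` starts the database in the background and the named volume keeps the data persistent across container restarts.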

Refactor esgf_metrics

  • Refactor the existing data structures used for pandas DataFrames in the LogParser class
    • Update the order and names of the dict keys (DataFrame columns)
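Renaming and reordering the columns so the DataFrame lines up with the planned table schema could be sketched as follows (the column names here are hypothetical, not the actual LogParser keys):

```python
import pandas as pd

# Hypothetical parsed-log frame whose columns need renaming and reordering.
df = pd.DataFrame(
    {"dataset": ["a", "b"], "requests": [3, 5], "timestamp": ["t1", "t2"]}
)

# Rename dict keys/columns to match the planned database column names.
df = df.rename(columns={"timestamp": "log_date", "requests": "request_count"})

# Reorder columns so the DataFrame aligns with the table definition.
df = df[["log_date", "dataset", "request_count"]]
```

Keeping the DataFrame column names identical to the SQL column names makes the eventual insert step (e.g., `DataFrame.to_sql`) a straight pass-through.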

Object Relational Mapper

  • Add dependencies to conda-env (sqlalchemy and psycopg2-binary)
  • Add /database directory for database related Python modules
  • Add settings.py to store engine settings
  • Add models.py to store SQLAlchemy models
  • Add script to populate database and schema using models.py
    • This only needs to be run initially.
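A sketch of what models.py plus the one-time schema-creation script could contain, assuming SQLAlchemy's declarative mapping; the table and column names are hypothetical, and SQLite in memory stands in for the Postgres DSN that settings.py would provide:

```python
from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class LogEntry(Base):
    """Hypothetical table for parsed ESGF access-log records."""

    __tablename__ = "log_entry"

    id = Column(Integer, primary_key=True)
    log_date = Column(Date, nullable=False)
    dataset = Column(String, nullable=False)
    request_count = Column(Integer, default=0)


# In production the URL would come from settings.py (a Postgres DSN via
# psycopg2); SQLite in memory keeps this sketch self-contained.
engine = create_engine("sqlite://")

# One-time schema creation, per the checklist item above.
Base.metadata.create_all(engine)
```

Because `create_all` is idempotent (it skips tables that already exist), rerunning the population script by accident would not clobber the schema.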

Automation

  • Add docker-compose service that runs a cron job which updates the database with new logs and updated metrics
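One way to express that service, as a compose fragment whose command installs a crontab and runs cron in the foreground; the service name, schedule, module path, and Alpine-style `crond` invocation are all placeholders (a Debian-based image would use `cron -f` instead):

```yaml
services:
  metrics_cron:
    build: ./docker
    env_file:
      - ./.envs/.postgres
    depends_on:
      - postgres
    # Nightly update of the database with new logs and metrics (placeholder
    # schedule and command).
    command: >
      sh -c "echo '0 2 * * * python -m esgf_metrics.update' | crontab -
      && crond -f"
```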