ncihtan/data-models

Sync released data models to BQ tables

Opened this issue · 4 comments

As a data manager for HTAN, I would like our internal BigQuery `htan-dcc:metadata` tables to include:

  • data-model_main: an accurate reflection of the main branch of the data model
  • data-model_vYY.MM.minor: a table for every released version of the data model
  • data-model_latest: a table that reflects the latest released version of the data model

This lets us use the data model in queries against our submitted manifests or other information held in BigQuery.

We can extend the bq-schema workflow as follows.

Trigger the workflow when a release is created:

on:
  push:
    branches: main
    paths: 'HTAN.model.csv'
  release:
    types: [created]
  workflow_dispatch: 

Add a job to create a versioned table if the event name is release

  add-versioned-table:
    name: Add versioned schema to BQ
    runs-on: ubuntu-latest
    needs: add-to-bq
    if: github.event_name == 'release'

Then duplicate the versioned table as latest

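      - name: Duplicate versioned table as latest
        shell: bash
        env:
          # Pass the tag through the environment rather than interpolating it into the script
          VERSION: ${{ github.event.release.tag_name }}
        run: |
          # -f overwrites data_model_latest if it already exists from an earlier release
          bq cp -f "htan-dcc:metadata.data_model_${VERSION}" htan-dcc:metadata.data_model_latest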
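
One thing to check before wiring this up: release tags in the vYY.MM.minor form contain dots, and as far as I know BigQuery table names do not accept dots (the dot also separates dataset and table in `bq` references). A minimal sketch of how the tag could be turned into a table name, assuming we replace anything that is not a letter, digit, or underscore with `_`. The function names `sanitize_tag` and `versioned_table` are hypothetical and not part of the existing workflow:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the current bq-schema workflow.

# Turn a release tag such as v24.8.1 into a suffix that is safe in a table name.
sanitize_tag() {
  local tag="$1"
  # Replace every character that is not a letter, digit, or underscore with "_".
  printf '%s' "${tag//[^A-Za-z0-9_]/_}"
}

# Build the fully qualified versioned table reference for a given release tag.
versioned_table() {
  printf 'htan-dcc:metadata.data_model_%s' "$(sanitize_tag "$1")"
}
```

In the step above, `bq cp` would then take `"$(versioned_table "$VERSION")"` as its source. Whether underscores are the naming we want for versioned tables is still open.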

Please add a "critical" label if this is expected within phase 1.0, or a "renewal" label if it can wait.

Need to discuss with ISB during data flow discussions. @aclayton555 tag in flow diagram. Need to understand how users are engaging with BQ.

Currently, there is a workflow to sync the staging version of the data model with the BQ tables. It is used to help populate attribute descriptions. It is still in use, but @PozhidayevaDarya has not run it for a little while. TBD on updating it to include the attributes listed here AND so that it no longer points to staging.

24-8 Close-out: take this into consideration in the data model design doc. Need to understand what is needed here and the needed architecture.