Automate metric collection and processing workflow
Closed this issue · 11 comments
For creating a self-updating metrics dashboard, we will have to create a pipeline that runs every day and updates the data reflected in the dashboard.
This pipeline will have the following steps:
- Data collection from GitHub and storage on Ceph
- Metric calculation from the raw GitHub data
- Update the results csv on Ceph being used by Trino and Superset (a rough sketch of this upload step is below)
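A minimal sketch of the upload step using boto3 against Ceph's S3-compatible endpoint; the endpoint, credential env vars, bucket, and key below are assumptions, not the actual project config:

```python
import os

import boto3

# Ceph exposes an S3-compatible API, so boto3 works against it directly.
# Endpoint, credentials, bucket, and key are placeholders/assumptions.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

# Upload the freshly computed metrics csv to the prefix that Trino reads from
s3.upload_file(
    Filename="metrics.csv",
    Bucket="metrics-bucket",
    Key="github-metrics/metrics.csv",
)
```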
I was playing around with the GitHub API for Python and that seems pretty straightforward as well. For example, see this snippet that gets open issues for the metrics repository:
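Roughly something along these lines with PyGithub (the token env var and the org/repo placeholder are assumptions, not the exact snippet):

```python
import os

from github import Github  # PyGithub

# Authenticate with a personal access token (env var name is an assumption)
gh = Github(os.environ["GITHUB_TOKEN"])

# "org/metrics-repo" is a placeholder for the actual metrics repository
repo = gh.get_repo("org/metrics-repo")

# List the currently open issues
for issue in repo.get_issues(state="open"):
    print(issue.number, issue.title)
```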
Also, the mi tool uses the GitHub API under the hood, see this file. It also has some metric processing classes that we can maybe reuse.
Some notes that may help while creating the automated pipeline for metric collection.
@Shreyanand if I recall correctly, you will not have to actually make any "Modifications to the Trino table being used by Superset"; just adding the relevant metrics back to Ceph should propagate the update through Trino and Superset.
That's even better! Updated.
I think the modification to the Trino table needs to be triggered each time we want an updated table.
Meaning, updates to the Ceph S3 dataset will also need to be followed by updates to the Trino table created from the dataset.
Both steps can be part of the same notebook though (getting data from the S3 bucket, calculating and storing metrics on S3, and creating the tables on Trino).
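A rough sketch of what the table (re)creation cell could look like with the trino Python client; the host, catalog, schema, column names, and S3 location are all assumptions:

```python
import trino

# Connection details are placeholders for illustration
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=443,
    user="metrics-bot",
    http_scheme="https",
    catalog="hive",
    schema="metrics",
)
cur = conn.cursor()

# External table over the Ceph/S3 prefix holding the metrics CSVs;
# the Hive connector's CSV format expects VARCHAR columns
cur.execute(
    """
    CREATE TABLE IF NOT EXISTS github_metrics (
        repo VARCHAR,
        metric_name VARCHAR,
        metric_value VARCHAR,
        collected_at VARCHAR
    )
    WITH (
        external_location = 's3a://metrics-bucket/github-metrics/',
        format = 'CSV'
    )
    """
)
cur.fetchall()  # force the statement to run to completion
```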
In our ai4ci workflows, we add these steps to the same kubeflow pipeline so that they occur sequentially.
What is missing in this list, and which we also need to automate, is "refreshing the dataset" in Superset upon a Trino table update. That step is currently manual and can only be done by the dataset owner/creator. Refreshing the Superset dataset will automatically update the charts and the dashboard.
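For that refresh step, one option might be Superset's REST API instead of the manual click in the UI. I haven't verified this against our deployment, so treat the endpoints, auth provider, and env vars below as assumptions:

```python
import os

import requests

SUPERSET_URL = os.environ.get("SUPERSET_URL", "https://superset.example.com")
DATASET_ID = os.environ["SUPERSET_DATASET_ID"]  # numeric id of the dataset to refresh

# Log in to get a JWT (assumes the "db" auth provider)
login = requests.post(
    f"{SUPERSET_URL}/api/v1/security/login",
    json={
        "username": os.environ["SUPERSET_USER"],
        "password": os.environ["SUPERSET_PASSWORD"],
        "provider": "db",
        "refresh": True,
    },
)
login.raise_for_status()
token = login.json()["access_token"]

# Ask Superset to re-sync the dataset's columns/metadata from Trino
resp = requests.put(
    f"{SUPERSET_URL}/api/v1/dataset/{DATASET_ID}/refresh",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
```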
> updates to the Ceph S3 dataset will also need to be followed by updates to the Trino table created from the dataset.
Is this true even if there is no schema change to an existing table, just a new, correctly formatted csv added to Ceph? Are the Trino tables no longer "external tables" the way we had set them up a while back?
I remember the tables being fairly static and needing a refresh each time new datasets were pushed. But maybe this depends on the way the tables or the database are configured.
@hemajv do you know how we can configure Trino tables to get updated automatically upon updates to an S3 prefix (e.g. when new files appear within the prefix)?
I remember that in the past, if new files were added to S3, they would get appended as new rows in the table (as long as the columns remained the same). However, @oindrillac if you have seen that the table isn't getting appended, then we may have to look into the table creation commands and figure out if there is another way. Perhaps it has changed with newer versions of Superset/Trino. I'll look into it.
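As far as I know, an unpartitioned external table picks up new files under its location at query time, which matches the appending behaviour described above. If the table ends up being partitioned (an assumption on my part), the Hive connector's sync_partition_metadata procedure can register new partition directories, roughly like this:

```python
import trino

# Same placeholder connection details as the earlier sketch
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=443,
    user="metrics-bot",
    http_scheme="https",
    catalog="hive",
    schema="metrics",
)
cur = conn.cursor()

# Register any new partition directories for the (assumed) partitioned table
cur.execute("CALL system.sync_partition_metadata('metrics', 'github_metrics', 'ADD')")
cur.fetchall()
```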
/assign @chauhankaranraj
@MichaelClifford As part of the automation, we were thinking of moving ahead with an Argo workflow or cron job approach to serve our automation needs. Currently, we have only one notebook that needs to run in automation, and Kubeflow Pipelines might be an overhead. The notebook can be executed daily to fetch the most up-to-date data and analyze different repositories by passing the repo URLs as env vars. Hence, we thought we could replicate a workflow similar to the one in the mailing list analysis project.
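To make the env var part concrete, the entrypoint of the job would look roughly like this; the variable name and the helper function are placeholders, not the actual notebook code:

```python
import os

# The cron job injects the target repository through an env var
# (the exact variable name is an assumption)
REPO_URL = os.environ["REPO_URL"]


def collect_and_store_metrics(repo_url: str) -> None:
    """Placeholder for the notebook logic: pull raw GitHub data,
    compute the metrics, and push the results csv back to Ceph."""
    print(f"collecting metrics for {repo_url}")


if __name__ == "__main__":
    collect_and_store_metrics(REPO_URL)
```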
@codificat @tumido @4n4nd We want to be able to pass a list of env vars to our cron job so that it can run in parallel for the different env vars passed. Any suggestions/recommendations on how we could do this?
For more context, we have a cron job that takes in a repository name as an environment variable and fetches metrics for it daily. Now we want to run the same cron job for several repositories in parallel, and we want to discuss how to achieve that. I found this online, but I'm not sure if it is the best way.