Data Observability for the AWS Glue ETL Pipeline

Question

arjun2189 opened this issue 3 years ago · 0 comments

Problem

To begin with, not every company has a full-grown data warehouse; some start with the data lake itself as the single place for their data. Our use case is similar: S3 is our data lake, Glue jobs are our transform step, and Athena is our query engine. AWS Glue provides its own monitoring dashboard, but only at the job level: how many jobs ran, how many succeeded, and how many failed.

Solution

It would be great to have not only job-level metrics but also data-level metrics, such as the row count of each table corresponding to a particular Glue job (if a table is exposed). All of these can be pulled easily from the Glue Catalog metadata. With them, we could see whether there was any anomaly in the counts for regularly run jobs. Some of these metrics could be exposed through your current solution, for example when a job was last run or updated and when its table counts were last updated.
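To make the data-level idea concrete, here is a minimal sketch of reading a table's row count from the Glue Catalog and checking it against earlier counts. It assumes the table was populated by a Glue crawler, which typically stores a `recordCount` entry in the table's `Parameters`; tables created another way may not have it. The function names, the z-score rule, and the threshold are illustrative assumptions, not part of any existing tool.

```python
from statistics import mean, pstdev


def table_record_count(glue_client, database, table):
    """Return the row count stored in the Glue Catalog, or None if absent.

    Assumption: a Glue crawler wrote `recordCount` into the table Parameters.
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    params = response["Table"].get("Parameters", {})
    raw = params.get("recordCount")
    return int(raw) if raw is not None else None


def is_count_anomalous(history, latest, threshold=3.0):
    """Flag `latest` when it lies more than `threshold` population standard
    deviations from the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data points to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # All historical counts are identical; any change is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold
```

In practice `glue_client` would be `boto3.client("glue")`, and `history` would come from wherever earlier counts are stored (a hypothetical metrics table, for instance).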
A Glue job also carries its own metadata, such as the number of workers used for the job, the timeout associated with it, and the Python version. A way to observe that would also be a great addition.
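As a rough illustration of the job metadata mentioned above, the sketch below collects a job's configuration and its most recent run through the Glue `get_job` and `get_job_runs` API calls. The function names and the shape of the returned dictionaries are hypothetical; only a single page of runs is inspected, so this is a starting point rather than a complete collector.

```python
def job_config_summary(glue_client, job_name):
    """Collect the configuration fields of a Glue job worth observing."""
    job = glue_client.get_job(JobName=job_name)["Job"]
    command = job.get("Command", {})
    return {
        "name": job.get("Name", job_name),
        "worker_type": job.get("WorkerType"),
        "number_of_workers": job.get("NumberOfWorkers"),
        "timeout_minutes": job.get("Timeout"),  # Glue reports timeout in minutes
        "glue_version": job.get("GlueVersion"),
        "python_version": command.get("PythonVersion"),
    }


def last_run_summary(glue_client, job_name, max_results=25):
    """Return state and timing of the most recent run, or None if no runs."""
    response = glue_client.get_job_runs(JobName=job_name, MaxResults=max_results)
    # Pick the latest run explicitly instead of relying on response order.
    runs = [run for run in response["JobRuns"] if "StartedOn" in run]
    if not runs:
        return None
    latest = max(runs, key=lambda run: run["StartedOn"])
    return {
        "run_id": latest["Id"],
        "state": latest["JobRunState"],
        "started_on": latest["StartedOn"],
        "completed_on": latest.get("CompletedOn"),
        "execution_seconds": latest.get("ExecutionTime"),
    }
```

With boto3 installed, passing `boto3.client("glue")` as `glue_client` would query the real service; both summaries could then be emitted as metrics alongside the job-level counts.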
