AbsaOSS/pramen

Add the ability to store arbitrary metadata for metastore tables.


Background

This is needed to remember state between pipeline operations, for example accuracy and control metrics.

Feature

Add the ability to store arbitrary metadata for metastore tables.

Example

def run(metastore: MetastoreReader, ...): Unit = {
  // Store an arbitrary key/value pair for the table and info date
  metastore.setMetadata("my_table", infoDate, "my_metric", "43")

  // Retrieve the value later; returns None if the key has not been set
  val valueOpt = metastore.getMetadata("my_table", infoDate, "my_metric")
}

Proposed Solution

Extend MetastoreReader with getMetadata/setMetadata methods that allow storing metadata per table and info date. Persist the metadata in the bookkeeping database.
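
A minimal sketch of the proposed additions (the method names come from the example above; the parameter and return types are assumptions, and the real MetastoreReader has other methods):

import java.time.LocalDate

trait MetastoreReader {
  // Persist an arbitrary key/value pair for the given table and info date
  // (to be stored in the bookkeeping database)
  def setMetadata(tableName: String, infoDate: LocalDate, key: String, value: String): Unit

  // Retrieve a previously stored value; None if nothing was set for this key
  def getMetadata(tableName: String, infoDate: LocalDate, key: String): Option[String]
}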

You might prefer a solution based on Spark's native schema field metadata [1]. It lets you bind metadata to each column without needing the bookkeeper. If you want DataFrame-level metadata, you can attach it to a service column.

[1] https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withMetadata.html
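For illustration, a minimal Scala sketch of attaching column-level metadata (DataFrame.withMetadata is available since Spark 3.3; the column and key names here are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder().master("local[*]").appName("metadata-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Build an arbitrary key/value metadata object
val meta = new MetadataBuilder().putString("my_metric", "43").build()

// Attach the metadata to a column; it travels with the schema (e.g. into Parquet)
val dfWithMeta = df.withMetadata("id", meta)

println(dfWithMeta.schema("id").metadata.getString("my_metric"))  // 43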

Schema metadata might be something to have as well. However, in this case I want to have metadata set for each partition/info_date separately.

my_table / 2023-09-20 / record_count=100
my_table / 2023-09-21 / record_count=121
...

By the way, on a different topic: Pramen already supports the 'comment' column metadata field, which translates into the column description of Parquet files. It is used in Hive DDL to add column descriptions, but I haven't tested it much.
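
For reference, a small sketch of how Spark exposes the 'comment' key in field metadata (the column name and comment text are made up):

import org.apache.spark.sql.types.{StringType, StructField}

// Spark stores a column comment under the "comment" key of the field metadata,
// which is what Hive DDL uses as the column description
val field = StructField("customer_id", StringType).withComment("Unique customer identifier")

println(field.getComment())  // Some(Unique customer identifier)
println(field.metadata)      // contains the "comment" key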