YotpoLtd/metorikku

Significance of tableName for output in metric.yaml file

kiranbobba opened this issue · 2 comments

I am trying to understand the code we currently have, which is similar to the example below.

job.yml
metrics:
  - metric.yaml
inputs:
  df_input:
    file: 
      path: s3a://bucket1/database1/table1/*.csv
      format: csv
      options:
        header: true
        delimiter: ","
output:
  file:
    dir: s3a://bucket1/

metric.yaml
steps:
  - dataFrameName: df1
    sql:
      SELECT * FROM df_input

output:
  - dataFrameName: df1
    outputType: File
    format: parquet
    outputOptions:
      saveMode: Overwrite
      path: final/hive/database1/table1
      protectFromEmptyOutput: false
      tableName: database1.table1
      partitionBy:
        - as_of_date

What is the significance of tableName under output in the metric.yaml file? The sample config at https://github.com/YotpoLtd/metorikku/blob/master/config/metric_config_sample.yaml describes this property with the comment "# save output to hive metastore (or any other catalog provider)". What does that mean? Does it issue "MSCK REPAIR", "ALTER TABLE ADD PARTITION", or something similar to update the Hive metastore? What are the prerequisites for this property to work? It worked for us on our old cluster but not on the new one.

Another question, indirectly linked to the one above: suppose I have 2 metric files in my job.yml file, and the first one writes data to a file location on which a Hive external table is defined. Can the second metric file access that data, assuming the tableName property of the output is not working in the first metric file? Is there an example that does this?

The configuration above throws the following error, but it works fine when I remove the tableName property. What could be the reason?

Caused by: com.yotpo.metorikku.exceptions.MetorikkuWriteFailedException: Failed to write dataFrame: df1 to output: File on metric: metric
.......
.......
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'database1' not found;

It means that you are creating a Hive external table. The error tells you that the database doesn't exist in the metastore, so you should create it beforehand.
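
As a minimal sketch of "creating it beforehand", the statements below could be run once from a Spark SQL or Hive session before the job starts. This assumes the target database is the database1 named in tableName, and that the session is connected to the same metastore the Metorikku job uses on the new cluster. If it is not, the job will still fail with the same error.

```sql
-- Assumption: 'database1' is the database prefix used in tableName (database1.table1).
-- IF NOT EXISTS makes the statement safe to re-run.
CREATE DATABASE IF NOT EXISTS database1;

-- Confirm that the session (and therefore the metastore it talks to) can see the database.
SHOW DATABASES;
```

If database1 already shows up on the old cluster but not on the new one, that difference would be consistent with the job working on one and failing on the other.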