AbsaOSS/spark-metadata-tool

DRAFT ISSUE: Merge spark metadata

Closed · 1 comment

Background

The user wants to reset checkpoints but keep the existing Parquet files in the output. If the existing Spark metadata folder is deleted, we lose track of the existing Parquet files. If we don't delete the metadata folder, microbatches will be skipped: Spark thinks it has already processed a microbatch, because there is already a file for it in the metadata log.

Feature

Function: merge
Input:

  • New metadata folder
  • Old metadata folder

Get the list of Parquet files from the old metadata folder (i.e. everything from the latest compact file plus all newer files).

Prepend that list to the latest compact file in the new metadata folder, so that the old entries stay ahead of the new ones. If no compact file exists, prepend to the first file.

Assume that no process writes to either Spark metadata folder while the merge runs.
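
The steps above could be sketched roughly as follows. This is only an illustration, not the tool's implementation: it is written in Python against a local filesystem, and the names `merge`, `old_entries`, `log_files`, `latest_compact` and `read_entries` are hypothetical. It assumes the usual metadata log layout, where each log file is named after its batch ID (with a `.compact` suffix for compact files), the first line is a version header, and every following line is one file entry.

```python
from pathlib import Path
from typing import List, Optional, Tuple

COMPACT_SUFFIX = ".compact"  # assumed suffix of compact log files


def batch_id(path: Path) -> int:
    # "12" -> 12, "19.compact" -> 19
    return int(path.name.split(".")[0])


def log_files(folder: Path) -> List[Path]:
    # Keep only files whose name starts with a batch ID; this skips
    # hidden files such as ".0.crc". Result is sorted by batch ID.
    files = [p for p in folder.iterdir()
             if p.is_file() and p.name.split(".")[0].isdigit()]
    return sorted(files, key=batch_id)


def latest_compact(files: List[Path]) -> Optional[Path]:
    compacts = [p for p in files if p.name.endswith(COMPACT_SUFFIX)]
    return compacts[-1] if compacts else None


def read_entries(path: Path) -> Tuple[str, List[str]]:
    # Returns (version header, entry lines).
    lines = path.read_text().splitlines()
    return lines[0], [line for line in lines[1:] if line]


def old_entries(old_dir: Path) -> List[str]:
    # Everything from the latest compact file plus all newer files;
    # if there is no compact file, every log file is read.
    files = log_files(old_dir)
    compact = latest_compact(files)
    start = batch_id(compact) if compact else -1
    entries: List[str] = []
    for f in files:
        if batch_id(f) >= start:
            entries.extend(read_entries(f)[1])
    return entries


def merge(old_dir: Path, new_dir: Path) -> Path:
    # Prepend the old entries to the latest compact file of the new
    # folder, or to its first log file if no compact file exists.
    new_files = log_files(new_dir)
    if not new_files:
        raise ValueError("new metadata folder contains no log files")
    target = latest_compact(new_files) or new_files[0]
    version, entries = read_entries(target)
    merged = old_entries(old_dir) + entries
    target.write_text("\n".join([version] + merged))
    return target
```

A call such as `merge(Path("old/_spark_metadata"), Path("new/_spark_metadata"))` (both paths are placeholders) would rewrite exactly one file in the new folder and return its path. The sketch overwrites the target in place and does no locking, which relies on the assumption that nothing else writes to either folder.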

Duplicate of #31