AbsaOSS/atum

Atum 3+ does not write pending checkpoints

Closed this issue · 0 comments

It has come to our attention that Atum 3 does not flush pending checkpoints correctly, whereas 0.2.6 did.
When checkpoints are created using df.setCheckpoint(name) and then written explicitly using df.writeInfoFile(path) (the Enceladus usage), everything works as expected.

However, if you only create the checkpoint, write the data using spark.write, and rely on the _INFO file being created or amended automatically, the pending checkpoint data is not written to the _INFO file. The problem only appears when no metadata storer is set, i.e. when the initialization was done using spark.enableControlMeasuresTracking(sourceInfoFile = "data/input/_INFO"), which effectively leaves the second parameter, destinationInfoFile, equal to "".
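
For illustration, the failing pattern might look roughly like the sketch below. It is not taken from the Atum examples: the object name, the input and output paths, the checkpoint name, and the CSV read are hypothetical, and the import of za.co.absa.atum.AtumImplicits._ and the implicit FileSystem are assumptions about the Atum 3 API. It needs Spark and Atum on the classpath and an existing data/input/_INFO file.

```scala
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.{SaveMode, SparkSession}
import za.co.absa.atum.AtumImplicits._ // assumed location of the Atum implicits

// Hypothetical reproduction object, not part of the Atum code base.
object PendingCheckpointRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Pending checkpoint repro")
      .master("local[*]")
      .getOrCreate()

    // Assumption: some Atum 3 calls expect an implicit Hadoop FileSystem;
    // declaring it is harmless if a given call does not use it.
    implicit val fs: FileSystem =
      FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // Only sourceInfoFile is given, so destinationInfoFile stays "".
    spark.enableControlMeasuresTracking(sourceInfoFile = "data/input/_INFO")

    // Hypothetical input; the checkpoint is created but left pending.
    val df = spark.read
      .option("header", "true")
      .csv("data/input")
      .setCheckpoint("checkpoint1")

    // Relying on the automatic _INFO write: per this report, with
    // Atum 3.0.0 and 3.1.0 the data lands here but no _INFO file does.
    df.write.mode(SaveMode.Overwrite).parquet("data/output/result")

    // Per this report, writing the file explicitly does work:
    // df.writeInfoFile("data/output/result/_INFO")
  }
}
```

With 0.2.6, the same flow would be expected to leave an _INFO file containing checkpoint1 next to the written data.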

Affected versions: 3.0.0, 3.1.0

How to reproduce

  1. Run za.co.absa.atum.examples.SampleMeasurements1 on its own or via the runner za.co.absa.atum.examples.SampleMeasurementsHdfsRunnerSpec (the _INFO file write here relies on the pending checkpoint being written to an inferred path).
  2. Inspect examples/data/output/stage1_job_results/: the data has been written there, but the _INFO file is missing.