Atum 3+ does not write pending checkpoints
It has come to our attention that Atum 3 does not flush pending checkpoints correctly, as 0.2.6 did.
When checkpoints are created using df.setCheckpoint(name) and then written explicitly using df.writeInfoFile(path) (the Enceladus usage), it works as expected.
However, if you only create the checkpoint, write the data using spark.write, and rely on the _INFO file being created or amended automatically, the pending checkpoint data is not written to the _INFO file. The problem only appears when no metadata storer is set, i.e. when the initialization was done using spark.enableControlMeasuresTracking(sourceInfoFile = "data/input/_INFO") (effectively with the second parameter, destinationInfoFile, equal to "").
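The two usage patterns described above can be sketched roughly as follows. This is only an illustration under assumptions, not code taken from Atum or Enceladus: the object name, the Parquet format, the checkpoint name, and the input and output paths are hypothetical, and the implicit Hadoop FileSystem is assumed to be what Atum 3.x expects in scope. The exact form of the path argument to writeInfoFile is not specified in this report, so the value used here is a placeholder.

```scala
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.SparkSession
import za.co.absa.atum.AtumImplicits._

// Hypothetical object name, used only for this sketch.
object PendingCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pending-checkpoint-sketch")
      .master("local[*]")
      .getOrCreate()

    // Assumption: Atum 3.x resolves an implicit Hadoop FileSystem for its file operations.
    implicit val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // Initialization as quoted in this issue: only the source _INFO file is given,
    // so destinationInfoFile is effectively "" and no metadata storer is set.
    spark.enableControlMeasuresTracking(sourceInfoFile = "data/input/_INFO")

    // Hypothetical input location and checkpoint name.
    val df = spark.read.parquet("data/input")
      .setCheckpoint("checkpoint1")

    // Affected pattern: the checkpoint is only created and the data is written;
    // the _INFO file is expected to appear next to the output automatically.
    // On 3.0.0 and 3.1.0 the pending checkpoint is reportedly not written.
    df.write.mode("overwrite").parquet("data/output/implicit")

    // Pattern reported to work (the Enceladus usage): write the _INFO file explicitly.
    // The path value is a placeholder; check the Atum documentation for the expected form.
    df.write.mode("overwrite").parquet("data/output/explicit")
    df.writeInfoFile("data/output/explicit")

    spark.stop()
  }
}
```

If the behaviour matches this report, only data/output/explicit would end up with an _INFO file on the affected versions.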
Affected versions: 3.0.0, 3.1.0
How to reproduce
- run za.co.absa.atum.examples.SampleMeasurements1 on its own or via the runner za.co.absa.atum.examples.SampleMeasurementsHdfsRunnerSpec (the _INFO file write here relies on the pending checkpoint being written to an inferred path)
- observe examples/data/output/stage1_job_results/, where the data has been written but the _INFO file is missing