
[BUG] Data Loss using Flink-Delta Sink on Apache Flink 1.15

scottsand-db opened this issue

The goal of this issue is to inform Flink-Delta connector users of a known data loss problem with the Flink-Delta Sink, track the status of the ongoing investigation and solution, and answer any questions that users may have.

Overview

A user of the flink-delta connector reported various problems (including data loss, but also recovery crashes) while using the flink-delta sink on Flink 1.14 and 1.15. Diving into the logs surfaced several Apache Flink bugs, for which Flink issues FLINK-29509, FLINK-29512, and FLINK-29627 were created.

After these issues were resolved and their corresponding PRs merged, the data loss issues persisted. We focused our attention on the flink-delta connector and discovered the following issue.

As a reminder, the GlobalCommitter is responsible for combining DeltaCommittables into a DeltaGlobalCommittable and then committing that file metadata into a Delta table. An assumption made during the design of the flink-delta sink was that all DeltaGlobalCommittables for a given checkpoint X would be passed to the GlobalCommitter at the same time. Because of this assumption, the flink-delta sink had this line of code (comparing the incoming DeltaGlobalCommittable’s checkpointId to the most recent checkpointId for this Flink application committed to the Delta table), which was used to prevent a DeltaGlobalCommittable from being committed to a Delta table twice.
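To make that assumption concrete, here is a minimal sketch of the guard. It is not the connector's actual code; GlobalCommitterGuardSketch, Committable, and shouldCommit are illustrative stand-ins for the real DeltaGlobalCommitter, DeltaCommittable, and comparison logic.

```java
/**
 * Minimal sketch of the de-duplication guard described above; not the
 * connector's actual code, all names are illustrative.
 */
class GlobalCommitterGuardSketch {

    /** Simplified stand-in for a DeltaCommittable: the checkpoint it belongs to plus one data file. */
    record Committable(long checkpointId, String dataFile) {}

    /**
     * The incoming committable's checkpointId is compared against the most recent
     * checkpointId this Flink application has recorded in the Delta table; anything
     * at or below it is assumed to be a duplicate and is skipped.
     */
    static boolean shouldCommit(Committable committable, long lastCommittedCheckpointId) {
        return committable.checkpointId() > lastCommittedCheckpointId;
    }
}
```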

However, it appears that this assumption is incorrect. Different writers produce different committables that all belong to the same checkpoint (say, committables 5.1 and 5.2 both belong to checkpoint 5). There is no guarantee that both of these committables will be part of the DeltaGlobalCommittable that we try to commit to the Delta table. This can be caused by dropped RPC calls or simply by arbitrary delays in sending data, as illustrated in the sketch below.
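The following sketch walks through that failure mode using the same simplified model (again, all names are illustrative, not the connector's real code): the first DeltaGlobalCommittable for checkpoint 5 arrives with only writer 1's committable and is committed, which bumps the recorded checkpointId to 5; when writer 2's committable for the same checkpoint arrives later, the guard treats it as a duplicate and silently drops it.

```java
import java.util.List;

/**
 * Walk-through of the failure mode described above, using a simplified model.
 * All names are illustrative; the real sink commits file metadata to a Delta table.
 */
class SplitCommittableScenario {

    record Committable(long checkpointId, String dataFile) {}

    public static void main(String[] args) {
        // Checkpoint 4 has already been committed to the Delta table.
        long lastCommittedCheckpointId = 4L;

        // Writers 1 and 2 both produced data for checkpoint 5, but only writer 1's
        // committable made it into the first DeltaGlobalCommittable (e.g. because of
        // a dropped RPC call or a delayed message from writer 2).
        List<Committable> firstGlobalCommittable = List.of(new Committable(5, "part-5.1"));
        for (Committable c : firstGlobalCommittable) {
            if (c.checkpointId() > lastCommittedCheckpointId) {
                System.out.println("Committed " + c.dataFile());
            }
        }
        // The Delta table now records checkpoint 5 as committed.
        lastCommittedCheckpointId = 5L;

        // Writer 2's committable for the same checkpoint arrives later, in a second
        // DeltaGlobalCommittable. The guard treats it as a duplicate and drops it,
        // so part-5.2 is never added to the Delta table: data loss.
        List<Committable> secondGlobalCommittable = List.of(new Committable(5, "part-5.2"));
        for (Committable c : secondGlobalCommittable) {
            if (c.checkpointId() > lastCommittedCheckpointId) {
                System.out.println("Committed " + c.dataFile());
            } else {
                System.out.println("Skipped " + c.dataFile() + " (assumed duplicate)");
            }
        }
    }
}
```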

Example Scenario

(Diagram: Flink-Delta Sink - Data Loss Example)

Investigation

This design doc (note: still WIP) is being used to investigate the data loss issue, state our assumptions, and propose different solutions.

Please give it a read and give us your feedback!

Thanks @kristoffSC for working on this!

PR ready:
#473

Closing since #473 was merged.