delta-io/connectors

[Feature Request] Ensure that path is available as part of PendingFileRecoverable interface

gopik opened this issue · 5 comments

gopik commented

The new sink API supports getPath() and getSize() on PendingFileRecoverable, but getPath() currently returns null because we are not passing the path down.

getPath() is needed for stats computation when a file is committed.

Hi @gopik - thanks for making this issue. You seem to have some context of the code itself, would you be willing to contribute this fix?

What exactly is meant by "since we are not passing it down"? Just want to clarify.

gopik commented

https://nightlies.apache.org/flink/flink-docs-master/api/java/index.html?org/apache/flink/core/classloading/package-summary.html

At this line we create a PendingFileRecoverable using a deprecated overload (see the javadoc). If we used the other overload, we could capture the file path and size as part of the PendingFileRecoverable.
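To illustrate the difference between the two overloads, here is a minimal, self-contained sketch. It does not use the real Flink classes: the interface and `SimplePendingFileRecoverable` below are hypothetical stand-ins, the `String` commit handle is a placeholder for the real commit state, and returning -1 for an unknown size is an assumption made for this example.

```java
import java.util.Objects;

public class PendingRecoverableSketch {

    /** Hypothetical stand-in for the PendingFileRecoverable interface discussed above. */
    interface PendingFileRecoverable {
        String getPath(); // null if the path was never captured
        long getSize();   // -1 if the size was never captured (assumed convention)
    }

    /** Hypothetical implementation with a legacy and a full constructor. */
    static final class SimplePendingFileRecoverable implements PendingFileRecoverable {
        private final String commitHandle; // placeholder for the real commit state
        private final String path;
        private final long size;

        /** Legacy-style overload: path and size are not passed down. */
        @Deprecated
        SimplePendingFileRecoverable(String commitHandle) {
            this(commitHandle, null, -1L);
        }

        /** Full overload: path and size travel with the recoverable. */
        SimplePendingFileRecoverable(String commitHandle, String path, long size) {
            this.commitHandle = Objects.requireNonNull(commitHandle, "commitHandle");
            this.path = path;
            this.size = size;
        }

        @Override
        public String getPath() {
            return path;
        }

        @Override
        public long getSize() {
            return size;
        }
    }

    public static void main(String[] args) {
        // Created through the legacy overload: nothing to compute stats from.
        PendingFileRecoverable legacy = new SimplePendingFileRecoverable("handle-1");
        // Created through the full overload: path and size are known before commit.
        PendingFileRecoverable full =
                new SimplePendingFileRecoverable("handle-2", "/table/part-0001.parquet", 1024L);

        System.out.println(legacy.getPath());
        System.out.println(full.getPath() + " " + full.getSize());
    }
}
```

With the legacy-style constructor, a consumer calling getPath() gets null, which matches the behaviour reported in this issue. With the full constructor, the same call returns the target path without any extra state being stored elsewhere.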

Why is this important

In DeltaPendingFile, we can compute file stats if we have the actual file path (the one that would be published on commit). Right now we don't know the file path until after commit. This change would give us access to the path before commit. Otherwise a lot more plumbing is needed: we would either have to keep the file path in DeltaPendingFile as state, which requires serializer/deserializer changes, or pass it in while committing, which requires code changes in unrelated parts (like the committer).

This repo has been deprecated and the code has moved under the connectors module in the https://github.com/delta-io/delta repository. Please create the issue in https://github.com/delta-io/delta. See #556 for details.