delta-io/connectors

External parquet files - trying to relativize path

aamend opened this issue · 1 comment

aamend commented

Context

It may be useful to point a Delta log at external Parquet files (Parquet files living on a different filesystem).
In my example, a _delta_log needs to be created as a shallow clone of external files on s3://, which fails with the exception below.

Note that the Spark configuration suggested in the error message does not work and is irrelevant here, since this is Delta Standalone. The error message is wrong, and the "ignoreError" flag is not implemented. This is definitely a bug.

See: https://github.com/delta-io/connectors/blob/master/standalone/src/main/scala/io/delta/standalone/internal/OptimisticTransactionImpl.scala#L250
And: https://github.com/delta-io/connectors/blob/master/standalone/src/main/scala/io/delta/standalone/internal/util/DeltaFileOperations.scala#L57

Error

Job aborted due to stage failure: Failed to relativize the path (s3://snip/snip/snip.parquet). 
This can happen when absolute paths make it into the transaction log, which start with the scheme s3://, wasbs:// or adls://. 
This is a bug that has existed before DBR 5.0.
To fix this issue, please upgrade your writer jobs to DBR 5.0 and please run:
com.databricks.delta.Delta.fixAbsolutePathsInLog("s3://snip/snip/snip.parquet").

If this table was created with a shallow clone across file systems
(different buckets/containers) and this table is NOT USED IN PRODUCTION, you can
set the SQL configuration spark.databricks.delta.vacuum.relativize.ignoreError
to true. Using this SQL configuration could lead to accidental data loss,
therefore we do not recommend the use of this flag unless
this is a shallow clone for testing purposes.

Ask

Please allow the ignoreError parameter to be set. It currently defaults to false and cannot be overridden. https://github.com/delta-io/connectors/blob/master/standalone/src/main/scala/io/delta/standalone/internal/util/DeltaFileOperations.scala#L49
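
To illustrate the requested behaviour, here is a minimal sketch of a relativize helper with an overridable ignoreError flag. It is not the Delta Standalone implementation: the object name RelativizeSketch, the function tryRelativize and its signature are hypothetical, and it uses only java.net.URI rather than Hadoop's Path and FileSystem, so it shows the intended semantics only.

```scala
import java.net.URI

// Hypothetical helper for illustration only; NOT the code in DeltaFileOperations.scala.
object RelativizeSketch {

  /**
   * Tries to express `child` relative to the table root `base`.
   * When the file lives on another filesystem or outside the table root:
   *   - ignoreError = false: throw, as in the error reported in this issue
   *   - ignoreError = true: keep the absolute path unchanged
   */
  def tryRelativize(base: URI, child: URI, ignoreError: Boolean = false): URI = {
    if (!child.isAbsolute) {
      // No scheme: the path is already relative to the table root.
      child
    } else {
      // URI.relativize returns `child` itself when the scheme or authority differ,
      // or when the base path is not a prefix of the child path.
      val relative = base.relativize(child)
      if (relative != child) {
        relative
      } else if (ignoreError) {
        child
      } else {
        throw new IllegalStateException(s"Failed to relativize the path ($child)")
      }
    }
  }
}
```

With the table root file:/path/delta/standalone, a file under that root is returned as a relative path, while s3://snip/snip/snip.parquet is returned unchanged when ignoreError is true and raises IllegalStateException otherwise.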

Steps to reproduce

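import scala.collection.JavaConverters._
import io.delta.standalone.{DeltaLog, Operation}
import io.delta.standalone.actions.AddFile

// conf (Hadoop Configuration), jmap (java.util.Map[String, String]) and path (presumably a java.io.File) are defined elsewhere
val files = List(
  // absolute path on a different filesystem than the table root
  new AddFile("s3://snip/snip/snip.parquet", jmap, path.length(), path.lastModified(), true, null, jmap)
).asJava

val log = DeltaLog.forTable(conf, "/path/delta/standalone")
val txn = log.startTransaction()
txn.commit(files, new Operation(Operation.Name.WRITE), "local") // reported to fail with the error above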

Environment

  • io.delta:delta-standalone_2.12:0.6.0
  • databricks DBR 12.2LTS

This repo has been deprecated and the code has moved to the connectors module in the https://github.com/delta-io/delta repository. Please create the issue in the https://github.com/delta-io/delta repository. See #556 for details.