delta-io/connectors

Delta standalone problem - snappy compression hardcoded for delta log parquet files

krzysztoflupa opened this issue · 2 comments

Runtime

macOS, Intel processor.

Docker base image - adoptopenjdk/openjdk16:alpine-jre.
snappy-java 1.1.8.4 - released one year ago, full of bugs, and the maintainers have no intention of releasing a fixed version
hadoop-client, hadoop-aws 3.3.4
delta-standalone_2.12 0.6.0
JDK openjdk16 - checked with a variety of versions, the problem is the same

Problem description

Since setting up snappy (snappy-java 1.1.8.4) in an isolated environment like k8s is not trivial (but possible), I needed to use a different compression codec (gzip) for the data Parquet files. That is perfectly fine and configurable.

However, for one in every 10 commits Delta Standalone also creates a Parquet file (a checkpoint) for the Delta log. That in itself is fine.

The problem is that Delta Standalone uses snappy to compress these files, and the codec is hardcoded:
https://github.com/delta-io/connectors/blob/master/standalone/src/main/scala/io/delta/standalone/internal/Checkpoints.scala

This is far from ideal. The compression method for Delta log Parquet files should be fully configurable, and it should be consistent with the Hadoop/Spark compression properties.

Consequences

1 - Inconsistent compression type between data and Delta log Parquet files.

2 - Plenty of hard-to-investigate problems with the snappy library (snappy-java 1.1.8.4), for example:

java.lang.IllegalArgumentException: newLimit > capacity: (12 > 6)
	at java.base/java.nio.Buffer.createLimitException(Unknown Source) ~[na:na]
	at java.base/java.nio.Buffer.limit(Unknown Source) ~[na:na]
	at java.base/java.nio.ByteBuffer.limit(Unknown Source) ~[na:na]
	at java.base/java.nio.MappedByteBuffer.limit(Unknown Source) ~[na:na]
	at java.base/java.nio.MappedByteBuffer.limit(Unknown Source) ~[na:na]

Ideas

Snappy may be the default, but it must not be hardcoded.
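
A minimal sketch of the idea, not the actual Delta Standalone code: the checkpoint writer could look up the codec in the configuration it already receives and fall back to snappy only when nothing is set. The class `CheckpointCodecResolver`, its `Codec` enum, and the plain `Map` standing in for the Hadoop configuration are all hypothetical. The key `parquet.compression` is the one parquet-hadoop uses for data files; reusing it for checkpoints is my assumption.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper; nothing like this exists in Delta Standalone today. */
final class CheckpointCodecResolver {

    /** Assumption: reuse the parquet-hadoop key so data and log files agree. */
    static final String COMPRESSION_KEY = "parquet.compression";

    /** Illustrative subset of codecs; a real fix would use Parquet's own enum. */
    enum Codec { UNCOMPRESSED, SNAPPY, GZIP }

    private CheckpointCodecResolver() {}

    /**
     * Returns the configured codec, or SNAPPY when the key is absent or blank.
     * The Map is a stand-in for the Hadoop configuration object.
     */
    static Codec resolve(Map<String, String> conf) {
        String raw = conf.get(COMPRESSION_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return Codec.SNAPPY; // default stays the same as today
        }
        try {
            return Codec.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Fail loudly instead of silently falling back to snappy.
            throw new IllegalArgumentException(
                "Unsupported checkpoint compression codec: " + raw, e);
        }
    }
}
```

With something like this, `CheckpointCodecResolver.resolve(Map.of("parquet.compression", "gzip"))` would select gzip for the checkpoint, while an empty configuration keeps the current snappy behaviour.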

Hi @krzysztoflupa, thanks for making this issue! Are you interested in contributing the fix?

This repo has been deprecated and the code has moved under the connectors module in the https://github.com/delta-io/delta repository. Please create the issue in that repository. See #556 for details.