Support disabling automatic decompression of gzip files in GCS connector
blackvvine opened this issue · 0 comments
Summary
Hadoop's default behaviour is to automatically decompress files with the .gz extension (CompressionCodecFactory selects the codec from the file-name suffix).
When gzip-encoding support is enabled (fs.gs.inputstream.support.gzip.encoding.enable=true) and a gzip-encoded file is read from GCS, both the GCS connector and the Hadoop filesystem layer attempt to decompress the stream, leading to errors like:
Caused by: java.io.IOException: incorrect header check
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:227)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
[...]
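The double-decompression failure can be reproduced in miniature with Python's standard gzip module (a sketch only; the real stack involves the GCS connector and Hadoop's native ZlibDecompressor):

```python
import gzip

# A gzip-compressed payload, standing in for a .gz object in GCS.
payload = gzip.compress(b"some record data")

# First pass: the GCS connector honours Content-Encoding: gzip and
# decompresses the stream once.
decoded = gzip.decompress(payload)

# Second pass: Hadoop sees the .gz extension and tries to decompress
# again, but the bytes no longer carry a gzip header -- the analogue of
# "java.io.IOException: incorrect header check".
try:
    gzip.decompress(decoded)
except gzip.BadGzipFile as exc:
    print("second decompression failed:", exc)
```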
Expected Behaviour
Since the gzip decompression behaviour cannot be disabled in Hadoop without modifying the hadoop-core library, it would be helpful if the GCS connector could automatically skip decompression when the file extension is .gz, or at least expose a configuration property for disabling automatic decompression.
Current Workarounds
Either unset the Content-Encoding: gzip metadata field on the GCS object (so the connector does not decompress it), or remove the .gz extension from the object name (so Hadoop does not).
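Both workarounds can be applied with the google-cloud-storage client. This is an illustrative sketch only: the bucket and object names are hypothetical, and it requires credentials with write access to the object, so it is not runnable as-is.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # hypothetical bucket name
blob = bucket.blob("data.gz")        # hypothetical object name

# Workaround 1: clear the Content-Encoding metadata so the connector
# serves the raw gzip bytes and only Hadoop decompresses them.
blob.content_encoding = None
blob.patch()

# Workaround 2: drop the .gz extension so Hadoop does not attempt a
# second decompression of the already-decoded stream.
bucket.rename_blob(blob, "data")
```

Note that workaround 1 changes how all other clients see the object: downloads will no longer be transparently decompressed by GCS.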