GoogleCloudDataproc/hadoop-connectors

Support disabling automatic decompression of gzip files in GCS connector

blackvvine opened this issue

Summary

Hadoop's default behaviour is to automatically decompress files with the .gz extension: input formats consult CompressionCodecFactory, which picks a decompression codec based solely on the file name suffix.
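
For reference, a minimal sketch of that extension-based lookup, assuming only the stock Hadoop compression classes (the gs:// path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // For any path ending in .gz this returns GzipCodec, so input formats
    // built on LineRecordReader wrap the stream in a gzip decompressor.
    CompressionCodec codec = factory.getCodec(new Path("gs://bucket/data.csv.gz"));
    System.out.println(codec == null ? "no codec" : codec.getClass().getName());
  }
}

Because the lookup is keyed purely on the suffix, Hadoop decompresses regardless of whether the stream it receives is still compressed.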

When gzip encoding support is enabled (fs.gs.inputstream.support.gzip.encoding.enable=true) and a gzip-encoded file is read from GCS, both the GCS connector and Hadoop's compression codec layer attempt to decompress it, leading to errors like the following (a reproduction sketch follows the trace):

Caused by: java.io.IOException: incorrect header check
	at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
	at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:227)
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
[...]
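
For context, a reproduction sketch under the following assumptions: the GCS connector is on the classpath, the object carries Content-Encoding: gzip, and the bucket and object names are placeholders:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DoubleDecompressRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.gs.inputstream.support.gzip.encoding.enable", true);
    Path path = new Path("gs://my-bucket/data.csv.gz"); // placeholder
    FileSystem fs = path.getFileSystem(conf);
    // The connector performs the first decompression while serving the
    // stream; the GzipCodec wrapper that Hadoop applies for the .gz suffix
    // then sees plain bytes instead of a gzip header and throws.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
    try (InputStream in = codec.createInputStream(fs.open(path))) {
      in.read(); // java.io.IOException: incorrect header check
    }
  }
}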

Expected Behaviour

Since Hadoop's extension-based decompression cannot be disabled without modifying the hadoop-core library, it would be helpful if the GCS connector automatically skipped its own decompression when the file extension is .gz, or at least provided a configuration property to disable the automatic decompression.
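
Purely as an illustration of the first option, a sketch of the requested check; this is not existing connector code and the method name is made up:

public final class GzipSkipSketch {
  // Hypothetical: when the object name ends in .gz, Hadoop's codec layer
  // will decompress anyway, so the connector should serve raw bytes.
  static boolean connectorShouldDecompress(String objectName, boolean gzipEncodingEnabled) {
    boolean hadoopWillDecompress = objectName.endsWith(".gz");
    return gzipEncodingEnabled && !hadoopWillDecompress;
  }

  public static void main(String[] args) {
    System.out.println(connectorShouldDecompress("data.csv.gz", true)); // false
    System.out.println(connectorShouldDecompress("data.csv", true));    // true
  }
}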

Current Workarounds

Either unset the Content-Encoding: gzip metadata field on the GCS object (so the connector does not attempt decompression, as sketched below), or remove the .gz extension from the object name (so Hadoop's codec layer does not).
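
For the first workaround, one way to clear the header is the google-cloud-storage Java client; the bucket and object names below are placeholders:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ClearContentEncoding {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.get("my-bucket", "data.csv.gz"); // placeholders
    // With Content-Encoding cleared, GCS serves the stored (compressed)
    // bytes as-is and only Hadoop's .gz codec performs decompression.
    storage.update(blob.toBuilder().setContentEncoding(null).build());
  }
}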