Support disabling automatic decompression of gzip files in GCS connector #1060

blackvvine · 2023-10-06T18:33:48Z

Summary

Hadoop's default behaviour is to automatically decompress files with the .gz extension (see here).

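When gzip encoding is enabled (fs.gs.inputstream.support.gzip.encoding.enable=true), reading a gzip-encoded file with the .gz extension from GCS causes the file to be decompressed twice: once by the GCS connector (because of Content-Encoding: gzip) and once by Hadoop (because of the .gz extension). The second pass receives bytes that are already decompressed and fails with errors like: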

Caused by: java.io.IOException: incorrect header check
	at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
	at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:227)
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
[...]
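
The failure mode can be reproduced outside Hadoop. The following is a minimal Python sketch, not connector code: the first `gzip.decompress` call stands in for the GCS connector and the second for Hadoop's .gz codec. The variable names are illustrative, and Python reports the failure with its own exception rather than the zlib "incorrect header check" message shown above.

```python
import gzip

payload = b"hello from GCS\n"
stored = gzip.compress(payload)  # object bytes as stored with Content-Encoding: gzip

# First layer (stands in for the connector): succeeds and yields plain text.
after_connector = gzip.decompress(stored)

# Second layer (stands in for the .gz codec): the input is no longer gzip data.
try:
    gzip.decompress(after_connector)
    second_pass_failed = False
except OSError as exc:  # gzip.BadGzipFile is a subclass of OSError
    second_pass_failed = True
    print(f"second decompression failed: {exc}")
```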

Expected Behaviour

Disabling Hadoop's gzip decompression is not possible without changing the hadoop-core library. It would therefore be helpful if the GCS connector either skipped its own decompression automatically when the file extension is .gz, or at least provided a configuration property for disabling automatic decompression.
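
As a rough illustration of the requested behaviour, the decision could look something like the sketch below. It combines both proposals (skip on .gz, plus an opt-out switch). The function name and the `auto_decompress` parameter are hypothetical and do not correspond to any existing connector API or configuration property; only `gzip_encoding_enabled` mirrors the existing fs.gs.inputstream.support.gzip.encoding.enable setting.

```python
def connector_should_decompress(object_name, content_encoding,
                                gzip_encoding_enabled, auto_decompress=True):
    """Return True if the connector itself should gunzip the stream.

    Hypothetical helper: `auto_decompress` models the proposed opt-out
    property, which does not exist in the connector today.
    """
    if not gzip_encoding_enabled or content_encoding != "gzip":
        return False  # nothing for the connector to decompress
    if not auto_decompress:
        return False  # proposed configuration property turned off
    # Hadoop selects a gzip codec from the .gz suffix, so leave those files to it.
    return not object_name.endswith(".gz")
```

With this rule, a gzip-encoded `data.json.gz` would be decompressed only by Hadoop, while a gzip-encoded `data.json` would still be decompressed by the connector.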

Current Workarounds

Either unset the Content-Encoding: gzip metadata field on the GCS object (so that the connector does not decompress it), or remove the .gz extension from the object name (so that Hadoop does not decompress it).
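
For reference, the two workarounds might be scripted roughly as follows. This is a sketch that assumes the google-cloud-storage Python client, where `blob` and `bucket` would be `Blob` and `Bucket` objects obtained from a `storage.Client`; the helper names are illustrative. Only one of the two should be applied to a given object, otherwise nothing decompresses it.

```python
def clear_content_encoding(blob):
    """Workaround 1: drop Content-Encoding so the connector leaves the bytes alone."""
    blob.content_encoding = None
    blob.patch()  # sends the metadata update to GCS


def strip_gz_suffix(bucket, blob):
    """Workaround 2: rename the object so Hadoop does not pick a gzip codec."""
    suffix = ".gz"
    if not blob.name.endswith(suffix):
        return blob  # nothing to do
    # rename_blob copies the object to the new name and deletes the old one.
    return bucket.rename_blob(blob, blob.name[:-len(suffix)])
```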
