Summary
Hadoop's default behaviour is to automatically decompress files with the .gz extension (see here).
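When gzip encoding support is enabled (fs.gs.inputstream.support.gzip.encoding.enable=true) and a gzip-encoded file with a .gz extension is read from GCS, both the GCS connector and Hadoop FS attempt to decompress it. The connector decompresses the object first, so Hadoop's codec receives data that is no longer gzip-compressed and fails with errors like: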
Caused by: java.io.IOException: incorrect header check
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:227)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
[...]
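The double decompression can be imitated outside Hadoop. The following is a minimal sketch using only Python's standard `gzip` module, not the connector's or Hadoop's actual code path: the first `gzip.decompress` call stands in for the connector decoding `Content-Encoding: gzip`, and the second stands in for the codec Hadoop selects from the .gz extension. The function name and payload are made up for illustration.

```python
import gzip


def read_with_double_decompression(stored_bytes: bytes) -> bytes:
    """Imitate two layers that each try to gunzip the same object."""
    # Layer 1: stands in for the connector decoding Content-Encoding: gzip.
    once = gzip.decompress(stored_bytes)
    # Layer 2: stands in for the codec Hadoop picks from the .gz extension.
    # Its input is already plain text, so it has no gzip header to parse.
    return gzip.decompress(once)


payload = b"hello from GCS\n"   # illustrative content
stored = gzip.compress(payload)  # what the object holds in the bucket

try:
    read_with_double_decompression(stored)
    outcome = "ok"
except OSError as exc:  # gzip.BadGzipFile is a subclass of OSError
    outcome = f"failed: {exc}"

print(outcome)
```

The second call fails because its input lacks a gzip header, which is the same class of failure as the `incorrect header check` raised by `ZlibDecompressor` above.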
Expected Behaviour
Since the gzip decompression behaviour in Hadoop cannot be disabled without changing the hadoop-core library, it would be helpful if the GCS connector automatically skipped its own decompression when the file extension is .gz, or at least provided a configuration property for disabling the automatic decompression.
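The requested behaviour amounts to a small decision rule. The sketch below is hypothetical: the function and its `skip_for_gz_extension` parameter are invented for illustration and do not correspond to an existing connector property, and it is written in Python rather than the connector's Java. It also assumes the extension check is a case-sensitive match on `.gz`.

```python
from typing import Optional


def connector_should_decompress(
    object_name: str,
    content_encoding: Optional[str],
    gzip_support_enabled: bool,
    skip_for_gz_extension: bool = True,  # hypothetical opt-out switch
) -> bool:
    """Decide whether the connector itself should gunzip the object."""
    # Nothing to do unless gzip encoding support is switched on.
    if not gzip_support_enabled:
        return False
    # Only objects stored with Content-Encoding: gzip are candidates.
    if (content_encoding or "").strip().lower() != "gzip":
        return False
    # Leave .gz objects compressed so that Hadoop's codec, which is
    # selected from the extension, is the only layer that decompresses.
    if skip_for_gz_extension and object_name.endswith(".gz"):
        return False
    return True
```

With this rule, `data/part-0000.gz` stored with `Content-Encoding: gzip` would reach Hadoop still compressed, while `data/part-0000` with the same metadata would still be decompressed by the connector.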
Current Workarounds
Either unset the Content-Encoding: gzip metadata field on the GCS object (so that the connector does not decompress it) or remove the .gz extension from the object name (so that Hadoop does not decompress it).
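Both workarounds can be scripted. The helpers below are a rough sketch that assumes the `google-cloud-storage` Python client, where a `Blob` exposes a writable `content_encoding` property plus `patch()`, and a `Bucket` exposes `rename_blob()`. The helper names are invented for illustration, and the functions take the blob and bucket as arguments so they can be tried against stand-in objects first.

```python
def clear_content_encoding(blob):
    """Workaround 1: drop Content-Encoding so the connector skips decoding."""
    # Assumption: assigning None and calling patch() clears the field.
    blob.content_encoding = None
    blob.patch()


def strip_gz_extension(object_name: str) -> str:
    """Return the object name without a trailing .gz, if it has one."""
    suffix = ".gz"
    if object_name.endswith(suffix):
        return object_name[: -len(suffix)]
    return object_name


def rename_without_gz(bucket, blob):
    """Workaround 2: rename the object so Hadoop no longer picks a gzip codec."""
    new_name = strip_gz_extension(blob.name)
    if new_name == blob.name:
        return blob  # nothing to rename
    return bucket.rename_blob(blob, new_name)
```

Apply only one of the two to a given object: clearing the metadata leaves Hadoop as the only decompressor, while renaming leaves the connector as the only one.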