arrow-rs/parquet can't read files produced by parquet2 #227

Open
mattbonnell opened this issue Apr 26, 2023 · 1 comment

Comments

@mattbonnell

  • Wrote a file with parquet2 v0.17.2
  • Tried to read it with https://github.com/timvw/qv, which uses arrow-rs/parquet
  • Got the following panic:
thread 'main' panicked at 'index out of bounds: the len is 8192 but the index is 8192', /Users/runner/Library/Caches/Homebrew/cargo_cache/registry/src/github.com-1ecc6299db9ec823/parquet-26.0.0/src/encodings/rle.rs:490:25
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic_bounds_check
   3: parquet::encodings::rle::RleDecoder::get_batch_with_dict
   4: <parquet::encodings::decoding::DictDecoder<T> as parquet::encodings::decoding::Decoder<T>>::get
   5: <parquet::column::reader::decoder::ColumnValueDecoderImpl<T> as parquet::column::reader::decoder::ColumnValueDecoder>::read
   6: parquet::column::reader::GenericColumnReader<R,D,V>::read_batch
   7: parquet::arrow::record_reader::GenericRecordReader<V,CV>::read_records
   8: parquet::arrow::array_reader::read_records
   9: <parquet::arrow::array_reader::primitive_array::PrimitiveArrayReader<T> as parquet::arrow::array_reader::ArrayReader>::read_records
  10: <parquet::arrow::array_reader::struct_array::StructArrayReader as parquet::arrow::array_reader::ArrayReader>::read_records
  11: <parquet::arrow::arrow_reader::ParquetRecordBatchReader as core::iter::traits::iterator::Iterator>::next
  12: <S as futures_core::stream::TryStream>::try_poll_next
  13: <futures_util::stream::stream::map::Map<St,F> as futures_core::stream::Stream>::poll_next
  14: <futures_util::stream::stream::map::Map<St,F> as futures_core::stream::Stream>::poll_next
  15: <datafusion::physical_plan::file_format::file_stream::FileStream<F> as futures_core::stream::Stream>::poll_next
  16: <datafusion::physical_plan::projection::ProjectionStream as futures_core::stream::Stream>::poll_next
  17: <datafusion::physical_plan::limit::LimitStream as futures_core::stream::Stream>::poll_next
  18: <futures_util::stream::try_stream::try_collect::TryCollect<St,C> as core::future::future::Future>::poll
  19: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  20: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  21: tokio::park::thread::CachedParkThread::block_on
  22: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
  23: tokio::runtime::Runtime::block_on
  24: qv::main

The panic happens here: https://github.com/apache/arrow-rs/blob/b1642ab150ee61f730b2cda51bb917d42d9aeeb1/parquet/src/encodings/rle.rs#L490
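For illustration, the backtrace ends in a dictionary decode path (`RleDecoder::get_batch_with_dict`), and the message says a slice of length 8192 was indexed at position 8192. Which slice that was is not established in this thread. The sketch below only shows the general defensive pattern of a checked lookup that returns an error instead of panicking. `lookup_dict` is a hypothetical helper, not a function in arrow-rs or parquet2.

```rust
/// Hypothetical helper (not part of arrow-rs or parquet2): resolve
/// dictionary indices with a bounds check instead of direct indexing.
fn lookup_dict<T: Copy>(dict: &[T], indices: &[u32]) -> Result<Vec<T>, String> {
    indices
        .iter()
        .map(|&i| {
            // `get` returns None for an out-of-range index, so a malformed
            // file yields an error rather than a panic.
            dict.get(i as usize).copied().ok_or_else(|| {
                format!(
                    "dictionary index {} out of bounds for dictionary of len {}",
                    i,
                    dict.len()
                )
            })
        })
        .collect()
}
```

With a three-entry dictionary, indices `[2, 0]` resolve normally, while index `3` produces an `Err` describing the mismatch.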

I've seen similar errors when trying to read the file with Trino (https://github.com/trinodb/trino).

@the80srobot

the80srobot commented Dec 14, 2023

I had the same problem, but so far every case I've hit turned out to be caused by writing something that doesn't conform to the Parquet format.

The parquet2 crate doesn't do enough validation to guarantee its output is correct. For example, it'll happily let you skip definition levels even when they're required.
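As a rough sketch of the kind of validation meant here, the function below checks a page's definition levels before it is written. It assumes a flat (non-nested) column, so repetition levels are ignored, and it assumes that a required column has a maximum definition level of 0 and stores no levels. `check_def_levels` and its parameters are hypothetical names, not part of parquet2 or arrow-rs.

```rust
/// Hypothetical pre-write check; not an API of parquet2 or arrow-rs.
/// Assumes a flat (non-nested) column, so repetition levels are ignored.
fn check_def_levels(
    max_def_level: u16,
    def_levels: Option<&[u16]>,
    num_values: usize,
    non_null_values: usize,
) -> Result<(), String> {
    match (max_def_level, def_levels) {
        // Required column: no levels are stored and every slot holds a value.
        (0, None) if non_null_values == num_values => Ok(()),
        (0, None) => Err(format!(
            "required column has {} slots but {} values",
            num_values, non_null_values
        )),
        (0, Some(_)) => Err("required column must not carry definition levels".to_string()),
        // Optional column: skipping the levels is the mistake described above.
        (_, None) => Err("optional column is missing its definition levels".to_string()),
        (max, Some(levels)) => {
            if levels.len() != num_values {
                return Err(format!(
                    "expected {} definition levels, found {}",
                    num_values,
                    levels.len()
                ));
            }
            if let Some(bad) = levels.iter().find(|&&l| l > max) {
                return Err(format!("definition level {} exceeds maximum {}", bad, max));
            }
            // Only slots at the maximum level carry an actual value.
            let defined = levels.iter().filter(|&&l| l == max).count();
            if defined != non_null_values {
                return Err(format!(
                    "levels mark {} values as defined but {} were supplied",
                    defined, non_null_values
                ));
            }
            Ok(())
        }
    }
}
```

Running a check like this on each page before handing it to the writer would surface a missing or inconsistent level buffer at write time, rather than as a panic in a downstream reader.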

What I think the crate needs is a few solid examples of how to write a Parquet file. Currently, it's up to you to figure it out, and it's easy to get it wrong.
