Deserialisation Error for Nested Types #134

Open
sundeepks opened this issue May 1, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@sundeepks

sundeepks commented May 1, 2022

Hi, I am hitting an error while deserialising a Parquet file with nested types. Is there an implementation for the missing parts of the following code snippet (taken from the examples section)?

The code below executes when page.descriptor.max_rep_level > 0. Is there a primitive_nested implementation for byte arrays?



_ => match page.dictionary_page() {
    None => match physical_type {
        PhysicalType::Int64 => Ok(primitive_nested::page_to_array::<i64>(page)?),
        _ => {
            todo!()
        }
    },
    Some(_) => match physical_type {
        PhysicalType::Int64 => Ok(primitive_nested::page_dict_to_array::<i64>(page)?),
        _ => {
            todo!()
        }
    },
},
 
@jorgecarleitao
Owner

Hey!

I know of two implementations: one in arrow2 and one under tests/.

The general idea is:

  1. Split the page buffer into repetition levels (rep), definition levels (def), and values.

  2. Attach three decoders: one for rep, one for def, and one for values. The rep and def decoders should be HybridRleDecoder; the values decoder should match whatever encoding the page uses for its values (the nested logic is independent of the primitive type). Something like:

    let (rep_levels, def_levels, _) = split_buffer(page);
    
    let max_rep_level = page.descriptor.max_rep_level;
    let max_def_level = page.descriptor.max_def_level;
    
    let reps =
        HybridRleDecoder::new(rep_levels, get_bit_width(max_rep_level), page.num_values());
    let defs =
        HybridRleDecoder::new(def_levels, get_bit_width(max_def_level), page.num_values());
    
    let iter = reps.zip(defs);

    (see https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L271)

  3. Advance the iterators and reconstruct the nested type according to the Dremel logic. How to do this depends on how the target format stores nested types (e.g. Vec<Vec<i32>> vs Vec<i32> + offsets). See e.g. https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L391 for how arrow2 does it.
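
Step 3 can be sketched in a few lines of Python for the simplest nested case. This is only an illustration, not the arrow2 or parquet2 implementation: the function name `assemble_lists` is hypothetical, and it assumes a single list level (max_rep_level = 1) whose list is required and whose items are optional (max_def_level = 2), with the rep levels, def levels and values already decoded into plain sequences.

```python
def assemble_lists(reps, defs, values, max_def_level=2):
    """Rebuild a list-of-optional-items column from decoded levels.

    Assumed schema (an assumption of this sketch): required list,
    optional items, so max_rep_level == 1 and max_def_level == 2.
    """
    values = iter(values)
    rows = []
    for rep, definition in zip(reps, defs):
        if rep == 0:
            # rep level 0 starts a new top-level row
            rows.append([])
        if definition == max_def_level:
            # fully defined: consume one entry from the values decoder
            rows[-1].append(next(values))
        elif definition == max_def_level - 1:
            # item slot exists but is null: no entry in the values decoder
            rows[-1].append(None)
        # any lower definition level means the list itself is empty
    return rows


# [[1, 2], [], [None, 3]]
reps = [0, 1, 0, 0, 1]
defs = [2, 2, 0, 1, 2]
values = [1, 2, 3]
rows = assemble_lists(reps, defs, values)
print(rows)
```

Because the values are only pulled from an iterator, the same loop would apply to byte arrays or any other primitive type; only the values decoder changes.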

One important thing to remember is that the length of the rep and def iterators (page.num_values) is not the number of values in the values iterator. For example:

# [[0, None], [], [10]]
reps, defs = list(
    zip(
        *[
            (0, 2),  # 0
            (1, 1),  # None
            (0, 0),  # [] (empty list)
            (0, 2),  # 10
        ]
    )
)

In this case the values contain 2 entries (0 and 10), while the rep and def levels contain 4 entries each.
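
One way to check this relationship is to count the definition levels that equal the maximum definition level, since only those slots have an entry in the values buffer. A minimal sketch, assuming max_def_level is 2 for this schema; the helper name `count_present_values` is made up for illustration:

```python
def count_present_values(defs, max_def_level):
    """Count the slots that carry an entry in the values buffer."""
    return sum(1 for d in defs if d == max_def_level)


# def levels for [[0, None], [], [10]]
defs = [2, 1, 0, 2]
num_levels = len(defs)  # what page.num_values reports for the levels
num_values = count_present_values(defs, max_def_level=2)
print(num_levels, num_values)
```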

@sundeepks
Author

Hey, thanks for the response. I was referring to the one in the tests (https://github.com/jorgecarleitao/parquet2/blob/fa6fa3ca3848c29d8efa80fbf42ee6a5a58cb077/tests/it/read/mod.rs).
Would it be possible to complete the todo!() placeholder in the tests, or to point me to reference code so that I can complete that part myself?

@jorgecarleitao jorgecarleitao added the question Further information is requested label Jun 8, 2022
@jorgecarleitao jorgecarleitao added enhancement New feature or request and removed question Further information is requested labels Aug 10, 2022