This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

0.8.7

@jpivarski jpivarski released this 08 Mar 15:59
· 733 commits to master since this release
1681251

This release adds awkward.toarrow and awkward.toparquet, and renames the old functions to awkward.fromarrow and awkward.fromparquet for symmetry. These functions can only be used if pyarrow is installed; it is not a strict dependency, so it must be installed explicitly. String columns can be converted from Arrow to Awkward, but not from Awkward to Arrow because of an open question (see comments).
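Because pyarrow is optional, calling code may want to check for it before using the conversion functions. The following is a minimal sketch using only the standard library; the helper name has_module is hypothetical and is not part of Awkward.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


if has_module("pyarrow"):
    print("pyarrow found: the Arrow and Parquet conversions should be usable")
else:
    print("pyarrow missing: install it explicitly, e.g. pip install pyarrow")
```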

The conversion implemented here is only between Awkward and Arrow; pyarrow handles the conversion to and from Parquet.

Top-level Awkward Tables (possibly under ChunkedArray or any MaskedArray) are converted into Arrow Tables, but deeper Awkward Tables are converted into Arrow StructArrays.

An Arrow array with an associated mask adds a BitMaskedArray to the Awkward structure. All Awkward MaskedArrays are pushed down to the deepest Arrow level that can accept them. This push-down might be avoidable with a better understanding of how to generate Arrow buffers.
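For orientation, an Arrow mask is a validity bitmap: one bit per element, least-significant bit first within each byte, where 1 means the value is present. The following pure-Python sketch illustrates that layout only; the function names are hypothetical, and this is not how Awkward or pyarrow build their buffers.

```python
def pack_validity(valid):
    """Pack booleans into an Arrow-style validity bitmap (LSB first, 1 = valid)."""
    out = bytearray((len(valid) + 7) // 8)
    for i, is_valid in enumerate(valid):
        if is_valid:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_validity(bitmap, length):
    """Recover the first `length` validity flags from a packed bitmap."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


# Values [1.1, None, 3.3]: the first and third entries are valid.
bitmap = pack_validity([True, False, True])
print(bitmap)                      # b'\x05', i.e. bits 0 and 2 set
print(unpack_validity(bitmap, 3))  # [True, False, True]
```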

Python types in Awkward ObjectArrays can't be saved to Arrow, because Arrow is a multilingual serialization system.

Awkward VirtualArrays are evaluated before converting to Arrow. When reading from Parquet, all columns of all chunks are presented as Awkward VirtualArrays so that they may be read lazily. By default, Awkward VirtualArrays are read-once: the VirtualArray object maintains a reference to the materialized array. That's good for performance when the same data are read multiple times, but bad for memory use. To control this, the cache parameter of fromparquet lets you pass a dict-like cache, such as one from the cachetools library.
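The trade-off between the read-once default and a dict-like cache can be sketched in plain Python. The LazyColumn class below is a hypothetical stand-in written for illustration; it is not Awkward's VirtualArray and makes no claim about how Awkward implements caching.

```python
class LazyColumn:
    """Hypothetical lazily read column; a stand-in, not Awkward's VirtualArray."""

    def __init__(self, key, loader, cache=None):
        self.key = key
        self.loader = loader  # zero-argument callable that reads the data
        self.cache = cache    # any dict-like object, or None for read-once
        self._own = None      # private reference used when no cache is given
        self.loads = 0        # how many times the loader actually ran

    def _load(self):
        self.loads += 1
        return self.loader()

    def materialize(self):
        if self.cache is None:
            # Read-once: keep the data alive for as long as this object lives.
            if self._own is None:
                self._own = self._load()
            return self._own
        try:
            return self.cache[self.key]
        except KeyError:
            # Not cached (or evicted): read again and let the cache decide
            # how long to keep it.
            value = self._load()
            self.cache[self.key] = value
            return value


cache = {}  # a bounded dict-like cache could be used here instead
col = LazyColumn("x", lambda: list(range(5)), cache=cache)
col.materialize()
col.materialize()
print(col.loads)  # 1: the second call was served from the cache
cache.clear()     # simulate eviction
col.materialize()
print(col.loads)  # 2: the data had to be read again
```

With an external cache, memory use is bounded by the cache's eviction policy rather than by the lifetime of each column object, at the cost of occasional re-reads.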

Awkward ChunkedArrays become RecordBatches in a Table in toarrow but separate Tables in toparquet. When reading fromparquet, the separate Tables define the level of granularity for incremental reading.

If toparquet is given an iterable of Awkward data, it will write the Parquet file incrementally. The same can be achieved with an Awkward ChunkedArray of Tables of VirtualArrays, which is what fromparquet returns, so the output of fromparquet can be used as input to toparquet.