0.8.7
This release adds awkward.toarrow and awkward.toparquet, renaming the old functions to awkward.fromarrow and awkward.fromparquet for symmetry. They can only be used if you have pyarrow installed, which is not a strict dependency (it must be installed explicitly). String columns can be converted from Arrow to Awkward, but not from Awkward to Arrow because of an open question (see comments).
The implemented conversion is really just between Awkward and Arrow, letting pyarrow convert to and from Parquet.
Top-level Awkward Tables (possibly under ChunkedArray or any MaskedArray) are converted into Arrow Tables, but deeper Awkward Tables are converted into Arrow StructArrays.
Arrow arrays with an associated mask add a BitMaskedArray to the Awkward structure. All Awkward MaskedArrays are pushed down to the deepest Arrow level that can accept them. This push-down might not be necessary: a better understanding of how to generate Arrow buffers could remove the need for it.
Python types in Awkward ObjectArrays can't be saved to Arrow, as Arrow is a language-independent serialization system.
Awkward VirtualArrays are evaluated before converting to Arrow. When reading from Parquet, all columns of all chunks are presented as Awkward VirtualArrays so that they may be read lazily. By default, Awkward VirtualArrays are read-once: the VirtualArray object maintains a reference to the materialized array. That's good for performance when the data are read multiple times, but bad for memory use. The cache parameter of fromparquet lets you pass a dict-like cache, such as one from the cachetools library.
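The trade-off between read-once behavior and an external cache can be sketched in plain Python. The classes below are hypothetical stand-ins, not Awkward's VirtualArray or a cachetools class: `LazyColumn` holds its data forever when no cache is given, and defers to a dict-like cache otherwise; `BoundedCache` is an assumed, deliberately tiny cache that evicts its oldest entry.

```python
from collections import OrderedDict


class LazyColumn:
    """Hypothetical stand-in for a lazily read column (not Awkward's VirtualArray)."""

    def __init__(self, key, loader, cache=None):
        self.key = key
        self.loader = loader   # zero-argument callable that reads the data
        self.cache = cache     # dict-like object, or None for read-once behavior
        self._held = None      # strong reference used only when cache is None

    def materialize(self):
        if self.cache is None:
            # Read-once: the data stay alive as long as this object does.
            if self._held is None:
                self._held = self.loader()
            return self._held
        try:
            return self.cache[self.key]
        except KeyError:
            value = self.loader()
            self.cache[self.key] = value
            return value


class BoundedCache(OrderedDict):
    """Tiny dict-like cache that keeps only the most recently inserted entries."""

    def __init__(self, maxsize):
        super().__init__()
        self.maxsize = maxsize

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        while len(self) > self.maxsize:
            self.popitem(last=False)   # evict the oldest entry


reads = []   # records every actual read, to make cache hits visible


def make_loader(name):
    def load():
        reads.append(name)
        return [name] * 3
    return load


cache = BoundedCache(maxsize=1)
a = LazyColumn("a", make_loader("a"), cache)
b = LazyColumn("b", make_loader("b"), cache)
a.materialize()
a.materialize()   # served from the cache, no second read
b.materialize()   # inserting "b" evicts "a"
a.materialize()   # "a" has to be read again
```

With a bounded cache, memory use is capped at the cost of occasional re-reads; with `cache=None`, each column is read at most once but never released while its `LazyColumn` is alive.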
Awkward ChunkedArrays become RecordBatches in a Table in toarrow but separate Tables in toparquet. When reading with fromparquet, the separate Tables define the level of granularity for incremental reading.
If toparquet is given an iterable of Awkward data, it will write the Parquet file incrementally. The same can be achieved by an Awkward ChunkedArray of Tables of VirtualArrays, which is what fromparquet returns, so the output of fromparquet can be used as input to toparquet.
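The reader-to-writer pattern described here can be illustrated without Awkward or pyarrow. In this sketch, `read_lazily` and `write_incrementally` are hypothetical helpers (not part of any library), and the "file" is just a Python list; the point is only that a lazy producer feeding a chunk-at-a-time consumer never needs the whole dataset in memory at once.

```python
def read_lazily(stored):
    """Yield stored chunks one by one, mimicking a lazily read file."""
    for chunk in stored:
        yield chunk


def write_incrementally(chunks, write_chunk):
    """Consume an iterable of chunks, handing each one to write_chunk.

    Only the current chunk is referenced here, so a generator input is
    processed without materializing every chunk up front.
    """
    count = 0
    for chunk in chunks:
        write_chunk(chunk)
        count += 1
    return count


source = [[1, 2, 3], [4, 5], [6]]   # three chunks of different lengths
copied = []                         # stands in for the output file
n = write_incrementally(read_lazily(source), copied.append)
```

The chunk boundaries of the input are preserved in the output, which mirrors how separate Tables set the granularity for later incremental reads.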