Fix bug in RT of parquet detection (#278)
* fix bug in RT of parquet

* Update virtualizarr/readers/kerchunk.py

Co-authored-by: Justus Magin <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adds .parquet info to ValueError

* Update kerchunk.py

Co-authored-by: Tom Nicholas <[email protected]>

---------

Co-authored-by: Justus Magin <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tom Nicholas <[email protected]>
4 people authored Nov 4, 2024
1 parent ba46a77 commit ab23caa
Showing 2 changed files with 8 additions and 6 deletions.
8 changes: 4 additions & 4 deletions docs/usage.md
@@ -385,13 +385,13 @@ Currently you can only serialize in-memory variables to kerchunk references if t
When you have many chunks, the reference file can get large enough to be unwieldy as json. In that case the references can be instead stored as parquet. Again this uses kerchunk internally.

```python
-combined_vds.virtualize.to_kerchunk('combined.parq', format='parquet')
+combined_vds.virtualize.to_kerchunk('combined.parquet', format='parquet')
```

And again we can read these references using the "kerchunk" backend as if it were a regular Zarr store:

```python
-combined_ds = xr.open_dataset('combined.parq', engine="kerchunk")
+combined_ds = xr.open_dataset('combined.parquet', engine="kerchunk")
```

By default references are placed in a separate parquet file when the total number of references exceeds `record_size`. If there are fewer than `categorical_threshold` unique urls referenced by a particular variable, the urls will be stored as a categorical variable.
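A hedged sketch of how these options might be tuned follows. It assumes `record_size` and `categorical_threshold` are keyword arguments of `to_kerchunk` when `format='parquet'`; the values shown are illustrative, not taken from this commit.

```python
# Sketch only: assumes to_kerchunk accepts these tuning options for
# format='parquet' and forwards them to kerchunk's parquet writer.
combined_vds.virtualize.to_kerchunk(
    'combined.parquet',
    format='parquet',
    record_size=100_000,       # start a new parquet file past this many references
    categorical_threshold=10,  # urls with fewer unique values per variable are stored categorically
)
```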
@@ -444,9 +444,9 @@ You can open existing Kerchunk `json` or `parquet` references as Virtualizarr vi

```python

-vds = open_virtual_dataset('combined.json', format='kerchunk')
+vds = open_virtual_dataset('combined.json', filetype='kerchunk', indexes={})
# or
-vds = open_virtual_dataset('combined.parquet', format='kerchunk')
+vds = open_virtual_dataset('combined.parquet', filetype='kerchunk', indexes={})

```

6 changes: 4 additions & 2 deletions virtualizarr/readers/kerchunk.py
@@ -38,7 +38,9 @@ def open_virtual_dataset(
    fs = _FsspecFSFromFilepath(filepath=filepath, reader_options=reader_options)

    # The kerchunk .parquet storage format isn't actually a parquet, but a directory that contains named parquets for each group/variable.
-    if fs.filepath.endswith("ref.parquet"):
+    if fs.filepath.endswith(".parquet") and fs.fs.isfile(
+        f"{fs.filepath}/.zmetadata"
+    ):
        from fsspec.implementations.reference import LazyReferenceMapper

        lrm = LazyReferenceMapper(filepath, fs.fs)
@@ -61,7 +63,7 @@

    else:
        raise ValueError(
-            "The input Kerchunk reference did not seem to be in Kerchunk's JSON or Parquet spec: https://fsspec.github.io/kerchunk/spec.html. The Kerchunk format autodetection is quite flaky, so if your reference matches the Kerchunk spec feel free to open an issue: https://github.com/zarr-developers/VirtualiZarr/issues"
+            "The input Kerchunk reference did not seem to be in Kerchunk's JSON or Parquet spec: https://fsspec.github.io/kerchunk/spec.html. If your Kerchunk generated references are saved in parquet format, make sure the file extension is `.parquet`. The Kerchunk format autodetection is quite flaky, so if your reference matches the Kerchunk spec feel free to open an issue: https://github.com/zarr-developers/VirtualiZarr/issues"
        )

    # TODO would be more efficient to drop these before converting them into ManifestArrays, i.e. drop them from the kerchunk refs dict
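For context on the fix, here is a standalone, hedged sketch of the new detection rule. The helper `looks_like_kerchunk_parquet` is hypothetical, written only for illustration, and is not part of the codebase.

```python
# Illustration of the fixed detection logic. A kerchunk "parquet" store is
# really a directory (e.g. combined.parquet/) containing a .zmetadata file
# plus per-variable parquet reference files, so the extension alone is not
# a reliable signal.
import fsspec


def looks_like_kerchunk_parquet(path: str) -> bool:
    fs, _ = fsspec.core.url_to_fs(path)
    # Old check: path.endswith("ref.parquet"), which missed stores with other
    # names. New check: any ".parquet" directory containing zarr metadata.
    return path.endswith(".parquet") and fs.isfile(f"{path}/.zmetadata")
```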
