Merge branch 'main' into zarr_reader
norlandrhagen committed Dec 9, 2024
2 parents ac105ea + af9c374 commit 7b57bd0
Showing 39 changed files with 3,534 additions and 678 deletions.
35 changes: 31 additions & 4 deletions README.md
@@ -14,13 +14,27 @@
[![Conda - Downloads](https://img.shields.io/conda/d/conda-forge/virtualizarr)](https://anaconda.org/conda-forge/virtualizarr)

**VirtualiZarr creates virtual Zarr stores for cloud-friendly access to archival data, using familiar xarray syntax.**
## Cloud-Optimize your Scientific Data as Virtual Zarr stores, using xarray syntax.

VirtualiZarr (pronounced like "virtualizer" but more piratey) grew out of [discussions](https://github.com/fsspec/kerchunk/issues/377) on the [kerchunk repository](https://github.com/fsspec/kerchunk), and is an attempt to provide the game-changing power of kerchunk in a zarr-native way, and with a familiar array-like API.
The best way to distribute large scientific datasets is via the Cloud, in [Cloud-Optimized formats](https://guide.cloudnativegeo.org/) [^1]. But often this data is stuck in legacy pre-Cloud file formats such as netCDF.

You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.
**VirtualiZarr[^2] makes it easy to create "Virtual" Zarr stores, allowing performant access to legacy data as if it were in the Cloud-Optimized [Zarr format](https://zarr.dev/), _without duplicating any data_.**

Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html).

### Features

* Create virtual references pointing to bytes inside a legacy file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets),
* Supports a [range of legacy file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5,
* [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger store using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html),
* Commit the virtual references to storage either using the [Kerchunk references](https://fsspec.github.io/kerchunk/spec.html) specification or the [Icechunk](https://icechunk.io/) transactional storage engine.
* Users access the virtual dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset).

### Inspired by Kerchunk

VirtualiZarr grew out of [discussions](https://github.com/fsspec/kerchunk/issues/377) on the [Kerchunk repository](https://github.com/fsspec/kerchunk), and is an attempt to provide the game-changing power of kerchunk but in a zarr-native way, and with a familiar array-like API.

_Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html)_
You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.

### Development Status and Roadmap

@@ -38,10 +52,23 @@ We have a lot of ideas, including:

If you see other opportunities then we would love to hear your ideas!

### Talks and Presentations

- 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-talk-at-met-office)
- 2024/11/13 - Cloud-Native Geospatial conference - Raphael Hagen - [Slides](https://decks.carbonplan.org/cloud-native-geo/11-13-24)
- 2024/07/24 - ESIP Meeting - Sean Harkins - [Event](https://2024julyesipmeeting.sched.com/event/1eVP6) / [Recording](https://youtu.be/T6QAwJIwI3Q?t=3689)
- 2024/05/15 - Pangeo showcase - Tom Nicholas - [Event](https://discourse.pangeo.io/t/pangeo-showcase-virtualizarr-create-virtual-zarr-stores-using-xarray-syntax/4127/2) / [Recording](https://youtu.be/ioxgzhDaYiE) / [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-create-virtual-zarr-stores-using-xarray-syntax)

### Credits

This package was originally developed by [Tom Nicholas](https://github.com/TomNicholas) whilst working at [[C]Worthy](cworthy.org), who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project.

### Licence

Apache 2.0

### References

[^1]: [_Cloud-Native Repositories for Big Scientific Data_, Abernathey et. al., _Computing in Science & Engineering_.](https://ieeexplore.ieee.org/abstract/document/9354557)

[^2]: (Pronounced like "virtualizer" but more piratey 🦜)
6 changes: 3 additions & 3 deletions ci/upstream.yml
@@ -28,6 +28,6 @@ dependencies:
- fsspec
- pip
- pip:
- icechunk # Installs zarr v3 as dependency
# - git+https://github.com/fsspec/kerchunk@main # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
- imagecodecs-numcodecs==2024.6.1
- icechunk>=0.1.0a7 # Installs zarr v3 as dependency
# - git+https://github.com/fsspec/kerchunk@main # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
- imagecodecs-numcodecs==2024.6.1
48 changes: 28 additions & 20 deletions conftest.py
@@ -1,3 +1,5 @@
from typing import Any, Dict, Optional

import h5py
import numpy as np
import pytest
@@ -35,6 +37,32 @@ def netcdf4_file(tmpdir):
    return filepath


@pytest.fixture
def netcdf4_files_factory(tmpdir) -> callable:
    def create_netcdf4_files(
        encoding: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> tuple[str, str]:
        ds = xr.tutorial.open_dataset("air_temperature")

        # Split dataset into two parts
        ds1 = ds.isel(time=slice(None, 1460))
        ds2 = ds.isel(time=slice(1460, None))

        # Save datasets to disk as NetCDF in the temporary directory with the provided encoding
        filepath1 = f"{tmpdir}/air1.nc"
        filepath2 = f"{tmpdir}/air2.nc"
        ds1.to_netcdf(filepath1, encoding=encoding)
        ds2.to_netcdf(filepath2, encoding=encoding)

        # Close datasets
        ds1.close()
        ds2.close()

        return filepath1, filepath2

    return create_netcdf4_files


@pytest.fixture
def netcdf4_file_with_2d_coords(tmpdir):
    ds = xr.tutorial.open_dataset("ROMS_example")
@@ -71,26 +99,6 @@ def hdf5_groups_file(tmpdir):
    return filepath


@pytest.fixture
def netcdf4_files(tmpdir):
    # Set up example xarray dataset
    ds = xr.tutorial.open_dataset("air_temperature")

    # split inrto equal chunks so we can concatenate them back together later
    ds1 = ds.isel(time=slice(None, 1460))
    ds2 = ds.isel(time=slice(1460, None))

    # Save it to disk as netCDF (in temporary directory)
    filepath1 = f"{tmpdir}/air1.nc"
    filepath2 = f"{tmpdir}/air2.nc"
    ds1.to_netcdf(filepath1)
    ds2.to_netcdf(filepath2)
    ds1.close()
    ds2.close()

    return filepath1, filepath2


@pytest.fixture
def hdf5_empty(tmpdir):
    filepath = f"{tmpdir}/empty.nc"
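The `netcdf4_files_factory` fixture in this diff replaces the fixed `netcdf4_files` fixture with a factory: the fixture returns an inner function, so each test can pass its own `encoding` when the files are written. A minimal sketch of the same closure pattern follows, using only the standard library so that it runs without pytest, xarray, or a network connection. The names `make_files_factory` and `create_files`, the JSON file contents, and the example `encoding` value are illustrative assumptions, not code from the repository.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Tuple


def make_files_factory(tmpdir: Path) -> Callable[..., Tuple[str, str]]:
    """Hypothetical stand-in for a pytest factory fixture bound to one temp dir."""

    def create_files(
        encoding: Optional[Dict[str, Any]] = None,
    ) -> Tuple[str, str]:
        # Stand-in for the dataset: ten values split into two parts.
        values = list(range(10))
        parts = (values[:5], values[5:])

        # Write each part to its own file, recording the requested encoding.
        paths = (str(tmpdir / "part1.json"), str(tmpdir / "part2.json"))
        for path, part in zip(paths, parts):
            payload = {"data": part, "encoding": encoding or {}}
            Path(path).write_text(json.dumps(payload))

        return paths

    return create_files


# Usage: one factory, which different callers can invoke with different options.
with tempfile.TemporaryDirectory() as tmp:
    factory = make_files_factory(Path(tmp))
    path1, path2 = factory(encoding={"compression": "gzip"})
    first = json.loads(Path(path1).read_text())
    second = json.loads(Path(path2).read_text())
```

With the real fixture, a test would presumably request `netcdf4_files_factory` as an argument and call it, optionally passing an `encoding` dict keyed by variable name, as the `Dict[str, Dict[str, Any]]` type hint suggests.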
24 changes: 14 additions & 10 deletions docs/contributing.md
@@ -39,13 +39,17 @@ Open `docs/_build/html/index.html` in a web browser (on MacOS you can do this fr

## Making a release

Anyone with commit privileges to the repository can issue a release.

1. Navigate to the [https://github.com/zarr-developers/virtualizarr/releases](https://github.com/zarr-developers/virtualizarr/releases) releases page.
2. Select draft a new release.
3. Select 'Choose a tag', then 'create a new tag'
4. Enter the name for the new tag following the [EffVer](https://jacobtomlinson.dev/effver/) versioning scheme (e.g., releasing v0.2.0 as the next release after v0.1.0 denotes that “some small effort may be required to make sure this version works for you”).
4. Click 'Generate Release Notes' to draft notes based on merged pull requests.
5. Edit the draft release notes for consistency.
6. Select 'Publish' to publish the release. This should automatically upload the new release to PyPI and Conda-Forge.
7. Create and merge a PR to add a new empty section to the `docs/releases.rst` for the next release in the future.
Anyone with commit privileges to the repository can issue a release, and you should feel free to issue a release at any point in time when all the CI tests on `main` are passing.

1. Decide on the release version number for the new release, following the [EffVer](https://jacobtomlinson.dev/effver/) versioning scheme (e.g., releasing v0.2.0 as the next release after v0.1.0 denotes that “some small effort may be required to make sure this version works for you”).
2. Write a high-level summary of the changes in this release, and write it into the release notes in `docs/releases.rst`. Create and merge a PR which adds the summary and also changes the release notes to say today's date and the version number of the new release. Don't add the blank template for future releases yet.
3. Navigate to the [https://github.com/zarr-developers/virtualizarr/releases](https://github.com/zarr-developers/virtualizarr/releases) releases page.
4. Select 'Draft a new release'.
5. Select 'Choose a tag', then 'Create a new tag'
6. Enter the name for the new tag (i.e. the release version number).
7. Click 'Generate Release Notes' to draft notes based on merged pull requests, and paste the same release summary you wrote earlier at the top.
8. Edit the draft release notes for consistency.
9. Select 'Publish' to publish the release. This should automatically upload the new release to [PyPI](https://pypi.org/project/virtualizarr/) and [conda-forge](https://anaconda.org/conda-forge/virtualizarr).
10. Check that this has run successfully (PyPI should show the new version number very quickly, but conda-forge might take several hours).
11. Create and merge a PR to add a new empty section to the `docs/releases.rst` for the next release in the future.
12. (Optional) Advertise the release on social media 📣
7 changes: 7 additions & 0 deletions docs/examples.md
@@ -0,0 +1,7 @@
# Examples

The following examples demonstrate the use of VirtualiZarr to create virtual datasets of various kinds:

1. [Appending new daily NOAA SST data to Icechunk](https://github.com/zarr-developers/VirtualiZarr/blob/main/examples/append/noaa-cdr-sst.ipynb)
2. [Parallel reference generation using Coiled Functions](https://github.com/zarr-developers/VirtualiZarr/blob/main/examples/coiled/terraclimate.ipynb)
3. [Serverless parallel reference generation using Lithops](https://github.com/zarr-developers/VirtualiZarr/tree/main/examples/virtualizarr-with-lithops)