Merge branch 'main' into zarr_reader

zarr-developers · Dec 9, 2024 · 7b57bd0
2 parents ac105ea + af9c374
commit 7b57bd0

Showing 39 changed files with 3,534 additions and 678 deletions.
diff --git a/README.md b/README.md
@@ -14,13 +14,27 @@
 [![Conda - Downloads](https://img.shields.io/conda/d/conda-forge/virtualizarr
 )](https://anaconda.org/conda-forge/virtualizarr)
 
-**VirtualiZarr creates virtual Zarr stores for cloud-friendly access to archival data, using familiar xarray syntax.**
+## Cloud-Optimize your Scientific Data as Virtual Zarr stores, using xarray syntax.
 
-VirtualiZarr (pronounced like "virtualizer" but more piratey) grew out of [discussions](https://github.com/fsspec/kerchunk/issues/377) on the [kerchunk repository](https://github.com/fsspec/kerchunk), and is an attempt to provide the game-changing power of kerchunk in a zarr-native way, and with a familiar array-like API.
+The best way to distribute large scientific datasets is via the Cloud, in [Cloud-Optimized formats](https://guide.cloudnativegeo.org/) [^1]. But often this data is stuck in legacy pre-Cloud file formats such as netCDF.
 
-You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.
+**VirtualiZarr[^2] makes it easy to create "Virtual" Zarr stores, allowing performant access to legacy data as if it were in the Cloud-Optimized [Zarr format](https://zarr.dev/), _without duplicating any data_.**
+
+Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html).
+
+### Features
+
+* Create virtual references pointing to bytes inside a legacy file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets),
+* Supports a [range of legacy file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5,
+* [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger store using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html),
+* Commit the virtual references to storage either using the [Kerchunk references](https://fsspec.github.io/kerchunk/spec.html) specification or the [Icechunk](https://icechunk.io/) transactional storage engine.
+* Users access the virtual dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset).
+
+### Inspired by Kerchunk
+
+VirtualiZarr grew out of [discussions](https://github.com/fsspec/kerchunk/issues/377) on the [Kerchunk repository](https://github.com/fsspec/kerchunk), and is an attempt to provide the game-changing power of Kerchunk in a Zarr-native way, with a familiar array-like API.
 
-_Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html)_
+You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.
 
 ### Development Status and Roadmap
 
@@ -38,10 +52,23 @@ We have a lot of ideas, including:
 
 If you see other opportunities then we would love to hear your ideas!
 
+### Talks and Presentations
+
+- 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-talk-at-met-office)
+- 2024/11/13 - Cloud-Native Geospatial conference - Raphael Hagen - [Slides](https://decks.carbonplan.org/cloud-native-geo/11-13-24)
+- 2024/07/24 - ESIP Meeting - Sean Harkins - [Event](https://2024julyesipmeeting.sched.com/event/1eVP6) / [Recording](https://youtu.be/T6QAwJIwI3Q?t=3689)
+- 2024/05/15 - Pangeo showcase - Tom Nicholas - [Event](https://discourse.pangeo.io/t/pangeo-showcase-virtualizarr-create-virtual-zarr-stores-using-xarray-syntax/4127/2) / [Recording](https://youtu.be/ioxgzhDaYiE) / [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-create-virtual-zarr-stores-using-xarray-syntax)
+
 ### Credits
 
 This package was originally developed by [Tom Nicholas](https://github.com/TomNicholas) whilst working at [[C]Worthy](cworthy.org), who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project.
 
 ### Licence
 
 Apache 2.0
+
+### References
+
+[^1]: [_Cloud-Native Repositories for Big Scientific Data_, Abernathey et al., _Computing in Science & Engineering_.](https://ieeexplore.ieee.org/abstract/document/9354557)
+
+[^2]: (Pronounced like "virtualizer" but more piratey 🦜)
diff --git a/ci/upstream.yml b/ci/upstream.yml
@@ -28,6 +28,6 @@ dependencies:
   - fsspec
   - pip
   - pip:
-    - icechunk # Installs zarr v3 as dependency
-    # - git+https://github.com/fsspec/kerchunk@main  # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
-    - imagecodecs-numcodecs==2024.6.1
+      - icechunk>=0.1.0a7 # Installs zarr v3 as dependency
+      # - git+https://github.com/fsspec/kerchunk@main  # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
+      - imagecodecs-numcodecs==2024.6.1
diff --git a/conftest.py b/conftest.py
@@ -1,3 +1,5 @@
+from typing import Any, Callable, Dict, Optional
+
 import h5py
 import numpy as np
 import pytest
@@ -35,6 +37,32 @@ def netcdf4_file(tmpdir):
     return filepath
 
 
+@pytest.fixture
+def netcdf4_files_factory(tmpdir) -> Callable[..., tuple[str, str]]:
+    def create_netcdf4_files(
+        encoding: Optional[Dict[str, Dict[str, Any]]] = None,
+    ) -> tuple[str, str]:
+        ds = xr.tutorial.open_dataset("air_temperature")
+
+        # Split dataset into two parts
+        ds1 = ds.isel(time=slice(None, 1460))
+        ds2 = ds.isel(time=slice(1460, None))
+
+        # Save datasets to disk as NetCDF in the temporary directory with the provided encoding
+        filepath1 = f"{tmpdir}/air1.nc"
+        filepath2 = f"{tmpdir}/air2.nc"
+        ds1.to_netcdf(filepath1, encoding=encoding)
+        ds2.to_netcdf(filepath2, encoding=encoding)
+
+        # Close datasets
+        ds1.close()
+        ds2.close()
+
+        return filepath1, filepath2
+
+    return create_netcdf4_files
+
+
 @pytest.fixture
 def netcdf4_file_with_2d_coords(tmpdir):
     ds = xr.tutorial.open_dataset("ROMS_example")
@@ -71,26 +99,6 @@ def hdf5_groups_file(tmpdir):
     return filepath
 
 
-@pytest.fixture
-def netcdf4_files(tmpdir):
-    # Set up example xarray dataset
-    ds = xr.tutorial.open_dataset("air_temperature")
-
-    # split inrto equal chunks so we can concatenate them back together later
-    ds1 = ds.isel(time=slice(None, 1460))
-    ds2 = ds.isel(time=slice(1460, None))
-
-    # Save it to disk as netCDF (in temporary directory)
-    filepath1 = f"{tmpdir}/air1.nc"
-    filepath2 = f"{tmpdir}/air2.nc"
-    ds1.to_netcdf(filepath1)
-    ds2.to_netcdf(filepath2)
-    ds1.close()
-    ds2.close()
-
-    return filepath1, filepath2
-
-
 @pytest.fixture
 def hdf5_empty(tmpdir):
     filepath = f"{tmpdir}/empty.nc"
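
The `conftest.py` hunks above replace the fixed `netcdf4_files` fixture with `netcdf4_files_factory`, which returns a function so that each test can choose its own `encoding` when the files are written. A minimal sketch of that factory pattern follows, using only the standard library. It is an illustration under stated assumptions, not the project's code: plain text files stand in for netCDF, the hypothetical `header` argument stands in for `encoding`, and `make_files_factory` and `create_files` are made-up names. In a real `conftest.py` the outer function would be decorated with `@pytest.fixture` and receive `tmpdir` from pytest.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def make_files_factory(tmpdir: str) -> Callable[..., tuple[str, str]]:
    """Return a function that writes two files, configured at call time."""

    def create_files(header: Optional[str] = None) -> tuple[str, str]:
        # Stand-in for the dataset: ten rows, split into two halves.
        rows = [f"row {i}" for i in range(10)]
        parts = (rows[:5], rows[5:])

        paths = []
        for n, part in enumerate(parts, start=1):
            path = Path(tmpdir) / f"part{n}.txt"
            # `header` plays the role of `encoding`: an optional per-call setting.
            lines = ([header] if header is not None else []) + part
            path.write_text("\n".join(lines))
            paths.append(str(path))

        return paths[0], paths[1]

    return create_files


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        create = make_files_factory(tmp)
        first, second = create(header="# units: K")
        print(Path(first).read_text())
```

Returning a closure rather than the file paths themselves is what lets one fixture serve tests that need different write-time options.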

diff --git a/docs/contributing.md b/docs/contributing.md
@@ -39,13 +39,17 @@ Open `docs/_build/html/index.html` in a web browser (on MacOS you can do this fr
 
 ## Making a release
 
-Anyone with commit privileges to the repository can issue a release.
-
-1. Navigate to the [https://github.com/zarr-developers/virtualizarr/releases](https://github.com/zarr-developers/virtualizarr/releases) releases page.
-2. Select draft a new release.
-3. Select 'Choose a tag', then 'create a new tag'
-4. Enter the name for the new tag following the [EffVer](https://jacobtomlinson.dev/effver/) versioning scheme (e.g., releasing v0.2.0 as the next release after v0.1.0 denotes that “some small effort may be required to make sure this version works for you”).
-4. Click 'Generate Release Notes' to draft notes based on merged pull requests.
-5. Edit the draft release notes for consistency.
-6. Select 'Publish' to publish the release. This should automatically upload the new release to PyPI and Conda-Forge.
-7. Create and merge a PR to add a new empty section to the `docs/releases.rst` for the next release in the future.
+Anyone with commit privileges to the repository can issue a release. Feel free to do so whenever all the CI tests on `main` are passing.
+
+1. Decide on the release version number for the new release, following the [EffVer](https://jacobtomlinson.dev/effver/) versioning scheme (e.g., releasing v0.2.0 as the next release after v0.1.0 denotes that “some small effort may be required to make sure this version works for you”).
+2. Write a high-level summary of the changes in this release and add it to the release notes in `docs/releases.rst`. Create and merge a PR that adds the summary and updates the release notes with today's date and the new version number. Don't add the blank template for future releases yet.
+3. Navigate to the [releases page](https://github.com/zarr-developers/virtualizarr/releases).
+4. Select 'Draft a new release'.
+5. Select 'Choose a tag', then 'Create a new tag'.
+6. Enter the name for the new tag (i.e. the release version number).
+7. Click 'Generate Release Notes' to draft notes based on merged pull requests, and paste the same release summary you wrote earlier at the top.
+8. Edit the draft release notes for consistency.
+9. Select 'Publish' to publish the release. This should automatically upload the new release to [PyPI](https://pypi.org/project/virtualizarr/) and [conda-forge](https://anaconda.org/conda-forge/virtualizarr).
+10. Check that this has run successfully (PyPI should show the new version number very quickly, but conda-forge might take several hours).
+11. Create and merge a PR to add a new empty section to `docs/releases.rst` for the next release.
+12. (Optional) Advertise the release on social media 📣
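
Step 1 of the release checklist above asks for a version number chosen under EffVer. As a rough sketch, the helper below computes the next tag from the current one. It assumes tags have the form `vMACRO.MESO.MICRO` (inferred from the `v0.1.0` to `v0.2.0` example in step 1) and uses the macro/meso/micro segment names from the EffVer page as best understood here. `next_tag` is a hypothetical helper, not part of VirtualiZarr or its release tooling.

```python
import re

# Assumed tag shape: "v" followed by three dot-separated integers.
_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def next_tag(current: str, effort: str) -> str:
    """Return the next release tag for the given expected upgrade effort."""
    match = _TAG.match(current)
    if match is None:
        raise ValueError(f"unrecognised tag: {current!r}")
    macro, meso, micro = (int(group) for group in match.groups())

    if effort == "macro":  # large effort expected from users
        return f"v{macro + 1}.0.0"
    if effort == "meso":  # some small effort may be required
        return f"v{macro}.{meso + 1}.0"
    if effort == "micro":  # no effort expected
        return f"v{macro}.{meso}.{micro + 1}"
    raise ValueError(f"unknown effort level: {effort!r}")


if __name__ == "__main__":
    print(next_tag("v0.1.0", "meso"))
```

Bumping a segment resets the segments to its right, which matches the `v0.1.0` to `v0.2.0` example in the checklist.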
diff --git a/docs/examples.md b/docs/examples.md
@@ -0,0 +1,7 @@
+# Examples
+
+The following examples demonstrate the use of VirtualiZarr to create virtual datasets of various kinds:
+
+1. [Appending new daily NOAA SST data to Icechunk](https://github.com/zarr-developers/VirtualiZarr/blob/main/examples/append/noaa-cdr-sst.ipynb)
+2. [Parallel reference generation using Coiled Functions](https://github.com/zarr-developers/VirtualiZarr/blob/main/examples/coiled/terraclimate.ipynb)
+3. [Serverless parallel reference generation using Lithops](https://github.com/zarr-developers/VirtualiZarr/tree/main/examples/virtualizarr-with-lithops)