Append to icechunk stores #272

abarciauskas-bgse · 2024-10-25T16:08:18Z

This resizes the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.

Also Zarr append ref: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/array.py#L1134-L1186
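
To make the offset idea concrete, here is a minimal sketch, not the code in this PR. It assumes chunk keys are strings of integer indices joined by a separator (`.` here); the helper name `offset_chunk_key` and its parameters are hypothetical.

```python
def offset_chunk_key(
    chunk_key: str,
    append_axis: int,
    existing_num_chunks: int,
    separator: str = ".",
) -> str:
    """Shift one index of a chunk key by the number of chunks already stored."""
    indices = [int(part) for part in chunk_key.split(separator)]
    # Only the index along the append dimension moves; the others are unchanged.
    indices[append_axis] += existing_num_chunks
    return separator.join(str(index) for index in indices)


# With 4 chunks already stored along axis 0, the first appended chunk
# "0.1.2" is written at "4.1.2".
print(offset_chunk_key("0.1.2", append_axis=0, existing_num_chunks=4))
```

This only works if every existing chunk along the append dimension is full, which is the concern raised later in the review.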

Closes Support appending to icechunk store #311
Tests added
Tests passing
Full type hint coverage
Changes are documented in docs/releases.rst
New functions/methods are listed in api.rst
New functionality has documentation

virtualizarr/writers/icechunk.py

TomNicholas

All this does at the moment is resize the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.

I think that's great! Does xarray have any similar logic in it?

Also this is not fully working yet, it is getting a decompression error 😭

This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.

virtualizarr/writers/icechunk.py

TomNicholas · 2024-10-25T17:34:32Z

virtualizarr/writers/icechunk.py

+    mode = store.mode.str
+
+    # Aimee: resize the array if it already exists
+    # TODO: assert chunking and encoding is the same


Should also test that it raises a clear error if you try to append with chunks of a different dtype etc. I would hope zarr-python would throw that for us.
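
A rough sketch of the kind of check the TODO describes, in case zarr-python does not raise a clear error on its own. `ArrayMeta` and `check_appendable` are hypothetical names invented for illustration; they are not part of VirtualiZarr or zarr-python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayMeta:
    """Hypothetical stand-in for the metadata of an existing or incoming array."""

    shape: tuple[int, ...]
    chunks: tuple[int, ...]
    dtype: str


def check_appendable(existing: ArrayMeta, new: ArrayMeta, append_axis: int) -> None:
    """Raise ValueError if `new` cannot be appended to `existing`."""
    if existing.dtype != new.dtype:
        raise ValueError(f"dtype mismatch: {existing.dtype} != {new.dtype}")
    if existing.chunks != new.chunks:
        raise ValueError(f"chunk shape mismatch: {existing.chunks} != {new.chunks}")
    if len(existing.shape) != len(new.shape):
        raise ValueError("arrays have a different number of dimensions")
    # Every dimension except the append dimension must have the same length.
    for axis, (old, incoming) in enumerate(zip(existing.shape, new.shape)):
        if axis != append_axis and old != incoming:
            raise ValueError(f"shape mismatch on axis {axis}: {old} != {incoming}")
```

A test could then assert that appending a `float64` array to an `int32` array raises `ValueError` with a readable message.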

virtualizarr/writers/icechunk.py

TomNicholas · 2024-10-25T17:39:15Z

virtualizarr/writers/icechunk.py

+            existing_num_chunks = int(
+                existing_size / existing_array.chunks[append_axis]
+            )


There's a whole beartrap here around noticing if the last chunk is smaller than the other chunks. We should throw in that case (because zarr can't support it without variable-length chunks).
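
One way to avoid the trap, sketched under the assumption that only the array length and chunk length along the append axis are needed. The function name `num_complete_chunks` is hypothetical. Note that `int(existing_size / chunk_size)` silently truncates a partial final chunk, whereas `divmod` exposes the remainder.

```python
def num_complete_chunks(existing_size: int, chunk_size: int) -> int:
    """Return the chunk count along an axis, refusing a partial final chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    count, remainder = divmod(existing_size, chunk_size)
    if remainder != 0:
        # The last chunk holds only `remainder` elements, so appended chunks
        # could not simply be placed after it with an index offset.
        raise ValueError(
            f"cannot append: last chunk is partial "
            f"({remainder} of {chunk_size} elements along the append axis)"
        )
    return count
```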

virtualizarr/writers/icechunk.py

abarciauskas-bgse · 2024-10-25T19:51:24Z

I think that's great! Does xarray have any similar logic in it?

In the case of appending to a zarr store using xarray:

From what I can tell, resizing happens here. (Btw, if someone can explain write_region to me I would appreciate it; I couldn't find good documentation anywhere.)
For writing actual chunks of data to keys, I believe that currently happens here (in zarr.array._set_selection). I will continue to dig into how it's working in xarray and zarr to understand how it should work here.

Also this is not fully working yet, it is getting a decompression error 😭

This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.

Yes I think my next step will be to write some simple tests.

codecov · 2024-10-26T22:04:10Z

Codecov Report

Attention: Patch coverage is 88.61985% with 47 lines in your changes missing coverage. Please review.

Project coverage is 93.31%. Comparing base (3d7a4be) to head (7dc9186).
Report is 8 commits behind head on main.

Files with missing lines	Patch %	Lines
virtualizarr/tests/test_writers/test_icechunk.py	83.97%	25 Missing ⚠️
virtualizarr/codecs.py	77.77%	10 Missing ⚠️
virtualizarr/manifests/utils.py	85.45%	8 Missing ⚠️
virtualizarr/writers/icechunk.py	94.23%	3 Missing ⚠️
virtualizarr/accessor.py	66.66%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #272       +/-   ##
===========================================
+ Coverage   77.76%   93.31%   +15.54%     
===========================================
  Files          48       51        +3     
  Lines        3378     3876      +498     
===========================================
+ Hits         2627     3617      +990     
+ Misses        751      259      -492

Flag	Coverage Δ
unittests	`93.31% <88.61%> (+15.54%)`	⬆️

TomNicholas

Thanks @abarciauskas-bgse ! I have a lot of smaller comments, but generally I think this is looking really promising!

virtualizarr/tests/test_writers/test_icechunk.py

virtualizarr/tests/test_writers/test_icechunk_append.py

TomNicholas · 2024-11-07T18:47:47Z

virtualizarr/tests/test_writers/test_icechunk_append.py

+    icechunk_filestore.commit(
+        "test commit"
+    )  # need to commit it in order to append to it in the next lines


I'm confused why that would be the case. What goes wrong if you write without committing, then append?

Is it to do with the mode?

abarciauskas-bgse · 2024-11-12

We need to open the existing store in append mode in order to append; otherwise I get the error:

zarr.errors.ContainsGroupError: A group exists in store <icechunk.IcechunkStore object at 0x10eaf9100> at path ''.

That's the error I get when I just try to use the store object returned by IcechunkStore.create(...). But if I use a store with mode='a' and do not commit to the first store object, I get the following error:

FileNotFoundError: <icechunk.IcechunkStore object at 0x10960d490>

virtualizarr/tests/test_writers/test_icechunk_append.py

virtualizarr/writers/icechunk.py

TomNicholas · 2024-11-08T16:28:25Z

virtualizarr/writers/icechunk.py

+        # determine number of existing chunks along the append axis
+        existing_num_chunks = num_chunks(
+            array=group[name],
+            axis=append_axis,
+        )

-    # creates array if it doesn't already exist
-    arr = group.require_array(
-        name=name,
-        shape=zarray.shape,
-        chunk_shape=zarray.chunks,
-        dtype=encode_dtype(zarray.dtype),
-        codecs=zarray._v3_codec_pipeline(),
-        dimension_names=var.dims,
-        fill_value=zarray.fill_value,
-        # TODO fill_value?
-    )
-
-    # TODO it would be nice if we could assign directly to the .attrs property
-    for k, v in var.attrs.items():
-        arr.attrs[k] = encode_zarr_attr_value(v)
-    arr.attrs["_ARRAY_DIMENSIONS"] = encode_zarr_attr_value(var.dims)
+        # resize the array
+        arr = resize_array(
+            group=group,
+            name=name,
+            var=var,
+            append_axis=append_axis,
+        )


Here you determine existing_num_chunks, but then don't actually use it until inside write_manifest_virtual_refs. I think you could move the num_chunks call inside write_manifest_virtual_refs, and eliminate the need to pass the existing_num_chunks arg down.

abarciauskas-bgse · 2024-11-11

Ah yes, you're right, but the challenge is that after we resize the array, num_chunks will no longer return the original (pre-append) chunk count. I will think about whether there is a better way to handle this, so we don't have to pass the existing_num_chunks arg around.
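
The ordering problem can be illustrated with a small stand-in. `FakeArray` is hypothetical, and this `num_chunks` is only an assumed implementation matching the `num_chunks(array=..., axis=...)` call in the diff above; the real helper may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class FakeArray:
    """Hypothetical stand-in for a resizable chunked array."""

    shape: tuple[int, ...]
    chunks: tuple[int, ...]

    def resize(self, new_shape: tuple[int, ...]) -> None:
        self.shape = tuple(new_shape)


def num_chunks(array: FakeArray, axis: int) -> int:
    """Assumed implementation: chunk count along one axis."""
    return math.ceil(array.shape[axis] / array.chunks[axis])


arr = FakeArray(shape=(8, 10), chunks=(4, 10))

# Record the offset BEFORE resizing; this is the value the chunk keys need.
existing_num_chunks = num_chunks(array=arr, axis=0)

# Grow the array to make room for 4 more rows along axis 0.
arr.resize((12, 10))

# After the resize the same call reports the new total, not the offset.
print(existing_num_chunks, num_chunks(array=arr, axis=0))
```

So the count has to be captured before `resize_array` runs, wherever that call ends up living.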

virtualizarr/writers/icechunk.py

abarciauskas-bgse · 2024-12-05T03:20:34Z

Thanks @mpiannucci but it's still not working for me... when I run IcechunkStore.create it returns a coroutine, not a store. I'm confounded! Not trying to be difficult, I swear! See https://github.com/zarr-developers/VirtualiZarr/blob/e38823c20134858029188260bb834669a202b13e/noaa-cdr-sst.ipynb

I just actually tried with icechunk 0.1.0a5 and I'm also getting a coroutine instead of a store. So something else must be going on...

mpiannucci · 2024-12-05T03:24:51Z

I believe you, that's very frustrating! That is also very confusing because it is definitely not an async function in a7!

https://github.com/earth-mover/icechunk/blob/a8c33b81d329c32a8e34e7151a1a71620967067a/icechunk-python/python/icechunk/__init__.py#L133

Is there any chance your environment is crossed up? Does icechunk.__version__ match the pip version output?
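
A quick way to run that comparison from inside the same interpreter or notebook kernel, as a sketch using only the standard library. It assumes the import name and the installed distribution name are the same; `version_report` is a hypothetical helper.

```python
import importlib
from importlib import metadata


def version_report(package: str) -> dict:
    """Compare the version a module reports with the installed metadata."""
    report = {"imported": None, "installed": None, "location": None}
    try:
        module = importlib.import_module(package)
    except ImportError:
        module = None
    if module is not None:
        # Version the running interpreter actually imported, and from where.
        report["imported"] = getattr(module, "__version__", None)
        report["location"] = getattr(module, "__file__", None)
    try:
        # Version recorded by the installer (what pip reports).
        report["installed"] = metadata.version(package)
    except metadata.PackageNotFoundError:
        pass
    return report


print(version_report("icechunk"))
```

If `imported` and `installed` disagree, or `location` points into an unexpected environment, the kernel is probably not using the environment that pip installed into.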

abarciauskas-bgse · 2024-12-05T03:27:24Z

@mpiannucci yes, my environment (well, at least one of them) was messed up. Another one is working 😅 so we should be good now. I'm going to merge your version of the notebook and move it to the examples directory after cleaning up some extra cells...

abarciauskas-bgse · 2024-12-05T03:55:08Z

ok @TomNicholas I think this is finally g2g. I added to the release docs and checked the autogenerated API documentation for to_icechunk.

As far as the example I added in examples/ ... perhaps in another PR we can add this to the virtualizarr docs or icechunk docs.

TomNicholas

LGTM!

ci/upstream.yml

pyproject.toml

virtualizarr/accessor.py

virtualizarr/tests/test_writers/test_icechunk.py

TomNicholas · 2024-12-05T16:49:28Z

(@abarciauskas-bgse you should have merge rights)

Initial attempt at appending

d3a4048

abarciauskas-bgse temporarily deployed to test-release October 25, 2024 16:08 — with GitHub Actions Inactive

abarciauskas-bgse requested review from norlandrhagen and TomNicholas October 25, 2024 16:10

norlandrhagen reviewed Oct 25, 2024

virtualizarr/writers/icechunk.py

norlandrhagen reviewed Oct 25, 2024

virtualizarr/writers/icechunk.py

TomNicholas added the enhancement (New feature or request) and Icechunk 🧊 (Relates to Icechunk library / spec) labels Oct 25, 2024

TomNicholas reviewed Oct 25, 2024

abarciauskas-bgse self-assigned this Oct 25, 2024

abarciauskas-bgse added 2 commits October 25, 2024 16:10

Working on tests for generate chunk key function

5d5f9e2

Linting

360ea14

abarciauskas-bgse temporarily deployed to test-release October 26, 2024 19:25 — with GitHub Actions Inactive

Refactor gen virtual dataset method

d3c2851

abarciauskas-bgse temporarily deployed to test-release October 26, 2024 22:03 — with GitHub Actions Inactive

abarciauskas-bgse added 2 commits October 27, 2024 16:56

Fix spelling

a7a1e50

Linting

0365a45

abarciauskas-bgse temporarily deployed to test-release October 28, 2024 00:02 — with GitHub Actions Inactive

Linting

5846d7e

abarciauskas-bgse temporarily deployed to test-release October 28, 2024 23:07 — with GitHub Actions Inactive

Linting

66bbd6e

abarciauskas-bgse temporarily deployed to test-release October 30, 2024 01:55 — with GitHub Actions Inactive

Passing compression test

000c68f

abarciauskas-bgse temporarily deployed to test-release November 1, 2024 22:34 — with GitHub Actions Inactive

TomNicholas and others added 2 commits November 5, 2024 11:38

Merge branch 'main' into icechunk-append

3131167

[pre-commit.ci] auto fixes from pre-commit.com hooks

5906687

for more information, see https://pre-commit.ci

pre-commit-ci bot temporarily deployed to test-release November 5, 2024 18:39 Inactive

TomNicholas reviewed Nov 8, 2024

print store

e38823c

abarciauskas-bgse temporarily deployed to test-release December 5, 2024 03:20 — with GitHub Actions Inactive

Update notebook (#327)

ad17b83

Co-authored-by: Aimee Barciauskas <[email protected]>

abarciauskas-bgse temporarily deployed to test-release December 5, 2024 03:35 — with GitHub Actions Inactive

Add append to examples

8b9a830

abarciauskas-bgse temporarily deployed to test-release December 5, 2024 03:46 — with GitHub Actions Inactive

Add to releases.rst

3f9f58c

abarciauskas-bgse temporarily deployed to test-release December 5, 2024 03:52 — with GitHub Actions Inactive

Revert change to .gitignore

8496359

abarciauskas-bgse temporarily deployed to test-release December 5, 2024 03:54 — with GitHub Actions Inactive

Merge branch 'main' into icechunk-append

7dc9186

TomNicholas temporarily deployed to test-release December 5, 2024 05:14 — with GitHub Actions Inactive

TomNicholas approved these changes Dec 5, 2024

ci/upstream.yml

pyproject.toml

virtualizarr/accessor.py

virtualizarr/tests/test_writers/test_icechunk.py

abarciauskas-bgse and others added 4 commits December 5, 2024 08:25

Update ci/upstream.yml

491b701

Co-authored-by: Tom Nicholas <[email protected]>

Update pyproject.toml

94ef469

Co-authored-by: Tom Nicholas <[email protected]>

Update virtualizarr/tests/test_writers/test_icechunk.py

fad188b

Co-authored-by: Tom Nicholas <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

8df67e9

for more information, see https://pre-commit.ci

pre-commit-ci bot temporarily deployed to test-release December 5, 2024 16:26 Inactive

Update virtualizarr/accessor.py

258c92f

Co-authored-by: Tom Nicholas <[email protected]>

abarciauskas-bgse temporarily deployed to test-release December 5, 2024 16:27 — with GitHub Actions Inactive

abarciauskas-bgse added 2 commits December 5, 2024 08:58

Separate out multiple arrays test

84a4d01

Merge branch 'icechunk-append' of github.com:zarr-developers/virtuali…

299f580

…zarr into icechunk-append

abarciauskas-bgse temporarily deployed to test-release December 5, 2024 16:59 — with GitHub Actions Inactive

abarciauskas-bgse merged commit 4d85a03 into main Dec 5, 2024
11 checks passed

TomNicholas deleted the icechunk-append branch December 5, 2024 21:41
