Append to icechunk stores #272

Merged
merged 101 commits into from
Dec 5, 2024

Conversation

abarciauskas-bgse
Collaborator

@abarciauskas-bgse abarciauskas-bgse commented Oct 25, 2024

This resizes the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.

Also Zarr append ref: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/array.py#L1134-L1186
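The resize-and-offset idea described above can be sketched in a few lines. This is only an illustration, not the code in this PR: `offset_chunk_key` and `appended_shape` are hypothetical helper names, and the sketch assumes chunk keys are dot-separated index strings such as "0.1.2".

```python
from typing import Sequence, Tuple


def offset_chunk_key(chunk_key: str, append_axis: int, existing_num_chunks: int) -> str:
    """Shift one chunk key along the append axis (hypothetical helper).

    Assumes keys are dot-separated chunk indices, e.g. "0.1.2".
    """
    indices = [int(part) for part in chunk_key.split(".")]
    indices[append_axis] += existing_num_chunks
    return ".".join(str(i) for i in indices)


def appended_shape(
    existing_shape: Sequence[int], new_shape: Sequence[int], append_axis: int
) -> Tuple[int, ...]:
    """Shape the existing array must be resized to before new chunks are written."""
    if len(existing_shape) != len(new_shape):
        raise ValueError("arrays must have the same number of dimensions")
    for axis, (old, new) in enumerate(zip(existing_shape, new_shape)):
        # Only the append axis is allowed to differ in length.
        if axis != append_axis and old != new:
            raise ValueError(f"shape mismatch on non-append axis {axis}: {old} != {new}")
    resized = list(existing_shape)
    resized[append_axis] += new_shape[append_axis]
    return tuple(resized)


# An existing array with 3 chunks along axis 0: the first new chunk lands at index 3.
print(offset_chunk_key("0.1.2", append_axis=0, existing_num_chunks=3))
print(appended_shape((30, 4), (20, 4), append_axis=0))
```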

  • Closes Support appending to icechunk store #311
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@TomNicholas TomNicholas added the enhancement (New feature or request) and Icechunk 🧊 (Relates to Icechunk library / spec) labels Oct 25, 2024
Member

@TomNicholas TomNicholas left a comment

> All this does at the moment is resize the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.

I think that's great! Does xarray have any similar logic in it?

> Also this is not fully working yet, it is getting a decompression error 😭

This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.

virtualizarr/writers/icechunk.py
mode = store.mode.str

# Aimee: resize the array if it already exists
# TODO: assert chunking and encoding is the same
Member

Should also test that it raises a clear error if you try to append with chunks of a different dtype etc. I would hope zarr-python would throw that for us.
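A rough sketch of what such a pre-append check and its test might look like follows. `ArraySpec` and `check_append_compatible` are hypothetical names invented for this illustration; whether zarr-python raises on its own in this situation is not established in this thread.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArraySpec:
    """Minimal stand-in for the metadata that must match (hypothetical)."""

    dtype: str
    chunks: Tuple[int, ...]


def check_append_compatible(existing: ArraySpec, new: ArraySpec) -> None:
    """Raise a single, readable error listing every mismatch."""
    problems = []
    if existing.dtype != new.dtype:
        problems.append(f"dtype {new.dtype!r} does not match existing {existing.dtype!r}")
    if existing.chunks != new.chunks:
        problems.append(f"chunks {new.chunks} do not match existing {existing.chunks}")
    if problems:
        raise ValueError("Cannot append: " + "; ".join(problems))


existing = ArraySpec(dtype="float32", chunks=(10, 4))
check_append_compatible(existing, ArraySpec(dtype="float32", chunks=(10, 4)))  # passes silently

try:
    check_append_compatible(existing, ArraySpec(dtype="int16", chunks=(10, 4)))
except ValueError as err:
    print(err)
```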

virtualizarr/writers/icechunk.py
Comment on lines 156 to 158
existing_num_chunks = int(
    existing_size / existing_array.chunks[append_axis]
)
Member

There's a whole beartrap here around noticing if the last chunk is smaller than the other chunks. We should throw in that case (because zarr can't support it without variable-length chunks).
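To make the trap concrete: with an existing length of 25 and a chunk size of 10, `int(25 / 10)` gives 2, silently ignoring the partial third chunk. A minimal sketch of a stricter count, using a hypothetical helper name (`num_chunks_along_axis`) rather than the function in this PR, could look like this:

```python
from typing import Sequence


def num_chunks_along_axis(shape: Sequence[int], chunks: Sequence[int], axis: int) -> int:
    """Count whole chunks along one axis, refusing a ragged final chunk."""
    size, chunk_size = shape[axis], chunks[axis]
    if chunk_size <= 0:
        raise ValueError(f"chunk size must be positive, got {chunk_size}")
    whole, remainder = divmod(size, chunk_size)
    if remainder:
        # The last existing chunk is only partly filled, so appended chunks
        # could not start on a chunk boundary.
        raise ValueError(
            f"cannot append along axis {axis}: length {size} is not a multiple "
            f"of the chunk size {chunk_size}"
        )
    return whole


print(num_chunks_along_axis((30, 4), (10, 4), axis=0))
```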

@abarciauskas-bgse abarciauskas-bgse self-assigned this Oct 25, 2024
@abarciauskas-bgse
Collaborator Author

> I think that's great! Does xarray have any similar logic in it?

In the case of appending to a zarr store using xarray,

  • From what I can tell, resizing happens here. (Btw if someone can explain write_region to me I would appreciate it, I couldn't find good documentation anywhere).
  • For writing actual chunks of data to keys, I believe that currently happens here (in zarr.array._set_selection). I will continue to dig into how it's working in xarray and zarr to understand how it should work here.

> Also this is not fully working yet, it is getting a decompression error 😭

> This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.

Yes I think my next step will be to write some simple tests.


codecov bot commented Oct 26, 2024

Codecov Report

Attention: Patch coverage is 88.61985% with 47 lines in your changes missing coverage. Please review.

Project coverage is 93.31%. Comparing base (3d7a4be) to head (7dc9186).
Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| virtualizarr/tests/test_writers/test_icechunk.py | 83.97% | 25 Missing ⚠️ |
| virtualizarr/codecs.py | 77.77% | 10 Missing ⚠️ |
| virtualizarr/manifests/utils.py | 85.45% | 8 Missing ⚠️ |
| virtualizarr/writers/icechunk.py | 94.23% | 3 Missing ⚠️ |
| virtualizarr/accessor.py | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #272       +/-   ##
===========================================
+ Coverage   77.76%   93.31%   +15.54%     
===========================================
  Files          48       51        +3     
  Lines        3378     3876      +498     
===========================================
+ Hits         2627     3617      +990     
+ Misses        751      259      -492     
Flag Coverage Δ
unittests 93.31% <88.61%> (+15.54%) ⬆️


Member

@TomNicholas TomNicholas left a comment

Thanks @abarciauskas-bgse ! I have a lot of smaller comments, but generally I think this is looking really promising!

virtualizarr/tests/test_writers/test_icechunk_append.py
Comment on lines 126 to 128
icechunk_filestore.commit(
    "test commit"
)  # need to commit it in order to append to it in the next lines
Member

I'm confused why that would be the case. What goes wrong if you write without committing, then append?

Member

Is it to do with the mode?

Collaborator Author

We need to open the existing store in append mode in order to append; otherwise I get the error:

zarr.errors.ContainsGroupError: A group exists in store <icechunk.IcechunkStore object at 0x10eaf9100> at path ''.

That's the error I get when I just try to reuse the store object returned by IcechunkStore.create. But if I use a store opened with mode='a' without first committing through the original store object, I get the following error:

FileNotFoundError: <icechunk.IcechunkStore object at 0x10960d490>

virtualizarr/writers/icechunk.py
Comment on lines 199 to 211
# determine number of existing chunks along the append axis
existing_num_chunks = num_chunks(
    array=group[name],
    axis=append_axis,
)

# creates array if it doesn't already exist
arr = group.require_array(
    name=name,
    shape=zarray.shape,
    chunk_shape=zarray.chunks,
    dtype=encode_dtype(zarray.dtype),
    codecs=zarray._v3_codec_pipeline(),
    dimension_names=var.dims,
    fill_value=zarray.fill_value,
    # TODO fill_value?
)

# TODO it would be nice if we could assign directly to the .attrs property
for k, v in var.attrs.items():
    arr.attrs[k] = encode_zarr_attr_value(v)
arr.attrs["_ARRAY_DIMENSIONS"] = encode_zarr_attr_value(var.dims)
# resize the array
arr = resize_array(
    group=group,
    name=name,
    var=var,
    append_axis=append_axis,
)
Member

Here you determine existing_num_chunks, but then don't actually use it until inside write_manifest_virtual_refs. I think you could move the num_chunks call inside write_manifest_virtual_refs, and eliminate the need to pass the existing_num_chunks arg down.

Collaborator Author

Ah yes, you're right, but the challenge is that after we resize the array, the num_chunks function will no longer return the pre-append chunk count. I will think about whether there is a better way to handle this, so we don't have to pass the existing_num_chunks arg around.
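The ordering problem can be shown with plain arithmetic. The sketch below is illustrative only; `plan_append` is a hypothetical helper that captures the chunk offset before computing the resized shape, which is one way to avoid reading the count from an array that has already grown.

```python
from typing import Sequence, Tuple


def plan_append(
    existing_shape: Sequence[int],
    chunks: Sequence[int],
    new_length: int,
    append_axis: int,
) -> Tuple[int, Tuple[int, ...]]:
    """Return (chunk offset for the new keys, shape after the resize)."""
    # Count chunks BEFORE growing the shape: this is where new chunks start.
    offset = existing_shape[append_axis] // chunks[append_axis]
    resized = list(existing_shape)
    resized[append_axis] += new_length
    return offset, tuple(resized)


offset, resized = plan_append((30, 4), (10, 4), new_length=20, append_axis=0)
print(offset, resized)

# Counting after the resize would give the total, not the starting offset.
print(resized[0] // 10)
```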

@abarciauskas-bgse
Collaborator Author

abarciauskas-bgse commented Dec 5, 2024

Thanks @mpiannucci but it's still not working for me... when I run IcechunkStore.create it returns a coroutine, not a store. I'm confounded! Not trying to be difficult, I swear! See https://github.com/zarr-developers/VirtualiZarr/blob/e38823c20134858029188260bb834669a202b13e/noaa-cdr-sst.ipynb

I just actually tried with icechunk 0.1.0a5 and I'm also getting a coroutine instead of a store. So something else must be going on...

@mpiannucci
Contributor

mpiannucci commented Dec 5, 2024

I believe you, that's very frustrating! That is also very confusing because it is definitely not an async function in a7!

https://github.com/earth-mover/icechunk/blob/a8c33b81d329c32a8e34e7151a1a71620967067a/icechunk-python/python/icechunk/__init__.py#L133

Is there any chance your environment is crossed up? Does icechunk.__version__ match the pip version output?
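One way to run that check from inside the interpreter that shows the problem is sketched below, using only the standard library. `compare_versions` is a hypothetical helper, and it assumes the package exposes a `__version__` attribute, as `icechunk.__version__` is referenced above.

```python
import importlib
from importlib import metadata
from typing import Optional


def compare_versions(module_name: str, dist_name: Optional[str] = None) -> dict:
    """Compare the version of the imported module with the installed distribution."""
    dist_name = dist_name or module_name
    module = importlib.import_module(module_name)
    imported = getattr(module, "__version__", None)
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        installed = None
    return {
        # The file path shows which environment the import actually came from.
        "module_file": getattr(module, "__file__", None),
        "imported_version": imported,
        "installed_version": installed,
        "match": imported is not None and imported == installed,
    }
```

Calling `compare_versions("icechunk")` in the affected notebook kernel and in a terminal session would show whether both are using the same installation.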

@abarciauskas-bgse
Collaborator Author

@mpiannucci yes, my environment (well, at least one of them) was messed up. Another one is working 😅 so we should be good now. I'm going to merge your version of the notebook and move it to the examples directory after cleaning up some extra cells...

Co-authored-by: Aimee Barciauskas <[email protected]>
@abarciauskas-bgse
Collaborator Author

ok @TomNicholas I think this is finally g2g, I added this to the release docs and checked the autogenerated API documentation for to_icechunk.

As far as the example I added in examples/ ... perhaps in another PR we can add this to the virtualizarr docs or icechunk docs.

Member

@TomNicholas TomNicholas left a comment

LGTM!

@TomNicholas
Member

(@abarciauskas-bgse you should have merge rights)

@abarciauskas-bgse abarciauskas-bgse merged commit 4d85a03 into main Dec 5, 2024
11 checks passed
@TomNicholas TomNicholas deleted the icechunk-append branch December 5, 2024 21:41
4 participants