Append to icechunk stores #272
All this does at the moment is resize the arrays which are being appended to and, probably too naïvely, increments the `append_dim` index of the chunk key by an offset of the existing number of chunks along the append dimension.
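To make the offsetting concrete, here is a minimal sketch of the idea, not the code in this PR. The helper name `offset_chunk_key` and the dot-separated key format (`"0.1.2"`) are assumptions made for illustration.

```python
def offset_chunk_key(
    chunk_key: str,
    append_axis: int,
    existing_num_chunks: int,
    separator: str = ".",  # assumed key format, e.g. "0.1.2"
) -> str:
    """Shift one index of a chunk key along the append axis (hypothetical helper)."""
    indices = [int(part) for part in chunk_key.split(separator)]
    # Chunks of the appended array start after the chunks that already exist.
    indices[append_axis] += existing_num_chunks
    return separator.join(str(i) for i in indices)


# With 3 existing chunks along axis 0, the first appended chunk moves from "0.0" to "3.0".
print(offset_chunk_key("0.0", append_axis=0, existing_num_chunks=3))
```

Only the index on the append axis changes; the indices on every other axis are left as they are.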
I think that's great! Does xarray have any similar logic in it?
Also this is not fully working yet, it is getting a decompression error 😭
This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.
virtualizarr/writers/icechunk.py
mode = store.mode.str

# Aimee: resize the array if it already exists
# TODO: assert chunking and encoding is the same
Should also test that it raises a clear error if you try to append with chunks of a different dtype etc. I would hope zarr-python would throw that for us.
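In case zarr-python does not raise for us, a check along these lines could work. This is only a sketch: `ArrayMeta` and `check_append_compatible` are hypothetical names, and `ArrayMeta` is a plain stand-in for the array metadata, not a zarr or VirtualiZarr class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayMeta:
    """Hypothetical stand-in for the metadata of an array."""

    dtype: str
    chunks: tuple[int, ...]


def check_append_compatible(existing: ArrayMeta, incoming: ArrayMeta) -> None:
    """Raise a clear error if the incoming array cannot be appended."""
    if existing.dtype != incoming.dtype:
        raise ValueError(
            f"Cannot append: dtype {incoming.dtype!r} does not match "
            f"existing dtype {existing.dtype!r}"
        )
    if existing.chunks != incoming.chunks:
        raise ValueError(
            f"Cannot append: chunks {incoming.chunks} do not match "
            f"existing chunks {existing.chunks}"
        )
```

A test can then assert that mismatched metadata raises `ValueError` with a message naming the offending property.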
virtualizarr/writers/icechunk.py
existing_num_chunks = int(
    existing_size / existing_array.chunks[append_axis]
)
There's a whole beartrap here around noticing if the last chunk is smaller than the other chunks. We should throw in that case (because zarr can't support it without variable-length chunks).
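One way to guard against this, as a sketch only: check that the existing size along the append axis is a whole multiple of the chunk size before counting chunks. The function name `num_chunks_along_axis` and its signature are assumptions, not the helper in this PR.

```python
def num_chunks_along_axis(
    shape: tuple[int, ...], chunks: tuple[int, ...], axis: int
) -> int:
    """Count whole chunks along `axis`, refusing a partial final chunk (hypothetical helper)."""
    size, chunk_size = shape[axis], chunks[axis]
    if size % chunk_size != 0:
        # A smaller last chunk would leave a gap once more chunks are appended.
        raise ValueError(
            f"Cannot append along axis {axis}: existing size {size} is not a "
            f"multiple of the chunk size {chunk_size}"
        )
    return size // chunk_size
```

Using integer division here also avoids the `int(a / b)` rounding in the snippet under review.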
In the case of appending to a zarr store using xarray,
Yes I think my next step will be to write some simple tests.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #272 +/- ##
===========================================
+ Coverage 77.76% 93.31% +15.54%
===========================================
Files 48 51 +3
Lines 3378 3876 +498
===========================================
+ Hits 2627 3617 +990
+ Misses 751 259 -492
Thanks @abarciauskas-bgse ! I have a lot of smaller comments, but generally I think this is looking really promising!
icechunk_filestore.commit(
    "test commit"
)  # need to commit it in order to append to it in the next lines
I'm confused why that would be the case. What goes wrong if you write without committing, then append?
Is it to do with the `mode`?
We need to open the existing store in append mode in order to append; otherwise I get the error:
zarr.errors.ContainsGroupError: A group exists in store <icechunk.IcechunkStore object at 0x10eaf9100> at path ''.
That's the error that I get just trying to use the store object from `IcechunkStore.create(`. But if I do use a store with `mode='a'` but do not commit to the first store object, I get the following error:
FileNotFoundError: <icechunk.IcechunkStore object at 0x10960d490>
virtualizarr/writers/icechunk.py
# determine number of existing chunks along the append axis
existing_num_chunks = num_chunks(
    array=group[name],
    axis=append_axis,
)

# creates array if it doesn't already exist
arr = group.require_array(
    name=name,
    shape=zarray.shape,
    chunk_shape=zarray.chunks,
    dtype=encode_dtype(zarray.dtype),
    codecs=zarray._v3_codec_pipeline(),
    dimension_names=var.dims,
    fill_value=zarray.fill_value,
    # TODO fill_value?
)

# TODO it would be nice if we could assign directly to the .attrs property
for k, v in var.attrs.items():
    arr.attrs[k] = encode_zarr_attr_value(v)
arr.attrs["_ARRAY_DIMENSIONS"] = encode_zarr_attr_value(var.dims)
# resize the array
arr = resize_array(
    group=group,
    name=name,
    var=var,
    append_axis=append_axis,
)
Here you determine `existing_num_chunks`, but then don't actually use it until inside `write_manifest_virtual_refs`. I think you could move the `num_chunks` call inside `write_manifest_virtual_refs`, and eliminate the need to pass the `existing_num_chunks` arg down.
Ah yes, you're right, but the challenge is that after we resize the array, the `num_chunks` function will no longer return the right size. I will think about whether there is a better way to handle this, so we don't have to pass the `existing_num_chunks` arg around.
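The ordering problem can be illustrated with a small sketch that works on shapes only. `AppendPlan` and `plan_append` are hypothetical names; the point is just that the chunk offset is captured before the new shape is computed, so both can be returned together instead of threading a separate argument through.

```python
from typing import NamedTuple


class AppendPlan(NamedTuple):
    """Hypothetical result type: the chunk offset and the shape to resize to."""

    existing_num_chunks: int
    new_shape: tuple[int, ...]


def plan_append(
    existing_shape: tuple[int, ...],
    chunks: tuple[int, ...],
    appended_shape: tuple[int, ...],
    append_axis: int,
) -> AppendPlan:
    # Count chunks from the shape as it is *before* the resize.
    existing_num_chunks = existing_shape[append_axis] // chunks[append_axis]
    new_shape = list(existing_shape)
    new_shape[append_axis] += appended_shape[append_axis]
    return AppendPlan(existing_num_chunks, tuple(new_shape))


plan = plan_append((6, 4), (2, 4), (4, 4), append_axis=0)
print(plan.existing_num_chunks, plan.new_shape)
```

Counting after the resize would give 5 chunks along axis 0 here rather than 3, which is the wrong offset for the appended chunk keys.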
I just actually tried with icechunk 0.1.0a5 and I'm also getting a coroutine instead of a store. So something else must be going on...
I believe you, that's very frustrating! That is also very confusing because it is definitely not an async function in a7! Is there any chance your environment is crossed up? Does
@mpiannucci yes my environment (well at least 1 of them) was messed up. Another one is working 😅 so we should be good now. I'm going to merge your version of the notebook and move it to the examples directory after cleaning up some extra cells...
Co-authored-by: Aimee Barciauskas <[email protected]>
ok @TomNicholas I think this is finally g2g, I added to release docs and checked the autogenerated API documentation for to_icechunk. As far as the example I added in examples/ ... perhaps in another PR we can add this to the virtualizarr docs or icechunk docs.
LGTM!
Co-authored-by: Tom Nicholas <[email protected]>
(@abarciauskas-bgse you should have merge rights)
…zarr into icechunk-append
This resizes the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.
Also Zarr append ref: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/array.py#L1134-L1186
docs/releases.rst
api.rst