dmrpp root and nested group parsing fix #265
Conversation
@TomNicholas this is ready to be reviewed/merged. (cc: @danielfromearth) One oddity is that I cannot test the dmrpp reader with real-world NASA files because of this line in the kerchunk ZArray logic which changes
This generally looks good to me but I also don't really know anything about the DMR format. I think it would be good if @betolink or somebody would take on the responsibility of reviewing DMR-related PRs to this repo.
# TODO: later add MUR, SWOT, TEMPO and others by using kerchunk JSON to read refs (rather than reading the whole netcdf file)
What exactly do you mean? Have a pre-generated kerchunk JSON file and open that as a virtual dataset for comparison?
Yes. The current testing approach is to load a netcdf, generate the references, and then compare with the dmrpp-parsed version. However, this is not ideal for larger files (especially when these tests are part of GitHub Actions that run frequently). So the alternative is to generate kerchunk (or maybe icechunk?) references so only a small JSON needs to be read to compare the datasets. I have disabled this test for now due to the fill_val issue mentioned above.
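To illustrate the idea of pre-generated references, here is a minimal sketch of what such a kerchunk-style reference file could look like. The variable name, bucket path, and byte ranges are entirely made up for illustration; a real reference set produced by kerchunk would also carry `.zattrs` and top-level metadata.

```python
import json

# Hypothetical kerchunk-style reference set: each chunk key maps to
# [url, byte_offset, byte_length] into the original netCDF file.
# The path and byte ranges below are illustrative, not from a real dataset.
refs = {
    "version": 1,
    "refs": {
        "air/.zarray": json.dumps(
            {"shape": [10, 10], "chunks": [5, 10], "dtype": "<f4",
             "compressor": None, "filters": None, "fill_value": None,
             "order": "C", "zarr_format": 2}
        ),
        "air/0.0": ["s3://bucket/example.nc", 1024, 2000],
        "air/1.0": ["s3://bucket/example.nc", 3024, 2000],
    },
}

# Only this small JSON needs to live in (or be fetched by) the test suite,
# instead of the full netCDF file.
serialized = json.dumps(refs)
loaded = json.loads(serialized)
```

The point of the comment above is exactly this size difference: the JSON holding the byte ranges is a few hundred bytes, while the NASA granules it describes can be gigabytes.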
Is there a reason the test needs to rely on kerchunk JSON at all? I would quite like to move towards a testing strategy for VirtualiZarr where kerchunk is not involved for anything other than explicitly testing the kerchunk reader & writer.
What might be better is to have a small netCDF, where the expected byte ranges and so on are written into the test itself by explicitly creating `Manifest` objects. Then you compare that manually-constructed virtual dataset to the virtual dataset created by the DMR parser in `open_virtual_dataset`.
In general I think it was a mistake for me to create a testing paradigm that encouraged using kerchunk where it wasn't actually necessary.
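A rough sketch of the kerchunk-free test shape being suggested, with plain dicts standing in for VirtualiZarr's manifest classes. The variable name, offsets, and the `fake_dmrpp_parse` stub are all hypothetical; in the real test the actual manifest would come from `open_virtual_dataset` on a `.dmrpp` file.

```python
# Sketch of a kerchunk-free test: the expected byte ranges are written
# directly into the test as a chunk manifest. Plain dicts stand in for
# VirtualiZarr's ChunkManifest here; names and numbers are illustrative.
expected_manifest = {
    "0.0": {"path": "test.nc", "offset": 6144, "length": 48},
    "1.0": {"path": "test.nc", "offset": 6192, "length": 48},
}

def fake_dmrpp_parse(dmrpp_path: str) -> dict:
    """Stand-in for the DMR++ parser; a real test would instead call
    open_virtual_dataset(dmrpp_path, filetype="dmrpp") and inspect the
    manifests of the resulting virtual dataset."""
    return {
        "0.0": {"path": "test.nc", "offset": 6144, "length": 48},
        "1.0": {"path": "test.nc", "offset": 6192, "length": 48},
    }

actual_manifest = fake_dmrpp_parse("test.nc.dmrpp")
assert actual_manifest == expected_manifest
```

The design win is that the expected values are visible in the test source itself, so a parser regression shows up as a concrete byte-range mismatch rather than a diff between two opaque reference files.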
That approach does make sense in general. This specific test is sort of a "real-world" check to make sure the parser works on real NASA datasets and their associated dmrpp files. These files could have unique characteristics that the other tests have missed. The reason for kerchunk JSON is just so I can make references for those already existing netcdf files.
Maybe this could be an external test (not part of this repo) that is used to catch bugs and add updates to the main dmrpp test suite.
I think we can leave this for a follow-up
Thanks for flagging that - I can't remember exactly what the idea was but it seems like a bug or at least some workaround specific to kerchunk. Would you mind raising a new issue to track this oddity?
@ayushnag, this is awesome, and the results are looking great for TEMPO! Future roadmap is to incorporate this with DataTree too, right, so subgroups are loaded together? :)
Glad to see it! Yes, DataTree integration is on the roadmap. In fact, the function
# Encoding keys that should be removed from attributes and placed in xarray encoding dict
- _encoding_keys = {"_FillValue", "missing_value", "scale_factor", "add_offset"}
+ _ENCODING_KEYS = {"_FillValue", "missing_value", "scale_factor", "add_offset"}
I just noticed this exact set also occurs here
VirtualiZarr/virtualizarr/writers/icechunk.py
Line 148 in 4ae7a19
_encoding_keys = {"_FillValue", "missing_value", "scale_factor", "add_offset"}
Perhaps we should make this a semi-private global variable defined in a common location? Perhaps move it to `virtualizarr/backend.py`?
raised #290 to track this
Excellent, thank you @ayushnag!
Great work @ayushnag!! I'm just coming back from PTO. I was going to ask: should we try to schedule a meeting with the OPeNDAP people? Thanks for reviewing this @TomNicholas!
`open_virtual_dataset` error with TEMPO dmr++ #237