
dmrpp root and nested group parsing fix #265

Merged (15 commits) Nov 8, 2024

Conversation

@ayushnag (Contributor) commented Oct 21, 2024

@ayushnag (Contributor, Author) commented Nov 5, 2024

@TomNicholas this is ready to be reviewed/merged. (cc: @danielfromearth)

One oddity: I cannot test the dmrpp reader against real-world NASA files because of this line in the kerchunk ZArray logic, which changes fill_val for float-dtype arrays. The dmrpp reader sets fill_val to None by default, as the ZArray constructor does. But because the kerchunk code has this extra logic, otherwise-identical virtual datasets end up with different fill_val values depending on whether they are opened via the virtualizarr netCDF reader or the dmrpp reader. Perhaps that logic should live in the ZArray constructor, or only be applied when writing to kerchunk.
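To make the discrepancy concrete, here is a minimal sketch of the kind of coercion being described: for floating-point dtypes a missing fill value gets replaced with NaN, while other dtypes keep None. The function name `normalize_fill_value` and the dtype-string check are illustrative assumptions, not the actual kerchunk code.

```python
import math

def normalize_fill_value(fill_value, dtype: str):
    """Hypothetical sketch of the behavior described above: for float
    dtypes (numpy-style strings like "<f4"), a None fill value is
    coerced to NaN; other dtypes pass through unchanged."""
    if fill_value is None and dtype.lstrip("<>=|").startswith("f"):
        return math.nan
    return fill_value

# A float array's missing fill value silently becomes NaN...
print(normalize_fill_value(None, "<f4"))  # nan
# ...while an integer array's stays None, so otherwise-identical
# datasets no longer compare equal on fill_value.
print(normalize_fill_value(None, "<i4"))  # None
```

With logic like this applied in only one reader, two virtual datasets built from the same file differ solely in their fill values, which is exactly the comparison failure reported.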

@ayushnag ayushnag marked this pull request as ready for review November 5, 2024 02:05
@TomNicholas (Member) left a comment:

This generally looks good to me but I also don't really know anything about the DMR format. I think it would be good if @betolink or somebody would take on the responsibility of reviewing DMR-related PRs to this repo.

# TODO: later add MUR, SWOT, TEMPO and others by using kerchunk JSON to read refs (rather than reading the whole netcdf file)
@TomNicholas (Member):

What exactly do you mean? Have a pre-generated kerchunk JSON file and open that as a virtual dataset for comparison?

@ayushnag (Contributor, Author):

Yes. The current testing approach is to load a netCDF file, generate the references, and then compare with the dmrpp-parsed version. However, this is not ideal for larger files (especially when these tests are part of GitHub Actions runs that happen frequently). The alternative is to generate kerchunk (or maybe icechunk?) references so that only a small JSON file needs to be read to compare the datasets. I have disabled this test for now due to the fill_val issue mentioned above.

@TomNicholas (Member):

Is there a reason the test needs to rely on kerchunk json at all? I would quite like to move towards a testing strategy for VirtualiZarr where kerchunk is not involved for anything other than explicitly testing the kerchunk reader & writer.

What might be better is to have a small netCDF, then the expected byte ranges and so on are written into the test itself, by explicitly creating Manifest objects. Then you compare that manually-constructed virtual dataset to the virtual dataset created by the DMR parser in open_virtual_dataset.

In general I think it was a mistake for me to create a testing paradigm that encouraged using kerchunk where it wasn't actually necessary.
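The suggested pattern can be sketched with plain dicts standing in for VirtualiZarr's Manifest objects: hard-code the expected byte ranges for a tiny file directly in the test, then compare against what the parser produces. The helper name `parse_dmrpp_stub`, the file name, and the byte ranges below are all illustrative, not taken from the real test suite.

```python
# Stand-in for the DMR++ parser; a real test would call
# open_virtual_dataset on a tiny .dmrpp fixture and extract the
# chunk manifest from the resulting virtual dataset.
def parse_dmrpp_stub(path: str) -> dict:
    return {
        "0.0": {"path": path, "offset": 200, "length": 100},
        "0.1": {"path": path, "offset": 300, "length": 100},
    }

def test_manifest_matches_expected():
    # The expected byte ranges live in the test itself, so no
    # kerchunk round-trip is needed to establish ground truth.
    expected = {
        "0.0": {"path": "tiny.nc", "offset": 200, "length": 100},
        "0.1": {"path": "tiny.nc", "offset": 300, "length": 100},
    }
    assert parse_dmrpp_stub("tiny.nc") == expected

test_manifest_matches_expected()
```

The design point is that the fixture is small enough for a human to verify the offsets by hand, so the test is independent of any other reader or writer.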

@ayushnag (Contributor, Author):

That approach does make sense in general. This specific test is a sort of "real-world" check to make sure the parser works on real NASA datasets and their associated dmrpp files, since those files can have unique characteristics that the other tests miss. The reason for using kerchunk JSON is just so I can generate references for the already-existing netCDF files.

Maybe this could be an external test (not part of this repo) that is used to catch bugs and add updates to the main dmrpp test suite.

@TomNicholas (Member):

I think we can leave this for a follow-up.

virtualizarr/tests/test_readers/test_dmrpp.py (outdated review thread, resolved)
@TomNicholas (Member):

> One oddity is that I cannot test the dmrpp reader with real world NASA files because of this line in the kerchunk ZArray logic which changes fill_val for a float dtype array. The dmrpp reader sets fill_val to None by default as the ZArray constructor also has. However since the kerchunk code has extra logic, otherwise identical virtual datasets have a different fill_val when opened with virtualizarr netcdf vs dmrpp. Perhaps that logic should be in the ZArray constructor or only added when writing to kerchunk

Thanks for flagging that - I can't remember exactly what the idea was but it seems like a bug or at least some workaround specific to kerchunk. Would you mind raising a new issue to track this oddity?

@danielfromearth commented Nov 5, 2024

@ayushnag, this is awesome, and the results are looking great for TEMPO!

[screenshot: TEMPO results]

Future roadmap is to incorporate this with DataTree too, right, so subgroups are loaded together? :)

@ayushnag (Contributor, Author) commented Nov 6, 2024

Glad to see it!

Yes, DataTree integration is on the roadmap. In fact, the function _split_groups() currently returns dict[Path, ET.Element], which maps directly onto DataTree.from_dict() (each ET.Element can be converted to an xr.Dataset).

> @ayushnag, this is awesome, and the results are looking great for TEMPO!
>
> Future roadmap is to incorporate this with DataTree too, right, so subgroups are loaded together? :)
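The dict[Path, ET.Element] idea can be illustrated with the standard library alone: walk nested Group elements in a DMR-like XML tree and key each one by its slash-separated path, which is the shape DataTree.from_dict() expects ("/" for the root). The XML layout and the function body below are illustrative sketches, not the actual _split_groups() implementation.

```python
import xml.etree.ElementTree as ET

# A toy DMR-like document with one nested group hierarchy.
DMR = """
<Dataset name="root">
  <Float32 name="lat"/>
  <Group name="product">
    <Float32 name="radiance"/>
    <Group name="support_data">
      <Int32 name="flags"/>
    </Group>
  </Group>
</Dataset>
"""

def split_groups(root: ET.Element) -> dict[str, ET.Element]:
    """Illustrative version of the idea behind _split_groups(): map each
    group path ("/", "/product", ...) to its XML element, so each element
    can later be converted to an xr.Dataset for DataTree.from_dict()."""
    groups: dict[str, ET.Element] = {"/": root}

    def walk(elem: ET.Element, prefix: str) -> None:
        for child in elem.findall("Group"):
            path = f"{prefix.rstrip('/')}/{child.get('name')}"
            groups[path] = child
            walk(child, path)

    walk(root, "/")
    return groups

paths = sorted(split_groups(ET.fromstring(DMR)))
print(paths)  # ['/', '/product', '/product/support_data']
```

Each value in the returned dict is still a subtree of the original document, so nested groups stay attached to their parents until they are converted.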

@TomNicholas TomNicholas mentioned this pull request Nov 8, 2024
# Encoding keys that should be removed from attributes and placed in xarray encoding dict
-_encoding_keys = {"_FillValue", "missing_value", "scale_factor", "add_offset"}
+_ENCODING_KEYS = {"_FillValue", "missing_value", "scale_factor", "add_offset"}
@TomNicholas (Member):

I just noticed this exact set also occurs here

_encoding_keys = {"_FillValue", "missing_value", "scale_factor", "add_offset"}

Perhaps we should make this a semi-private global variable defined in a common location, such as virtualizarr/backend.py?
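For context, a shared constant like this is typically used to split a variable's attributes into plain attrs and xarray-style encoding entries. The sketch below shows that usage; the helper name `split_encoding` is hypothetical, while the key set matches the one quoted above.

```python
# The set both readers currently duplicate; defining it once in a
# common module is what the review comment suggests.
_ENCODING_KEYS = {"_FillValue", "missing_value", "scale_factor", "add_offset"}

def split_encoding(attrs: dict) -> tuple[dict, dict]:
    """Pop encoding-related keys out of an attributes dict, returning
    (remaining_attrs, encoding). Iteration follows the input dict's
    order, so results are deterministic."""
    remaining = {k: v for k, v in attrs.items() if k not in _ENCODING_KEYS}
    encoding = {k: v for k, v in attrs.items() if k in _ENCODING_KEYS}
    return remaining, encoding

attrs, encoding = split_encoding(
    {"units": "K", "scale_factor": 0.01, "_FillValue": -999}
)
print(attrs)     # {'units': 'K'}
print(encoding)  # {'scale_factor': 0.01, '_FillValue': -999}
```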

@TomNicholas (Member):

Raised #290 to track this.

@TomNicholas TomNicholas merged commit 9e7d430 into zarr-developers:main Nov 8, 2024
11 checks passed
@TomNicholas (Member):

Excellent, thank you @ayushnag!

@TomNicholas TomNicholas added the enhancement New feature or request label Nov 8, 2024
@betolink:

Great work @ayushnag!! I'm just coming back from PTO. I was going to ask: should we try to schedule a meeting with the OPeNDAP people? Thanks for reviewing this, @TomNicholas!

Labels: DMR++, enhancement

Successfully merging this pull request may close: open_virtual_dataset error with TEMPO dmr++

4 participants