
dmrpp root and nested group parsing fix #265

Merged (15 commits) Nov 8, 2024

Conversation

@ayushnag (Contributor) commented Oct 21, 2024

@ayushnag (Contributor, Author) commented Nov 5, 2024

@TomNicholas this is ready to be reviewed/merged. (cc: @danielfromearth)

One oddity: I cannot test the dmrpp reader against real-world NASA files because of this line in the kerchunk ZArray logic, which changes fill_val for float-dtype arrays. The dmrpp reader sets fill_val to None by default, as the ZArray constructor does. But because the kerchunk code has this extra logic, otherwise-identical virtual datasets end up with different fill_val values depending on whether they are opened via the virtualizarr netCDF reader or the dmrpp reader. Perhaps that logic should live in the ZArray constructor, or only be applied when writing to kerchunk.
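To make the discrepancy concrete, here is a minimal sketch of the kind of coercion being described: for floating-point dtypes a missing fill value gets replaced with NaN, while other dtypes keep None. The function name `normalize_fill_value` and the dtype-string check are illustrative assumptions, not the actual kerchunk code.

```python
import math

def normalize_fill_value(fill_value, dtype: str):
    """Hypothetical sketch of the behavior described above: for float
    dtypes (numpy-style strings like "<f4"), a None fill value is
    coerced to NaN; other dtypes pass through unchanged."""
    if fill_value is None and dtype.lstrip("<>=|").startswith("f"):
        return math.nan
    return fill_value

# A float array's missing fill value silently becomes NaN...
print(normalize_fill_value(None, "<f4"))  # nan
# ...while an integer array's stays None, so otherwise-identical
# datasets no longer compare equal on fill_value.
print(normalize_fill_value(None, "<i4"))  # None
```

With logic like this applied in only one reader, two virtual datasets built from the same file differ solely in their fill values, which is exactly the comparison failure reported.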

@ayushnag ayushnag marked this pull request as ready for review November 5, 2024 02:05
@TomNicholas (Member) left a comment:

This generally looks good to me but I also don't really know anything about the DMR format. I think it would be good if @betolink or somebody would take on the responsibility of reviewing DMR-related PRs to this repo.

# TODO: later add MUR, SWOT, TEMPO and others by using kerchunk JSON to read refs (rather than reading the whole netcdf file)
@TomNicholas (Member):

What exactly do you mean? Have a pre-generated kerchunk JSON file and open that as a virtual dataset for comparison?

@ayushnag (Contributor, Author):

Yes. The current testing approach is to load a netCDF file, generate the references, and then compare with the dmrpp-parsed version. However, this is not ideal for larger files (especially when these tests are part of GitHub Actions runs that happen frequently). The alternative is to generate kerchunk (or maybe icechunk?) references so that only a small JSON file needs to be read to compare the datasets. I have disabled this test for now due to the fill_val issue mentioned above.

@TomNicholas (Member):

Is there a reason the test needs to rely on kerchunk json at all? I would quite like to move towards a testing strategy for VirtualiZarr where kerchunk is not involved for anything other than explicitly testing the kerchunk reader & writer.

What might be better is to have a small netCDF, then the expected byte ranges and so on are written into the test itself, by explicitly creating Manifest objects. Then you compare that manually-constructed virtual dataset to the virtual dataset created by the DMR parser in open_virtual_dataset.

In general I think it was a mistake for me to create a testing paradigm that encouraged using kerchunk where it wasn't actually necessary.
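The suggested pattern can be sketched with plain dicts standing in for VirtualiZarr's Manifest objects: hard-code the expected byte ranges for a tiny file directly in the test, then compare against what the parser produces. The helper name `parse_dmrpp_stub`, the file name, and the byte ranges below are all illustrative, not taken from the real test suite.

```python
# Stand-in for the DMR++ parser; a real test would call
# open_virtual_dataset on a tiny .dmrpp fixture and extract the
# chunk manifest from the resulting virtual dataset.
def parse_dmrpp_stub(path: str) -> dict:
    return {
        "0.0": {"path": path, "offset": 200, "length": 100},
        "0.1": {"path": path, "offset": 300, "length": 100},
    }

def test_manifest_matches_expected():
    # The expected byte ranges live in the test itself, so no
    # kerchunk round-trip is needed to establish ground truth.
    expected = {
        "0.0": {"path": "tiny.nc", "offset": 200, "length": 100},
        "0.1": {"path": "tiny.nc", "offset": 300, "length": 100},
    }
    assert parse_dmrpp_stub("tiny.nc") == expected

test_manifest_matches_expected()
```

The design point is that the fixture is small enough for a human to verify the offsets by hand, so the test is independent of any other reader or writer.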

@ayushnag (Contributor, Author):

That approach does make sense in general. This specific test is a sort of "real-world" check to make sure the parser works on real NASA datasets and their associated dmrpp files, since those files can have unique characteristics that the other tests miss. The reason for using kerchunk JSON is just so I can generate references for the already-existing netCDF files.

Maybe this could be an external test (not part of this repo) that is used to catch bugs and add updates to the main dmrpp test suite.

@TomNicholas (Member):

I think we can leave this for a follow-up.

virtualizarr/tests/test_readers/test_dmrpp.py (outdated review thread, resolved)
@TomNicholas (Member):

> One oddity is that I cannot test the dmrpp reader with real world NASA files because of this line in the kerchunk ZArray logic which changes fill_val for a float dtype array. The dmrpp reader sets fill_val to None by default as the ZArray constructor also has. However since the kerchunk code has extra logic, otherwise identical virtual datasets have a different fill_val when opened with virtualizarr netcdf vs dmrpp. Perhaps that logic should be in the ZArray constructor or only added when writing to kerchunk

Thanks for flagging that - I can't remember exactly what the idea was but it seems like a bug or at least some workaround specific to kerchunk. Would you mind raising a new issue to track this oddity?

@danielfromearth commented Nov 5, 2024

@ayushnag, this is awesome, and the results are looking great for TEMPO!

[screenshot: TEMPO results]

Future roadmap is to incorporate this with DataTree too, right, so subgroups are loaded together? :)

@ayushnag (Contributor, Author) commented Nov 6, 2024

Glad to see it!

Yes, DataTree integration is on the roadmap. In fact, the function _split_groups() currently returns dict[Path, ET.Element], which maps directly onto DataTree.from_dict() (each ET.Element can be converted to an xr.Dataset).

> @ayushnag, this is awesome, and the results are looking great for TEMPO!
>
> Future roadmap is to incorporate this with DataTree too, right, so subgroups are loaded together? :)
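The dict[Path, ET.Element] idea can be illustrated with the standard library alone: walk nested Group elements in a DMR-like XML tree and key each one by its slash-separated path, which is the shape DataTree.from_dict() expects ("/" for the root). The XML layout and the function body below are illustrative sketches, not the actual _split_groups() implementation.

```python
import xml.etree.ElementTree as ET

# A toy DMR-like document with one nested group hierarchy.
DMR = """
<Dataset name="root">
  <Float32 name="lat"/>
  <Group name="product">
    <Float32 name="radiance"/>
    <Group name="support_data">
      <Int32 name="flags"/>
    </Group>
  </Group>
</Dataset>
"""

def split_groups(root: ET.Element) -> dict[str, ET.Element]:
    """Illustrative version of the idea behind _split_groups(): map each
    group path ("/", "/product", ...) to its XML element, so each element
    can later be converted to an xr.Dataset for DataTree.from_dict()."""
    groups: dict[str, ET.Element] = {"/": root}

    def walk(elem: ET.Element, prefix: str) -> None:
        for child in elem.findall("Group"):
            path = f"{prefix.rstrip('/')}/{child.get('name')}"
            groups[path] = child
            walk(child, path)

    walk(root, "/")
    return groups

paths = sorted(split_groups(ET.fromstring(DMR)))
print(paths)  # ['/', '/product', '/product/support_data']
```

Each value in the returned dict is still a subtree of the original document, so nested groups stay attached to their parents until they are converted.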

@TomNicholas TomNicholas mentioned this pull request Nov 8, 2024
# Encoding keys that should be removed from attributes and placed in xarray encoding dict
-_encoding_keys = {"_FillValue", "missing_value", "scale_factor", "add_offset"}
+_ENCODING_KEYS = {"_FillValue", "missing_value", "scale_factor", "add_offset"}
@TomNicholas (Member):

I just noticed this exact set also occurs here

_encoding_keys = {"_FillValue", "missing_value", "scale_factor", "add_offset"}

Perhaps we should make this a semi-private global variable defined in a common location, such as virtualizarr/backend.py?
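For context, a shared constant like this is typically used to split a variable's attributes into plain attrs and xarray-style encoding entries. The sketch below shows that usage; the helper name `split_encoding` is hypothetical, while the key set matches the one quoted above.

```python
# The set both readers currently duplicate; defining it once in a
# common module is what the review comment suggests.
_ENCODING_KEYS = {"_FillValue", "missing_value", "scale_factor", "add_offset"}

def split_encoding(attrs: dict) -> tuple[dict, dict]:
    """Pop encoding-related keys out of an attributes dict, returning
    (remaining_attrs, encoding). Iteration follows the input dict's
    order, so results are deterministic."""
    remaining = {k: v for k, v in attrs.items() if k not in _ENCODING_KEYS}
    encoding = {k: v for k, v in attrs.items() if k in _ENCODING_KEYS}
    return remaining, encoding

attrs, encoding = split_encoding(
    {"units": "K", "scale_factor": 0.01, "_FillValue": -999}
)
print(attrs)     # {'units': 'K'}
print(encoding)  # {'scale_factor': 0.01, '_FillValue': -999}
```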

@TomNicholas (Member):

Raised #290 to track this.

@TomNicholas TomNicholas merged commit 9e7d430 into zarr-developers:main Nov 8, 2024
11 checks passed
@TomNicholas (Member):

Excellent, thank you @ayushnag!

@TomNicholas TomNicholas added the enhancement New feature or request label Nov 8, 2024
@betolink:

Great work @ayushnag!! I'm just coming back from PTO. I was going to ask: should we try to schedule a meeting with the OPeNDAP people? Thanks for reviewing this, @TomNicholas!

Labels: DMR++, enhancement

Successfully merging this pull request may close: open_virtual_dataset error with TEMPO dmr++

4 participants