
[Backend Configuration IIIa] Configure all datasets in-memory to be written in specific way to disk #571

Merged: 22 commits into main on Dec 7, 2023

Conversation

@CodyCBakerPhD (Member)

replaces #557

This is the most useful piece of everything built up to this point: the application of the actual DataIO classes given a user-selected backend.

These operations are still introduced as standalone tools for the time being; they cannot be integrated into the main Interface/Converter workflow until the HDMF-Zarr and HDMF package incompatibilities are resolved upstream.
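
As a rough sketch, the intended standalone usage looks something like the following (the helper names mirror those touched in this PR, e.g. configure_backend in src/neuroconv/tools/nwb_helpers; exact signatures may differ):

from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration

# nwbfile: an in-memory pynwb.NWBFile, e.g. built by an interface/converter
# Build the per-dataset configuration for a chosen backend ("hdf5" or "zarr"),
# optionally tweak chunking/compression per dataset, then apply it to the in-memory NWBFile
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)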

@CodyCBakerPhD changed the title from "[Backend Configuration III] Configure all datasets in-memory to be written in specific way to ddisk" to "[Backend Configuration III] Configure all datasets in-memory to be written in specific way to disk" on Sep 25, 2023
Base automatically changed from new_backend_add_backend_configuration_tool to main November 28, 2023 19:41
@CodyCBakerPhD marked this pull request as ready for review December 4, 2023 03:46
@CodyCBakerPhD (Member Author)

@h-mayorquin Should be ready for review

@h-mayorquin (Collaborator)

@CodyCBakerPhD Conflicts and some tests are still failing.

@CodyCBakerPhD (Member Author)

@h-mayorquin Should be good now

@h-mayorquin (Collaborator) left a comment


Some questions and this should be ready to go.


# TODO: update buffer shape in iterator, if present

nwbfile_objects[object_id].set_data_io(dataset_name=dataset_name, data_io_class=data_io_class, **data_io_kwargs)
@h-mayorquin (Collaborator)


It would be nice if we could get the neurodata object by location.

I am also curious why set_data_io is at the neurodata object level and not at the dataset level. That's where the IO stuff ends up, right?

@CodyCBakerPhD (Member Author)


It would be nice if we could get the neurodata object by location.

Disagree in this case

The location is just a heuristic for allowing a human to easily and intuitively match an object ID to an object at some relative location inside the file, which is why we use it in the intermediary backend configuration class (which lists everything out by location in a flat way).

Whereas when interacting with the actual file object in memory, the object ID is the true identifier and allows access in a flat way instead of imposing a nested structure on the object lookup (which is what the location would do if we used it here).

I am also curious why set_data_io is at the neurodata object level and not at the dataset level. That's where the IO stuff ends up, right?

It's just the way it was implemented upstream in HDMF. I think it's because it's easier to define such an alteration on the objects of a container, which makes it a method of the container with an argument pointing to a specific dataset on that container.
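
For illustration, a minimal sketch of that container-level pattern (assuming the HDMF signature used in the diff line above; the TimeSeries name is taken from the tests and is otherwise hypothetical):

from pynwb import H5DataIO

# set_data_io is called on the container (the neurodata object), and the
# dataset_name argument points at the specific dataset on that container to wrap
time_series = nwbfile.acquisition["TestTimeSeries"]
time_series.set_data_io(
    dataset_name="data",
    data_io_class=H5DataIO,
    compression="gzip",      # forwarded to the DataIO constructor as **data_io_kwargs
    compression_opts=4,
)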

@h-mayorquin (Collaborator)


Disagree in this case

You disagree that it would be good to have it?

It is good to have both. Most people are used to file systems, so that's easy to use, but unique identifiers are important for validation and programmatic use. What would be the downside?

In HDF5 you can access things by their path within the file, which I think makes sense from a usability perspective.

@CodyCBakerPhD (Member Author) · Dec 7, 2023


If you're asking/commenting in general that 'it would be nice to fetch an object in an NWB file knowing its path', then that already exists in the PyNWB API (it's the natural way of navigating group/dataset contents across the modules and submodules, but it relies on traversing each level of nesting).

But I think what you are referring to is passing a single string that includes slashes to indicate the nested group structure (which you can also do in PyNWB; just pass it to the nwbfile._file attribute), but the location structure in the HDF5 file is not necessarily equivalent to the PyNWB module structure at all times. There are various aliases in the API that break that equivalency, and even my 'location in file' isn't exactly either approach, nor is it intended to be.

Also, this specific lookup is a piece of internal code masking what is really going on, which is reliance on the object ID; what you seem to be asking for is a more generic feature of PyNWB, so I'd suggest raising an issue over there to discuss with that team. I'm just saying it's not a feature I would ever use myself.

Note that the 'location in file' used in the BackendConfiguration lookup dictionary here is only a convenient association to the object ID; it is not 100% guaranteed to be perfectly identical to the PyNWB nesting structure at all times for all files, just 'generally close enough to distinguish references to contents'.

That is, I've made this nice heuristic for mapping references of location_in_file -> object_id, and then in this line of internal operation I go from object_id -> neurodata_object.
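
Roughly, that internal step amounts to the following sketch (nwbfile.objects is the PyNWB mapping of object IDs to in-memory containers; the attribute names on the configuration object are illustrative):

# location_in_file -> object_id happens in the BackendConfiguration;
# here we only need object_id -> neurodata_object, which is a flat lookup
nwbfile_objects = nwbfile.objects  # {object_id: neurodata_object}
neurodata_object = nwbfile_objects[dataset_configuration.object_id]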

@h-mayorquin (Collaborator) · Dec 7, 2023


But I think what you refer to is passing a single string that includes slashes to indicate the nested group structure

Yes, exactly, that's what I was saying as a side comment (not asking you): that it would be nice to have. I definitely did not make it clear enough that this is not related to this PR or even neuroconv. Sorry for making you write something longer than you would have, had that context been clear.

(which you can also do in PyNWB; just pass it to the nwbfile._file attribute)

Really? I need something like this. This does not seem to work:

nwbfile._file["acquisition/ElectricalSeriesAp"]
AttributeError: 'NWBFile' object has no attribute '_file'

@CodyCBakerPhD (Member Author)


Oh sorry, io._file

@CodyCBakerPhD (Member Author)


Though note that "acquisition/ElectricalSeriesAp" in that case will give you the h5py.Group, not the data h5py.Dataset - is that what you want? (and why do you want it?? lol)

@h-mayorquin (Collaborator)


Thanks @CodyCBakerPhD :

nwbfile.get_read_io()._file["acquisition/ElectricalSeriesAp"]

This works.

The context of this is here:
SpikeInterface/spikeinterface#2298

I want the codebase for NWB and HDF5 to differ as little as possible.

@h-mayorquin (Collaborator)


And in the HDF5 case, when streaming, it is way cheaper to just know the specific location.

if case_name != "unwrapped":  # TODO: eventually, even this case will be buffered automatically
    assert nwbfile.acquisition["TestTimeSeries"].data

nwbfile_path = str(tmpdir / f"test_configure_{backend}_defaults_{case_name}_data.nwb.h5")
@h-mayorquin (Collaborator)


I don't think the string is necessary, right? Also, why the .h5 extension in the path name?

@CodyCBakerPhD (Member Author)


It is. NWB{format}IO requires it to be a str or else it complains.

The .h5 is something being debated as a practice: including multiple file suffixes allows external programs (and us...) to quickly tell what type of backend a given NWB file uses without attempting to open it first.

For example, HDFView will now always recognize such files without having to go into the settings and adjust it to recognize the .nwb suffix.

The current opinion seems to be that for the next couple of years, .nwb will always be interpreted as HDF5 by default, and .nwb.zarr will be the identifier for files written using the new Zarr backend. DANDI in particular looks like it will use this suffix approach for deciding which required validations are appropriate (they have extra validations for Zarr assets).

But if Zarr gains popularity, it might be nice to get ahead of things and start using .nwb.h5 for symmetry along with that, to add clarity at a glance.
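
Illustrative only: with such a convention, backend detection becomes a pure name check (this sketch assumes the convention is actually followed, which is exactly the point debated below):

from pathlib import Path

def guess_backend_from_name(path) -> str:
    suffixes = Path(path).suffixes  # e.g. [".nwb", ".zarr"] or [".nwb", ".h5"]
    if suffixes and suffixes[-1] == ".zarr":
        return "zarr"
    return "hdf5"  # plain ".nwb" (or ".nwb.h5") interpreted as HDF5 by default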

@h-mayorquin (Collaborator)


OK.
You can quickly check for magic numbers to detect HDF5, and Zarr has the folder structure, so it is not a costly operation. I kind of dislike double extensions, probably because of my experience parsing them in CellExplorer. The feeling will probably go away once I get used to it. If the community is moving in that direction and you want to push that convention in the tests, I don't see any downside (yet). Thanks for writing an explanation.

@CodyCBakerPhD (Member Author)


You can quickly check for magic numbers to detect HDF5, and Zarr has the folder structure, so it is not a costly operation.

Do you have example code for this?

@h-mayorquin (Collaborator) · Dec 7, 2023


import os

import h5py
from pynwb import NWBHDF5IO
from pynwb.testing.mock.file import mock_NWBFile


def create_hdf5_file(filename):
    with h5py.File(filename, 'w') as f:
        f.create_dataset("default", (100,), dtype='i')

def create_random_binary_file(filename):
    with open(filename, 'wb') as f:
        f.write(os.urandom(1024))  # Write 1024 random bytes

def create_nwb_file(filename):
    nwbfile = mock_NWBFile()
    with NWBHDF5IO(filename, 'w') as io:
        io.write(nwbfile)

def is_hdf5_file(filename):
    # Source for the magic number: https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml
    # We should find a better reference though
    with open(filename, 'rb') as f:
        file_signature = f.read(8)
    return file_signature == b'\x89HDF\r\n\x1a\n'

# Create files
hdf5_filename = 'test.hdf5'
random_filename = 'random.bin'
nwb_filename = "nwbfile.nwb"

create_nwb_file(nwb_filename)
create_hdf5_file(hdf5_filename)
create_random_binary_file(random_filename)

# Check if files are HDF5
print(f"{hdf5_filename} is HDF5: {is_hdf5_file(hdf5_filename)}")
print(f"{nwb_filename} is HDF5: {is_hdf5_file(nwb_filename)}")
print(f"{random_filename} is HDF5: {is_hdf5_file(random_filename)}")

@h-mayorquin (Collaborator)


Test this. It works on my system.

@h-mayorquin (Collaborator)


Yes, now that I think about it, I don't like it. I think the convention is a bad idea. It will create fragmentation in how we name the files, and in the end we won't be able to trust it since it is just an informal convention. We will require some checks anyway.

@CodyCBakerPhD (Member Author) · Dec 7, 2023


Yes, now that I think about it, I don't like it. I think the convention is a bad idea. It will create fragmentation in how we name the files, and in the end we won't be able to trust it since it is just an informal convention. We will require some checks anyway.

You're going to have to clarify what you like/dislike here.

Also consider the performance involved; surely it's much quicker and easier to extract and check a few characters of the file name than it is to extract raw bytes from the file (oh, BTW, a .nwb.zarr file is actually a folder too) and assert that they follow some magic structure (also, what if that structure is inconsistent across HDF5 versions, or even platforms? is it guaranteed, documented, and persistent?), or even to ask whether something is a file vs. a folder. We might one day want to scan an entire dataset of thousands of files and ask which of them are HDF5 vs. which are Zarr.

Also, this is going to be more general, so if you could start the discussion on either the NWB or DANDI helpdesk to clarify, that would be great.

I know the DANDI team will push for .nwb.zarr since it makes it really easy to tell at a glance from the web view (where you don't even have the file yet, streamed or otherwise) and also requires much less work on their end to accommodate a plain .nwb file that could be either HDF5 or Zarr (besides the visual distinction of whether it is a folder or not... but I will point out they actually do some work to mask .zarr assets to make them appear as files on the web).

@h-mayorquin (Collaborator) · Dec 7, 2023


Also consider the performance involved; surely it's much quicker and easier to extract and check a few characters of the file name than it is to extract raw bytes from the file

This does not matter:
[screenshot omitted]

I am aware that .nwb.zarr is a folder; that's why I said you could test for that.

assert they follow some magic structure (also what if that structure is inconsistent over HDF5 versions, or even platforms?

Whatever it is, I am pretty sure that a magic number will be more consistent than an unenforced convention that we throw on the community.

I know the DANDI team will push for .nwb.zarr since it makes it really easy to tell at a glance from the web view

DANDI can control what enters their walled garden. Other tools that are used in the open cannot, and will have to either trust that the user followed the informal convention correctly (which will lead to errors) or verify. I think verifying by default is cheap and avoids throwing yet another norm at the users just so we programmers can have it easier.

That last point is basically the core of my argument. It is better for us as programmers to deal with noisy input than to subject our users to more pain so we have it easier. That's a design heuristic that I think is very important.

…ration/test_helpers/test_configure_backend_overrides.py
@h-mayorquin disabled auto-merge December 7, 2023 14:38
dataset_configuration.buffer_shape = smaller_buffer_shape

higher_gzip_level = 5
if backend == "hdf5":
@h-mayorquin (Collaborator)


What about these if branches? They seem to be the same.

@CodyCBakerPhD (Member Author)


Sorry, can you clarify what the question is? What is the same?

@CodyCBakerPhD (Member Author)


If you're asking about the keyword level, yes - that is an API unification that I chose in the tests, but it is not currently validated in either case because I haven't figured out a good way to construct a Pydantic model from the compression options of either API.

The Zarr case would fail later down the line if you passed a dict here with keyword pairs not valid for the specified method.

The HDF5 case would only fail if the values of the dict aren't of an expected type.

@CodyCBakerPhD (Member Author)


Ohhh lol OK I see it now

I mean, I could just get rid of the if/elif, I suppose, yes... less explicit, but you're right, it doesn't matter in this test case.

@h-mayorquin (Collaborator)


Yeah, no big deal, just in case they were actually meant to not be the same. Extra lines of code are fine.

dataset_configuration = backend_configuration.dataset_configurations["acquisition/TestTimeSeries/data"]
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)

nwbfile_path = str(tmpdir / f"test_configure_{backend}_defaults_{case_name}_data.nwb.h5")
@h-mayorquin (Collaborator)


Ah, I think it is the Zarr one that complains about the string :/

@CodyCBakerPhD (Member Author)


Oh, you mean the NWBHDF5IO allows a pathlib.Path nowadays?

Feel free to request Path support on HDMF-Zarr if that bugs you.

@h-mayorquin (Collaborator) · Dec 7, 2023


Yeah, I mean, it is mildly more convenient. I was asking out of curiosity, mostly in case you did not know that you could pass paths. But then I discovered that with the Zarr one you cannot, while I was experimenting with it today.

@h-mayorquin (Collaborator) left a comment


This looks good to me. Overall I only had questions; thanks for answering them @CodyCBakerPhD.

There is still the naming issue that I put up for your consideration. There is also a small if that I think is repeated in the tests. It is up to you whether you want to change those, and if you do, whether you do it here or in another PR is fine with me.

codecov bot commented Dec 7, 2023

Codecov Report

Merging #571 (7796a69) into main (c5c4b34) will increase coverage by 91.90%.
The diff coverage is 77.77%.


@@            Coverage Diff            @@
##           main     #571       +/-   ##
=========================================
+ Coverage      0   91.90%   +91.90%     
=========================================
  Files         0      113      +113     
  Lines         0     5792     +5792     
=========================================
+ Hits          0     5323     +5323     
- Misses        0      469      +469     
Flag Coverage Δ
unittests 91.90% <77.77%> (?)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...ainterfaces/ecephys/maxwell/maxonedatainterface.py 100.00% <ø> (ø)
src/neuroconv/tools/nwb_helpers/__init__.py 100.00% <100.00%> (ø)
..._helpers/_configuration_models/_base_dataset_io.py 96.46% <100.00%> (ø)
.../neuroconv/tools/nwb_helpers/_configure_backend.py 100.00% <100.00%> (ø)
..._helpers/_configuration_models/_hdf5_dataset_io.py 71.05% <0.00%> (ø)

... and 108 files with indirect coverage changes

@CodyCBakerPhD (Member Author)

OK, then I say two follow-ups:

  • consider name change to make IO explicit
  • increase coverage of dynamic filter cases for both HDF5 and Zarr

@CodyCBakerPhD merged commit ad5fdab into main Dec 7, 2023
36 checks passed
@CodyCBakerPhD deleted the new_backend_add_configure_backend branch December 7, 2023 18:10
@CodyCBakerPhD changed the title from "[Backend Configuration III] Configure all datasets in-memory to be written in specific way to disk" to "[Backend Configuration IIIa] Configure all datasets in-memory to be written in specific way to disk" on Jan 2, 2024