[Backend Configuration IIIa] Configure all datasets in-memory to be written in specific way to disk #571
Conversation
@h-mayorquin Should be ready for review
@CodyCBakerPhD conflicts and some tests still failing.
@h-mayorquin Should be good now
Some questions and this should be ready to go.
# TODO: update buffer shape in iterator, if present
nwbfile_objects[object_id].set_data_io(dataset_name=dataset_name, data_io_class=data_io_class, **data_io_kwargs)
It would be nice if we could get the neurodata object by location.
I am also curious why the set_data_io is at the neurodata object level and not at the dataset level. That's where the IO stuff ends up, right?
It would be nice if we could get the neurodata object by location.
Disagree in this case
The location is just a heuristic for allowing a human to easily and intuitively match an object ID to an object at some relative location inside the file, hence why we use that in the intermediary backend configuration class (which lists out by location in a flat way)
Whereas when interacting with the actual file object in memory, the object ID is the true identifier and allows access in a flat way instead of imposing nested structure for the object lookup (which is what the location would do if we used it here)
I am also curious why the set_data_io is at the neurodata object level and not at the dataset level. That's where the IO stuff ends up, right?
It's just the way it was implemented upstream in HDMF. I think it's because it's easier to define such an alteration on the objects of a container, which makes it a method of the container with an argument used to point to a specific dataset on that container.
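For illustration, a minimal sketch of what that container-level call looks like from user code, assuming an in-memory NWBFile with a TimeSeries named TestTimeSeries (the names and compression settings are placeholders, and the exact set_data_io signature may differ across HDMF versions):

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries
from hdmf.backends.hdf5 import H5DataIO

# Hypothetical example: wrap the 'data' dataset of an in-memory TimeSeries so
# that it is compressed when the NWBFile is later written with the HDF5 backend.
nwbfile = mock_NWBFile()
time_series = mock_TimeSeries(name="TestTimeSeries")
nwbfile.add_acquisition(time_series)

time_series.set_data_io(
    dataset_name="data",     # which dataset on the container to wrap
    data_io_class=H5DataIO,  # backend-specific DataIO class
    compression="gzip",      # forwarded to H5DataIO as **data_io_kwargs
    compression_opts=4,
)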
Disagree in this case
You disagree that it would be good to have it?
It is good to have both. Most people are used to file systems, so that's easy to use, but unique identifiers are important for validation and programmatic use. What would be the downside?
In HDF5 you can access things by their path within the file, which I think makes sense from a usability perspective.
If you're asking/commenting in general that 'it would be nice to fetch an object in an NWB file knowing its path', then that already exists in the PyNWB API (it's the natural way of navigating group/dataset contents across the modules and submodules, but it relies on stepping through each level of nesting).
But I think what you refer to is passing a single string that includes slashes to indicate the nested group structure (which you can also do in PyNWB; just pass it to the nwbfile._file attribute). However, the location structure in the HDF5 file is not necessarily equivalent to the PyNWB module structure at all times. There are various aliases in the API that break that equivalency, and even my 'location in file' isn't exactly either approach, nor is it intended to be.
Also, this specific lookup is a piece of internal code masking what is really going on, which is reliance on the object ID; what you seem to be asking for is a more generic feature of PyNWB, so I'd suggest raising an issue over there to discuss with that team. I'm just saying it's not a feature I would ever use myself.
Note that the 'location in file' used in the BackendConfiguration lookup dictionary here is only a convenient association to the object ID; it is not 100% guaranteed to be perfectly identical to the PyNWB nesting structure at all times for all files, just 'generally close enough to distinguish references to contents'.
That is, I've made this nice heuristic for mapping references of location_in_file -> object_id, and then in this line of internal operation, I go from object_id -> neurodata_object.
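Roughly, that chained lookup could be sketched like this (a sketch only; it assumes an nwbfile and backend_configuration from the surrounding configure_backend context, and the attribute names on the dataset configuration are illustrative):

# location_in_file -> object_id: the flat, human-readable mapping stored on the
# backend configuration; object_id -> neurodata_object: the flat registry of
# objects that PyNWB keeps on the in-memory NWBFile.
nwbfile_objects = nwbfile.objects  # dict keyed by object ID

for location_in_file, dataset_configuration in backend_configuration.dataset_configurations.items():
    object_id = dataset_configuration.object_id  # illustrative field name
    neurodata_object = nwbfile_objects[object_id]
    print(f"{location_in_file} -> {object_id} -> {type(neurodata_object).__name__}")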
But I think what you refer to is passing a single string that includes slashes to indicate the nested group structure
Yes, exactly, that's what I was saying as a side comment (not asking you): that it would be nice to have. I definitely did not make it clear enough that this is not related to this PR or even neuroconv. Sorry for making you write something longer than you would have, had that context been clear.
(which you can also do in PyNWB; just pass it to the nwbfile._file attribute)
Really? I need something like this. This does not seem to work:
nwbfile._file["acquisition/ElectricalSeriesAp"]
AttributeError: 'NWBFile' object has no attribute '_file'
Oh sorry, io._file
Though note "acquisition/ElectricalSeriesAp" in that case will give you the h5py.Group, not the data h5py.Dataset - is that what you want? (and why do you want it?? lol)
Thanks @CodyCBakerPhD :
nwbfile.get_read_io()._file["acquisition/ElectricalSeriesAp"]
This works.
The context of this is here:
SpikeInterface/spikeinterface#2298
I want the codebase for NWB and hdf5 to differ as little as possible
And in the HDF5 case, when streaming, it is way cheaper to just know the specific location.
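For reference, a small sketch contrasting the two access levels discussed above, assuming an existing NWB file on disk that contains an ElectricalSeries named ElectricalSeriesAp (the file name is a placeholder):

from pynwb import NWBHDF5IO

with NWBHDF5IO("example.nwb", mode="r") as io:
    nwbfile = io.read()

    # PyNWB-level navigation through the module structure:
    electrical_series = nwbfile.acquisition["ElectricalSeriesAp"]

    # h5py-level navigation by HDF5 path (private attribute, as discussed above);
    # the group vs. the underlying dataset:
    h5_group = io._file["acquisition/ElectricalSeriesAp"]         # h5py.Group
    h5_dataset = io._file["acquisition/ElectricalSeriesAp/data"]  # h5py.Dataset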
if case_name != "unwrapped":  # TODO: eventually, even this case will be buffered automatically
    assert nwbfile.acquisition["TestTimeSeries"].data

nwbfile_path = str(tmpdir / f"test_configure_{backend}_defaults_{case_name}_data.nwb.h5")
I don't think the string is necessary, right? Also, why the .h5 extension in the path name?
It is. NWB{format}IO requires it to be a str, or else it complains.
The .h5 is something being debated as a practice: including multiple file suffixes to allow external programs (and us...) to quickly tell what type of backend a given NWB file uses without attempting to open it first. For example, HDFView will now always recognize such files without having to go into the settings and adjust it to recognize the .nwb suffix.
The current opinion seems to be that for the next couple of years, .nwb will always be interpreted as HDF5 by default, and .nwb.zarr will be the identifier of files written using the new Zarr backend. DANDI in particular looks like it will use this suffix approach for deciding what required validations are appropriate (they have extra validations for Zarr assets).
But if Zarr gains popularity, it might be nice to just get ahead of things and start doing .nwb.h5 for symmetry along with that, to add clarity at a glance.
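Under that suffix convention, detecting the backend could stay a cheap string check on the name alone, something like the following sketch (the helper name is hypothetical, and the convention itself is exactly what is being debated here):

from pathlib import Path

def backend_from_suffix(path) -> str:
    # Hypothetical helper: trust the naming convention and never open the asset.
    suffixes = "".join(Path(path).suffixes)
    if suffixes.endswith(".nwb.zarr"):
        return "zarr"
    if suffixes.endswith(".nwb.h5") or suffixes.endswith(".nwb"):
        return "hdf5"  # plain .nwb interpreted as HDF5 by default
    return "unknown"

print(backend_from_suffix("test_data.nwb.h5"))    # hdf5
print(backend_from_suffix("test_data.nwb.zarr"))  # zarr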
OK.
You can quickly check for magic numbers to detect HDF5, and Zarr has the folder structure, so it is not a costly operation. I kind of dislike double extensions, probably because of my experience with parsing them in the cell explorer. The feeling will probably go away once I get used to it. If the community is moving in that direction and you want to push that convention in the tests, I don't see any downside (yet). Thanks for writing an explanation.
You can quickly check for magic numbers to detect HDF5, and Zarr has the folder structure, so it is not a costly operation.
Do you have example code for this?
import h5py
import os

def create_hdf5_file(filename):
    with h5py.File(filename, 'w') as f:
        f.create_dataset("default", (100,), dtype='i')

def create_random_binary_file(filename):
    with open(filename, 'wb') as f:
        f.write(os.urandom(1024))  # Write 1024 random bytes

def create_nwb_file(filename):
    from pynwb.testing.mock.file import mock_NWBFile
    nwbfile = mock_NWBFile()
    # Save the file
    from pynwb import NWBHDF5IO
    with NWBHDF5IO(filename, 'w') as io:
        io.write(nwbfile)

def is_hdf5_file(filename):
    # Source for magic numbers: https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml
    # We should find a better one though
    with open(filename, 'rb') as f:
        file_signature = f.read(8)
    return file_signature == b'\x89HDF\r\n\x1a\n'

# Create files
hdf5_filename = 'test.hdf5'
random_filename = 'random.bin'
nwb_filename = "nwbfile.nwb"

create_nwb_file(nwb_filename)
create_hdf5_file(hdf5_filename)
create_random_binary_file(random_filename)

# Check if files are HDF5
print(f"{hdf5_filename} is HDF5: {is_hdf5_file(hdf5_filename)}")
print(f"{nwb_filename} is HDF5: {is_hdf5_file(nwb_filename)}")
print(f"{random_filename} is HDF5: {is_hdf5_file(random_filename)}")
Test this. It works on my system.
Yes, now that I think about it, I don't like it. I think the convention is a bad idea. It will create fragmentation in how we name the files, and in the end we won't be able to trust it since it is just an informal convention. We will require some checks anyway.
Yes, now that I think about it, I don't like it. I think the convention is a bad idea. It will create fragmentation in how we name the files, and in the end we won't be able to trust it since it is just an informal convention. We will require some checks anyway.
Going to have to clarify what you like/dislike here.
Also consider the performance involved; surely it's much quicker and easier to extract and assert a few bytes of string from the file name than to extract raw bytes from the file (oh, BTW, a .nwb.zarr file is actually a folder too) and assert they follow some magic structure (also, what if that structure is inconsistent over HDF5 versions, or even platforms? is it guaranteed, documented, and persistent?), or even to ask if something is a file vs. a folder. We might one day want to scan an entire dataset of thousands of files and ask which of them are HDF5 vs. which ones are Zarr.
Also, this is going to be more general, so if you could start the discussion on either the NWB or DANDI helpdesk to clarify, that would be great.
I know the DANDI team will push for .nwb.zarr since it makes it really easy to tell at a glance from the web view (where you don't even have the file yet, streamed or otherwise) and also requires much less work to modify on their end to accommodate a plain .nwb file that could be either HDF5 or Zarr (besides the visual distinction of whether it is a folder or not... but I will point out they actually do some work to mask .zarr assets to make them appear as files in the web view).
Also consider the performance involved; surely it's much quicker and easier to extract and assert a few bytes of string from the file name than to extract raw bytes from the file
I am aware that nwb.zarr is a folder, that's why I said you could test for that.
assert they follow some magic structure (also, what if that structure is inconsistent over HDF5 versions, or even platforms?
Whatever it is, I am pretty sure that a magic number will be more consistent than an unenforced behavior that we will throw on the community.
I know the DANDI team will push for .nwb.zarr since it makes it really easy to tell at a glance from the web view
DANDI can control what enters their walled garden. Other tools that are used in the open cannot, and will have to either trust that the user used the informal convention correctly (which will lead to errors) or verify. I think verify-by-default is cheap and avoids throwing yet another norm at the users so we programmers can have it easier.
That last sentence is basically the core of my point. It is better for us programmers to deal with noisy input than to subject our users to more pain so we have it easier. That's a design heuristic that I think is very important.
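For completeness, the content-based check being argued for here might look roughly like this (a sketch: the HDF5 check uses the standard 8-byte magic number from the snippet above, and the Zarr check assumes a Zarr v2 directory store with a .zgroup or .zarray at its root):

from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # standard 8-byte HDF5 magic number

def detect_backend(path) -> str:
    # "Verify by default": inspect the asset itself rather than trust its name.
    path = Path(path)
    if path.is_dir():
        # Zarr v2 directory stores keep .zgroup/.zarray metadata at the root.
        if (path / ".zgroup").exists() or (path / ".zarray").exists():
            return "zarr"
        return "unknown"
    with open(path, "rb") as f:
        if f.read(8) == HDF5_SIGNATURE:
            return "hdf5"
    return "unknown"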
dataset_configuration.buffer_shape = smaller_buffer_shape

higher_gzip_level = 5
if backend == "hdf5":
What about these if statements? They seem to be the same.
Sorry, can you clarify what the question is? What is the same?
If you're asking about the keyword level, yes - that is an API unification that I chose in the tests, but it is not currently validated in either case because I haven't figured out a good way to construct a Pydantic model from the compression options for either API.
The Zarr case would fail later down the line if you passed a dict here with keyword pairs not valid for the specified method.
The HDF5 case would only fail if the values of the dict aren't an expected type.
Ohhh lol OK I see it now.
I mean, I could just get rid of the if/elif, I suppose, yes... less explicit, but you're right, it doesn't matter in this test case.
Yeah, no big deal, just in case they were actually meant to not be the same. Extra lines of code are fine.
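For context, the if/elif being discussed presumably reduces to something like the following reconstruction (hedged; the actual test code may differ), which is why both branches look identical once the level keyword is unified across backends:

# Reconstructed sketch of the redundant-looking branches (not the actual test code).
higher_gzip_level = 5
if backend == "hdf5":
    dataset_configuration.compression_options = dict(level=higher_gzip_level)
elif backend == "zarr":
    dataset_configuration.compression_options = dict(level=higher_gzip_level)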
dataset_configuration = backend_configuration.dataset_configurations["acquisition/TestTimeSeries/data"]
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)

nwbfile_path = str(tmpdir / f"test_configure_{backend}_defaults_{case_name}_data.nwb.h5")
Ah, I think it is the Zarr one that complains about the string :/
Oh, you mean the NWBHDF5IO allows a pathlib.Path nowadays?
Feel free to request Path support on HDMF-Zarr if that bugs you.
Yeah, I mean, it is mildly more convenient. I was asking out of curiosity, mostly in case you did not know that you could pass paths. But then I discovered that with the Zarr one you cannot, when I was experimenting with it today.
This looks good to me. Overall I only had questions, thanks for answering them @CodyCBakerPhD.
There is still the issue of naming that I put to your consideration. There is also a small if that I think is repeated in the tests. Up to you whether you want to change it, and if you do, whether you do it here or in another PR is fine with me.
Codecov Report
@@ Coverage Diff @@
##           main     #571       +/-   ##
=========================================
+ Coverage      0   91.90%   +91.90%
=========================================
  Files         0      113      +113
  Lines         0     5792     +5792
=========================================
+ Hits          0     5323     +5323
- Misses        0      469      +469
OK then I say two follow-ups
replaces #557
This is the most useful piece of everything built up to this point; the application of the actual DataIO classes given a user-selected backend
These operations are still introduced as standalone tools for the time being; they can't be integrated into the main Interface/Converter workflow until the HDMF-zarr and HDMF package incompatibilities are resolved upstream
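As a rough sketch of how these standalone tools are intended to chain together (assuming the get_default_backend_configuration helper from the earlier PRs in this series and the key/name conventions shown in the tests above; the example file content is a placeholder):

from pynwb import NWBHDF5IO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries
from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration

# Build an in-memory NWBFile (placeholder content).
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(mock_TimeSeries(name="TestTimeSeries"))

# 1) Get the default per-dataset configuration for the chosen backend.
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")

# 2) Optionally inspect/tweak individual dataset configurations before applying.
dataset_configuration = backend_configuration.dataset_configurations["acquisition/TestTimeSeries/data"]
print(dataset_configuration)

# 3) Apply the configuration to the in-memory objects (this PR's configure_backend).
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)

# 4) Write to disk as usual.
with NWBHDF5IO("example.nwb.h5", mode="w") as io:
    io.write(nwbfile)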