
[Backend Configuration IIIa] Configure all datasets in-memory to be written in specific way to disk #571

Merged: 22 commits into main on Dec 7, 2023

Conversation

@CodyCBakerPhD (Member)

replaces #557

This is the most useful piece of everything built up to this point: the application of the actual DataIO classes given a user-selected backend.

These operations are still introduced as standalone tools for the time being; they cannot be integrated into the main Interface/Converter workflow until the HDMF-Zarr and HDMF package incompatibilities are resolved upstream.
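
As a rough sketch, the intended standalone usage looks something like the following (the helper names mirror those touched in this PR, e.g. configure_backend in src/neuroconv/tools/nwb_helpers; exact signatures may differ):

from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration

# nwbfile: an in-memory pynwb.NWBFile, e.g. built by an interface/converter
# Build the per-dataset configuration for a chosen backend ("hdf5" or "zarr"),
# optionally tweak chunking/compression per dataset, then apply it to the in-memory NWBFile
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)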

@CodyCBakerPhD changed the title from "[Backend Configuration III] Configure all datasets in-memory to be written in specific way to ddisk" to "[Backend Configuration III] Configure all datasets in-memory to be written in specific way to disk" on Sep 25, 2023
Base automatically changed from new_backend_add_backend_configuration_tool to main November 28, 2023 19:41
@CodyCBakerPhD marked this pull request as ready for review December 4, 2023 03:46
@CodyCBakerPhD (Member Author)

@h-mayorquin Should be ready for review

@h-mayorquin (Collaborator)

@CodyCBakerPhD Conflicts and some tests are still failing.

@CodyCBakerPhD (Member Author)

@h-mayorquin Should be good now

@h-mayorquin (Collaborator) left a comment


Some questions and this should be ready to go.


# TODO: update buffer shape in iterator, if present

nwbfile_objects[object_id].set_data_io(dataset_name=dataset_name, data_io_class=data_io_class, **data_io_kwargs)
@h-mayorquin (Collaborator)


It would be nice if we could get the neurodata object by location.

I am also curious why set_data_io is at the neurodata object level and not at the dataset level. That's where the IO stuff ends up, right?

@CodyCBakerPhD (Member Author)


It would be nice if we could get the neurodata object by location.

Disagree in this case

The location is just a heuristic for allowing a human to easily and intuitively match an object ID to an object at some relative location inside the file, which is why we use it in the intermediary backend configuration class (which lists everything out by location in a flat way).

Whereas when interacting with the actual file object in memory, the object ID is the true identifier and allows access in a flat way instead of imposing a nested structure on the object lookup (which is what the location would do if we used it here).

I am also curious why set_data_io is at the neurodata object level and not at the dataset level. That's where the IO stuff ends up, right?

It's just the way it was implemented upstream in HDMF. I think it's because it's easier to define such an alteration on the objects of a container, which makes it a method of the container with an argument pointing to a specific dataset on that container.
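
For illustration, a minimal sketch of that container-level pattern (assuming the HDMF signature used in the diff line above; the TimeSeries name is taken from the tests and is otherwise hypothetical):

from pynwb import H5DataIO

# set_data_io is called on the container (the neurodata object), and the
# dataset_name argument points at the specific dataset on that container to wrap
time_series = nwbfile.acquisition["TestTimeSeries"]
time_series.set_data_io(
    dataset_name="data",
    data_io_class=H5DataIO,
    compression="gzip",      # forwarded to the DataIO constructor as **data_io_kwargs
    compression_opts=4,
)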

@h-mayorquin (Collaborator)


Disagree in this case

You disagree that it would be good to have it?

It is good to have both. Most people are used to file systems, so that's easy to use, but unique identifiers are important for validation and programmatic use. What would be the downside?

In HDF5 you can access things by their path within the file, which I think makes sense from a usability perspective.

@CodyCBakerPhD (Member Author) · Dec 7, 2023


If you're asking/commenting in general that 'it would be nice to fetch an object in an NWB file knowing its path', then that already exists in the PyNWB API (it's the natural way of navigating group/dataset contents across the modules and submodules, but it relies on traversing each level of nesting).

But I think what you are referring to is passing a single string that includes slashes to indicate the nested group structure (which you can also do in PyNWB; just pass it to the nwbfile._file attribute), but the location structure in the HDF5 file is not necessarily equivalent to the PyNWB module structure at all times. There are various aliases in the API that break that equivalency, and even my 'location in file' isn't exactly either approach, nor is it intended to be.

Also, this specific lookup is a piece of internal code masking what is really going on, which is reliance on the object ID; what you seem to be asking for is a more generic feature of PyNWB, so I'd suggest raising an issue over there to discuss with that team. I'm just saying it's not a feature I would ever use myself.

Note that the 'location in file' used in the BackendConfiguration lookup dictionary here is only a convenient association to the object ID; it is not 100% guaranteed to be perfectly identical to the PyNWB nesting structure at all times for all files, just 'generally close enough to distinguish references to contents'.

That is, I've made this nice heuristic for mapping references of location_in_file -> object_id, and then in this line of internal operation I go from object_id -> neurodata_object.
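
Roughly, that internal step amounts to the following sketch (nwbfile.objects is the PyNWB mapping of object IDs to in-memory containers; the attribute names on the configuration object are illustrative):

# location_in_file -> object_id happens in the BackendConfiguration;
# here we only need object_id -> neurodata_object, which is a flat lookup
nwbfile_objects = nwbfile.objects  # {object_id: neurodata_object}
neurodata_object = nwbfile_objects[dataset_configuration.object_id]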

@h-mayorquin (Collaborator) · Dec 7, 2023


But I think what you refer to is passing a single string that includes slashes to indicate the nested group structure

Yes, exactly, that's what I was saying as a side comment (not asking you): that it would be nice to have. I definitely did not make it clear enough that this is not related to this PR or even neuroconv. Sorry for making you write something longer than you would have, had that context been clear.

(which you can also do in PyNWB; just pass it to the nwbfile._file attribute)

Really? I need something like this. This does not seem to work:

nwbfile._file["acquisition/ElectricalSeriesAp"]
AttributeError: 'NWBFile' object has no attribute '_file'

@CodyCBakerPhD (Member Author)


Oh sorry, io._file

@CodyCBakerPhD (Member Author)


Though note that "acquisition/ElectricalSeriesAp" in that case will give you the h5py.Group, not the data h5py.Dataset - is that what you want? (and why do you want it?? lol)

@h-mayorquin (Collaborator)


Thanks @CodyCBakerPhD :

nwbfile.get_read_io()._file["acquisition/ElectricalSeriesAp"]

This works.

The context of this is here:
SpikeInterface/spikeinterface#2298

I want the codebase for NWB and HDF5 to differ as little as possible.

@h-mayorquin (Collaborator)


And in the HDF5 case, when streaming, it is way cheaper to just know the specific location.

if case_name != "unwrapped":  # TODO: eventually, even this case will be buffered automatically
    assert nwbfile.acquisition["TestTimeSeries"].data

nwbfile_path = str(tmpdir / f"test_configure_{backend}_defaults_{case_name}_data.nwb.h5")
@h-mayorquin (Collaborator)


I don't think the string is necessary, right? Also, why the .h5 extension in the path name?

@CodyCBakerPhD (Member Author)


It is. NWB{format}IO requires it to be a str or else it complains.

The .h5 is something being debated as a practice: including multiple file suffixes allows external programs (and us...) to quickly tell what type of backend a given NWB file uses without attempting to open it first.

For example, HDFView will now always recognize such files without having to go into the settings and adjust it to recognize the .nwb suffix.

The current opinion seems to be that for the next couple of years, .nwb will always be interpreted as HDF5 by default, and .nwb.zarr will be the identifier for files written using the new Zarr backend. DANDI in particular looks like it will use this suffix approach for deciding which required validations are appropriate (they have extra validations for Zarr assets).

But if Zarr gains popularity, it might be nice to get ahead of things and start using .nwb.h5 for symmetry along with that, to add clarity at a glance.
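
Illustrative only: with such a convention, backend detection becomes a pure name check (this sketch assumes the convention is actually followed, which is exactly the point debated below):

from pathlib import Path

def guess_backend_from_name(path) -> str:
    suffixes = Path(path).suffixes  # e.g. [".nwb", ".zarr"] or [".nwb", ".h5"]
    if suffixes and suffixes[-1] == ".zarr":
        return "zarr"
    return "hdf5"  # plain ".nwb" (or ".nwb.h5") interpreted as HDF5 by default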

@h-mayorquin (Collaborator)


OK.
You can quickly check for magic numbers to detect HDF5, and Zarr has the folder structure, so it is not a costly operation. I kind of dislike double extensions, probably because of my experience parsing them in CellExplorer. The feeling will probably go away once I get used to it. If the community is moving in that direction and you want to push that convention in the tests, I don't see any downside (yet). Thanks for writing an explanation.

@CodyCBakerPhD (Member Author)


You can quickly check for magic numbers to detect HDF5, and Zarr has the folder structure, so it is not a costly operation.

Do you have example code for this?

@h-mayorquin (Collaborator) · Dec 7, 2023


import os

import h5py
from pynwb import NWBHDF5IO
from pynwb.testing.mock.file import mock_NWBFile


def create_hdf5_file(filename):
    with h5py.File(filename, 'w') as f:
        f.create_dataset("default", (100,), dtype='i')

def create_random_binary_file(filename):
    with open(filename, 'wb') as f:
        f.write(os.urandom(1024))  # Write 1024 random bytes

def create_nwb_file(filename):
    nwbfile = mock_NWBFile()
    with NWBHDF5IO(filename, 'w') as io:
        io.write(nwbfile)

def is_hdf5_file(filename):
    # Source for the magic number: https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml
    # We should find a better reference though
    with open(filename, 'rb') as f:
        file_signature = f.read(8)
    return file_signature == b'\x89HDF\r\n\x1a\n'

# Create files
hdf5_filename = 'test.hdf5'
random_filename = 'random.bin'
nwb_filename = "nwbfile.nwb"

create_nwb_file(nwb_filename)
create_hdf5_file(hdf5_filename)
create_random_binary_file(random_filename)

# Check if files are HDF5
print(f"{hdf5_filename} is HDF5: {is_hdf5_file(hdf5_filename)}")
print(f"{nwb_filename} is HDF5: {is_hdf5_file(nwb_filename)}")
print(f"{random_filename} is HDF5: {is_hdf5_file(random_filename)}")

@h-mayorquin (Collaborator)


Test this. It works on my system.

@h-mayorquin (Collaborator)


Yes, now that I think about it, I don't like it. I think the convention is a bad idea. It will create fragmentation in how we name the files, and in the end we won't be able to trust it since it is just an informal convention. We will require some checks anyway.

@CodyCBakerPhD (Member Author) · Dec 7, 2023


Yes, now that I think about it, I don't like it. I think the convention is a bad idea. It will create fragmentation in how we name the files, and in the end we won't be able to trust it since it is just an informal convention. We will require some checks anyway.

You're going to have to clarify what you like/dislike here.

Also consider the performance involved; surely it's much quicker and easier to extract and check a few characters of the file name than it is to extract raw bytes from the file (oh, BTW, a .nwb.zarr file is actually a folder too) and assert that they follow some magic structure (also, what if that structure is inconsistent across HDF5 versions, or even platforms? is it guaranteed, documented, and persistent?), or even to ask whether something is a file vs. a folder. We might one day want to scan an entire dataset of thousands of files and ask which of them are HDF5 vs. which are Zarr.

Also, this is going to be more general, so if you could start the discussion on either the NWB or DANDI helpdesk to clarify, that would be great.

I know the DANDI team will push for .nwb.zarr since it makes it really easy to tell at a glance from the web view (where you don't even have the file yet, streamed or otherwise) and also requires much less work on their end to accommodate a plain .nwb file that could be either HDF5 or Zarr (besides the visual distinction of whether it is a folder or not... but I will point out they actually do some work to mask .zarr assets to make them appear as files on the web).

@h-mayorquin (Collaborator) · Dec 7, 2023


Also consider the performance involved; surely it's much quicker and easier to extract and check a few characters of the file name than it is to extract raw bytes from the file

This does not matter:
[screenshot omitted]

I am aware that .nwb.zarr is a folder; that's why I said you could test for that.

assert they follow some magic structure (also what if that structure is inconsistent over HDF5 versions, or even platforms?

Whatever it is, I am pretty sure that a magic number will be more consistent than an unenforced convention that we throw on the community.

I know the DANDI team will push for .nwb.zarr since it makes it really easy to tell at a glance from the web view

DANDI can control what enters their walled garden. Other tools that are used in the open cannot, and will have to either trust that the user followed the informal convention correctly (which will lead to errors) or verify. I think verifying by default is cheap and avoids throwing yet another norm at the users just so we programmers can have it easier.

That last point is basically the core of my argument. It is better for us as programmers to deal with noisy input than to subject our users to more pain so we have it easier. That's a design heuristic that I think is very important.

…ration/test_helpers/test_configure_backend_overrides.py
@h-mayorquin disabled auto-merge December 7, 2023 14:38
dataset_configuration.buffer_shape = smaller_buffer_shape

higher_gzip_level = 5
if backend == "hdf5":
@h-mayorquin (Collaborator)


What about these if branches? They seem to be the same.

@CodyCBakerPhD (Member Author)


Sorry, can you clarify what the question is? What is the same?

@CodyCBakerPhD (Member Author)


If you're asking about the keyword level, yes - that is an API unification that I chose in the tests, but it is not currently validated in either case because I haven't figured out a good way to construct a Pydantic model from the compression options of either API.

The Zarr case would fail later down the line if you passed a dict here with keyword pairs not valid for the specified method.

The HDF5 case would only fail if the values of the dict aren't of an expected type.

@CodyCBakerPhD (Member Author)


Ohhh lol OK I see it now

I mean, I could just get rid of the if/elif, I suppose, yes... less explicit, but you're right, it doesn't matter in this test case.

@h-mayorquin (Collaborator)


Yeah, no big deal, just in case they were actually meant to not be the same. Extra lines of code are fine.

dataset_configuration = backend_configuration.dataset_configurations["acquisition/TestTimeSeries/data"]
configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)

nwbfile_path = str(tmpdir / f"test_configure_{backend}_defaults_{case_name}_data.nwb.h5")
@h-mayorquin (Collaborator)


Ah, I think it is the Zarr one that complains about the string :/

@CodyCBakerPhD (Member Author)


Oh, you mean the NWBHDF5IO allows a pathlib.Path nowadays?

Feel free to request Path support on HDMF-Zarr if that bugs you.

@h-mayorquin (Collaborator) · Dec 7, 2023


Yeah, I mean, it is mildly more convenient. I was asking out of curiosity, mostly in case you did not know that you could pass paths. But then I discovered that with the Zarr one you cannot, while I was experimenting with it today.

@h-mayorquin (Collaborator) left a comment


This looks good to me. Overall I only had questions; thanks for answering them @CodyCBakerPhD.

There is still the naming issue that I put up for your consideration. There is also a small if that I think is repeated in the tests. It is up to you whether you want to change those, and if you do, whether you do it here or in another PR is fine with me.

codecov bot commented Dec 7, 2023

Codecov Report

Merging #571 (7796a69) into main (c5c4b34) will increase coverage by 91.90%.
The diff coverage is 77.77%.


@@            Coverage Diff            @@
##           main     #571       +/-   ##
=========================================
+ Coverage      0   91.90%   +91.90%     
=========================================
  Files         0      113      +113     
  Lines         0     5792     +5792     
=========================================
+ Hits          0     5323     +5323     
- Misses        0      469      +469     
Flag Coverage Δ
unittests 91.90% <77.77%> (?)

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...ainterfaces/ecephys/maxwell/maxonedatainterface.py 100.00% <ø> (ø)
src/neuroconv/tools/nwb_helpers/__init__.py 100.00% <100.00%> (ø)
..._helpers/_configuration_models/_base_dataset_io.py 96.46% <100.00%> (ø)
.../neuroconv/tools/nwb_helpers/_configure_backend.py 100.00% <100.00%> (ø)
..._helpers/_configuration_models/_hdf5_dataset_io.py 71.05% <0.00%> (ø)

... and 108 files with indirect coverage changes

@CodyCBakerPhD (Member Author)

OK, then I say two follow-ups:

  • consider name change to make IO explicit
  • increase coverage of dynamic filter cases for both HDF5 and Zarr

@CodyCBakerPhD merged commit ad5fdab into main Dec 7, 2023
36 checks passed
@CodyCBakerPhD deleted the new_backend_add_configure_backend branch December 7, 2023 18:10
@CodyCBakerPhD changed the title from "[Backend Configuration III] Configure all datasets in-memory to be written in specific way to disk" to "[Backend Configuration IIIa] Configure all datasets in-memory to be written in specific way to disk" on Jan 2, 2024