Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing a name with colon : fails in zarr in windows #219

Closed
h-mayorquin opened this issue Sep 11, 2024 · 11 comments · Fixed by hdmf-dev/hdmf#1202
Closed

Writing a name with colon : fails in zarr in windows #219

h-mayorquin opened this issue Sep 11, 2024 · 11 comments · Fixed by hdmf-dev/hdmf#1202
Assignees
Labels
category: enhancement improvements of code or code behavior priority: medium non-critical problem and/or affecting only a small set of users
Milestone

Comments

@h-mayorquin
Copy link

This produces an error on windows (see the name of the TimeSeries)

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.image import TimeSeries


nwbfile = mock_NWBFile()
time_series = TimeSeries(name="Name: name", rate=1.0, data=[1, 2, 3], unit="unit")
nwbfile.add_acquisition(time_series)

from hdmf_zarr.nwb import NWBZarrIO
nwbfile_path = "test.nwb"

with NWBZarrIO(path=nwbfile_path, mode="w") as io:
    io.write(nwbfile)

Error trace:

Click here for error trace { "name": "KeyError", "message": "'acquisition/Name: name/.zgroup'", "stack": "--------------------------------------------------------------------------- NotADirectoryError Traceback (most recent call last) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\zarr\\storage.py:1136, in DirectoryStore.__setitem__(self, key, value) 1135 try: -> 1136 os.makedirs(dir_path) 1137 except OSError as e:

File :225, in makedirs(name, mode, exist_ok)

NotADirectoryError: [WinError 267] The directory name is invalid: 'c:\\Users\\heberto\\development\
euroconv\\test.nwb\\acquisition/Name: name'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
Cell In[5], line 13
10 nwbfile_path = "test.nwb"
12 with NWBZarrIO(path=nwbfile_path, mode="w") as io:
---> 13 io.write(nwbfile)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf\utils.py:668, in docval..dec..func_call(*args, **kwargs)
666 def func_call(*args, **kwargs):
667 pargs = _check_args(args, kwargs)
--> 668 return func(args[0], **pargs)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf_zarr\backend.py:277, in ZarrIO.write(self, **kwargs)
267 cache_spec, number_of_jobs, max_threads_per_process, multiprocessing_context = popargs(
268 "cache_spec", "number_of_jobs", "max_threads_per_process", "multiprocessing_context", kwargs
269 )
271 self.__dci_queue = ZarrIODataChunkIteratorQueue(
272 number_of_jobs=number_of_jobs,
273 max_threads_per_process=max_threads_per_process,
274 multiprocessing_context=multiprocessing_context,
275 )
--> 277 super(ZarrIO, self).write(**kwargs)
278 if cache_spec:
279 self.__cache_spec()

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf\utils.py:668, in docval..dec..func_call(*args, **kwargs)
666 def func_call(*args, **kwargs):
667 pargs = _check_args(args, kwargs)
--> 668 return func(args[0], **pargs)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf\backends\io.py:99, in HDMFIO.write(self, **kwargs)
97 """Write a container to the IO source."""
98 f_builder = self.__manager.build(container, source=self.__source, root=True)
---> 99 self.write_builder(f_builder, **kwargs)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf\utils.py:668, in docval..dec..func_call(*args, **kwargs)
666 def func_call(*args, **kwargs):
667 pargs = _check_args(args, kwargs)
--> 668 return func(args[0], **pargs)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf_zarr\backend.py:440, in ZarrIO.write_builder(self, **kwargs)
436 f_builder, link_data, exhaust_dci, export_source, consolidate_metadata = getargs(
437 'builder', 'link_data', 'exhaust_dci', 'export_source', 'consolidate_metadata', kwargs
438 )
439 for name, gbldr in f_builder.groups.items():
--> 440 self.write_group(
441 parent=self.__file,
442 builder=gbldr,
443 link_data=link_data,
444 exhaust_dci=exhaust_dci,
445 export_source=export_source,
446 )
447 for name, dbldr in f_builder.datasets.items():
448 self.write_dataset(
449 parent=self.__file,
450 builder=dbldr,
(...)
453 export_source=export_source,
454 )

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf\utils.py:668, in docval..dec..func_call(*args, **kwargs)
666 def func_call(*args, **kwargs):
667 pargs = _check_args(args, kwargs)
--> 668 return func(args[0], **pargs)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf_zarr\backend.py:527, in ZarrIO.write_group(self, **kwargs)
525 if subgroups:
526 for subgroup_name, sub_builder in subgroups.items():
--> 527 self.write_group(
528 parent=group,
529 builder=sub_builder,
530 link_data=link_data,
531 exhaust_dci=exhaust_dci,
532 )
534 datasets = builder.datasets
535 if datasets:

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf\utils.py:668, in docval..dec..func_call(*args, **kwargs)
666 def func_call(*args, **kwargs):
667 pargs = _check_args(args, kwargs)
--> 668 return func(args[0], **pargs)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\hdmf_zarr\backend.py:522, in ZarrIO.write_group(self, **kwargs)
520 group = parent[builder.name]
521 else:
--> 522 group = parent.require_group(builder.name)
524 subgroups = builder.groups
525 if subgroups:

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\zarr\hierarchy.py:1025, in Group.require_group(self, name, overwrite)
1000 def require_group(self, name, overwrite=False):
1001 """Obtain a sub-group, creating one if it doesn't exist.
1002
1003 Parameters
(...)
1022
1023 """
-> 1025 return self._write_op(self._require_group_nosync, name, overwrite=overwrite)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\zarr\hierarchy.py:952, in Group._write_op(self, f, *args, **kwargs)
949 lock = self._synchronizer[group_meta_key]
951 with lock:
--> 952 return f(*args, **kwargs)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\zarr\hierarchy.py:1032, in Group._require_group_nosync(self, name, overwrite)
1030 # create terminal group if necessary
1031 if not contains_group(self._store, path):
-> 1032 init_group(
1033 store=self._store, path=path, chunk_store=self._chunk_store, overwrite=overwrite
1034 )
1036 return Group(
1037 self._store,
1038 path=path,
(...)
1043 zarr_version=self._version,
1044 )

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\zarr\storage.py:677, in init_group(store, overwrite, path, chunk_store)
674 store["zarr.json"] = store._metadata_class.encode_hierarchy_metadata(None) # type: ignore
676 # initialise metadata
--> 677 _init_group_metadata(store=store, overwrite=overwrite, path=path, chunk_store=chunk_store)
679 if store_version == 3:
680 # TODO: Should initializing a v3 group also create a corresponding
681 # empty folder under data/root/? I think probably not until there
682 # is actual data written there.
683 pass

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\zarr\storage.py:738, in _init_group_metadata(store, overwrite, path, chunk_store)
736 key = _prefix_to_group_key(store, _path_to_prefix(path))
737 if hasattr(store, "_metadata_class"):
--> 738 store[key] = store._metadata_class.encode_group_metadata(meta)
739 else:
740 store[key] = encode_group_metadata(meta)

File c:\Users\heberto\miniconda3\envs
euro\Lib\site-packages\zarr\storage.py:1139, in DirectoryStore.setitem(self, key, value)
1137 except OSError as e:
1138 if e.errno != errno.EEXIST:
-> 1139 raise KeyError(key)
1141 # write to temporary file
1142 # note we're not using tempfile.NamedTemporaryFile to avoid restrictive file permissions
1143 temp_name = file_name + "." + uuid.uuid4().hex + ".partial"

KeyError: 'acquisition/Name: name/.zgroup'"
}

Whereas this works prefectly:

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.image import TimeSeries


nwbfile = mock_NWBFile()
time_series = TimeSeries(name="Name name", rate=1.0, data=[1, 2, 3], unit="unit")
nwbfile.add_acquisition(time_series)

from hdmf_zarr.nwb import NWBZarrIO
nwbfile_path = "test.nwb"

with NWBZarrIO(path=nwbfile_path, mode="w") as io:
    io.write(nwbfile)
@oruebel
Copy link
Contributor

oruebel commented Sep 11, 2024

Thanks for reporting this issue. Unfortunately I don't think there is much we can do about this. ":" is not allowed in file names on Windows, and since Zarr translates the TimeSeries to a folder on disk, the name must be a valid name.

@h-mayorquin
Copy link
Author

Perhaps we can make user experience better?

A similar error is hdf5 limitation about slashes in names:

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.image import TimeSeries


nwbfile = mock_NWBFile()
time_series = TimeSeries(name="Name/name", rate=1.0, data=[1, 2, 3], unit="unit")
nwbfile.add_acquisition(time_series)

nwbfile_path = "test.nwb"

from pynwb import HDMFIO

with HDMFIO(path=nwbfile_path, mode="w") as io:
    io.write(nwbfile)

Which outputs a value error:

ValueError: name 'Name/name' cannot contain '/'

Perhaps we could have something similar?

More deeply this seems like a leaky abstraction where an implementation detail (the naming scheme in the backend) is leaked to something that the user can control directly. It is an interesting coupling.

@oruebel
Copy link
Contributor

oruebel commented Sep 11, 2024

Perhaps we can make user experience better?

A similar error is hdf5 limitation about slashes in names:

Adding an error check in ZarrIO to improve reporting sounds reasonable. We would need to add a check for both groups and datasets, so we'd probably need a helper function to check if a path-name is permitted.

More deeply this seems like a leaky abstraction where an implementation detail (the naming scheme in the backend) is leaked to something that the user can control directly. It is an interesting coupling.

There are certain limitations of storage backends that are tricky to work around. One approach could be to have docval validate names when creating new objects (i.e., before we call write) to prevent creation of bad names.

@h-mayorquin
Copy link
Author

Adding an error check in ZarrIO to improve reporting sounds reasonable. We would need to add a check for both groups and datasets, so we'd probably need a helper function to check if a path-name is permitted.

Make sense.

Thinking about this, this raises other questions:

  1. Should we disallow the use of ":" in names for other OSes as well (not just windows)? Currently, you can build a zarr file in Linux that can't be moved to Windows, right? that seems like undesirable.
  2. Should it be disallowed for other backends? Otherwise, you can build an NWBFile with hdf5 backend that can't be repacked to zarr.

@oruebel
Copy link
Contributor

oruebel commented Sep 11, 2024

Thinking about this, this raises other questions:

1. Should we disallow the use of ":" in names for other OSes as well (not just windows)? Currently,  you can build a zarr file in Linux that can't be moved to Windows, right? that seems like undesirable.

2. Should it be disallowed for other backends? Otherwise, you can build an NWBFile with hdf5 backend that can't be repacked to zarr.

@rly @bendichter thoughts on this?

@bendichter
Copy link
Contributor

I think for 1 yes, we should prohibit ":" on other operating systems, as I would argue cross-OS compatibility is a core feature of NWB.

For 2, I think maybe not. I don't think interoperability between backends is necessarily a core feature, though perhaps we should put this as an NWB Best Practice.

there's another question that comes up for me:
3. Should we disallow extensions that have ":" in the name? I think the answer is that ":" should be disallowed from pinned names in extensions, because I think we would like these extensions to be compatible with all the supported backends

@rly
Copy link
Contributor

rly commented Sep 17, 2024

I agree with @bendichter on 1 and 3 and am ambivalent on 2. Allowing but discouraging people from putting ":" in the name feels a little in conflict with preventing people from requiring the name to have a ":". I understand the reason.

A simpler rule is to just not allow ":" in the name for all backends moving forward. I would say most, if not all, people do not care much about the names of their NWB objects.

@h-mayorquin
Copy link
Author

I think that one advantage of disallowing :, / in general is that it would be easier to maintain and communicate. Just a plain rule instead of diffs and code spread across the two backends.

@bendichter
Copy link
Contributor

My thoughts on (2): I'm thinking of a use-case where the user wants to write HDF5-based NWB files. They want to use object names that reflect their source data structure in some specific way that includes the : character. Then we say they can't and they ask why and we say "because it won't work for Zarr." They might be frustrated by that answer because they don't care about using Zarr anyway. I would consider cross-language and cross-OS support to be core features of NWB, so it makes sense to constrain for those, but cross-backend support doesn't seem as essential.

On the other hand, we may down the line decide it's better to store all NWB data on DANDI as Zarr objects, and it might then be much nicer if we already had constraints in place that allowed us to do that more easily.

I see some advantages and disadvantages. I tend to err on the side of freedom, since you can never fully anticipate every use-case and you are potentially causing a big headache for someone by making constraints you don't need to.

@mavaylon1 mavaylon1 added category: enhancement improvements of code or code behavior priority: medium non-critical problem and/or affecting only a small set of users labels Oct 17, 2024
@mavaylon1 mavaylon1 added this to the Future milestone Oct 17, 2024
@rly
Copy link
Contributor

rly commented Nov 7, 2024

We discussed this at our issue-a-thon today and 4 out of 4 voted for strictness: raise an error when trying to create an AbstractContainer with a ':' in the name, but allow reads of the object. While we understand erring on the side of freedom, 1) we do not see particularly strong use cases for this, 2) allowing this could create edge cases and frustrated users who encounter an HDF5 file that they cannot export to Zarr, and 3) the rule is easier to maintain and communicate. If strictness becomes an issue, we can revert this to a warning.

@rly
Copy link
Contributor

rly commented Nov 7, 2024

This should be implemented in HDMF in a major release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: enhancement improvements of code or code behavior priority: medium non-critical problem and/or affecting only a small set of users
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants