Node name character set #56
The main problem you will hit, in my opinion, is Unicode normalisation, and I'm not sure about Unicode versions as new characters get added. Thinking out loud: we can't really ignore codepoints we don't know, because if a new codepoint turns out to be combining (surrogates?) it may change the meaning of the subsequent bytes depending on the encoding. You might also kill performance with some implementations, as parsing Unicode and deciding the length, or whether to split, gets expensive. If anything relies on sorting as well, then sort order may become version-dependent. For FFI you may want to actually normalize to a given encoding. On the UI side, you may get rendering and homoglyph/heteroglyph confusion, plus input issues. Web browsers are also really bad at respecting Unicode combining, typically Chrome (depending on the version). That said, I would not worry too much about the ability of the filesystem/KV store to handle arbitrary Unicode as long as you decouple the storage spec from the querying spec. For a file store we could decide to store the keys/path-parts as base64 in the worst case, and then casing/Unicode should not be a compatibility issue, though it might be a performance one. I would err toward more restrictive (at most, stay within the Basic Multilingual Plane); it's always easier to relax this later.
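A minimal sketch of the base64 fallback mentioned above, assuming hypothetical `encode_part`/`decode_part` helpers (the real key scheme would be defined by the storage spec):

```python
import base64

def encode_part(part: str) -> str:
    # URL-safe base64 of the UTF-8 bytes, so the underlying file system
    # never has to interpret arbitrary Unicode in the key itself.
    return base64.urlsafe_b64encode(part.encode("utf-8")).decode("ascii")

def decode_part(encoded: str) -> str:
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")

# Round-trips regardless of the file system's casing/Unicode behaviour.
assert decode_part(encode_part("температура")) == "температура"
```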
Another point to consider: in URLs, hostnames are usually encoded using Punycode. I don't know if that's of any concern or relevance here.
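For reference, Python's built-in `idna` codec shows the Punycode-based encoding in action (an illustration only; nothing in the spec currently relies on it):

```python
# IDNA maps internationalized hostnames to an ASCII-compatible form,
# using Punycode for each label.
print("bücher.example".encode("idna"))          # b'xn--bcher-kva.example'
print(b"xn--bcher-kva.example".decode("idna"))  # 'bücher.example'
```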
Crosslinking #149 (comment)
Where have we landed on this? Are we going to expand the set of allowed characters in node names? Over in #149, @jbms said
I agree as well. I think we should go back to a broader range of allowed characters. Perhaps it would be more useful to define what characters are explicitly forbidden. For example, is whitespace allowed?
I think it would be fine to allow, subject to the limitations of the underlying store, arbitrary Unicode code point sequences, but reserve "/" for the hierarchy and reserve a leading underscore after the last "/" for zarr's own usage (all metadata keys should use a leading underscore). We can suggest a subset of the full namespace that will have better compatibility across stores.
Unicode normalization was mentioned at some point in the discussion, but I'm not seeing that reflected in PR #196. The Python language also limits the characters allowed in Python identifiers to a limited set of Unicode categories. For instance, it does not allow control characters (Cc) or format characters (Cf). Oddly, to me anyway, it also doesn't allow most of the punctuation categories. NetCDF requires Unicode normalization for identifiers/names (and enforces it, I believe; @DennisHeimbigner?) but does not limit by Unicode categories. I suspect it should limit by categories, but I'm not sure what real problems that would avoid. (Sorry if I've missed some of this already being addressed. My lurking here is pretty sporadic.)
Thanks for raising this point @ethanrd. Disallowing control and format characters seems fine to me, and Unicode normalization is a good idea too, IMO. I'd propose using Unicode Normalization Form C (NFC), requiring implementations to normalize any user input and to interact with the storage only via normalized Unicode.
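As a sketch of what that could look like in an implementation (the `normalize_name` helper is hypothetical):

```python
import unicodedata

def normalize_name(user_supplied_name: str) -> str:
    # Normalize at the API boundary; everything past this point,
    # including storage key construction, sees only NFC-normalized names.
    return unicodedata.normalize("NFC", user_supplied_name)
```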
Is there an example of a use case where Unicode normalization is important? Requiring Unicode normalization means applications must contain Unicode data tables. TensorStore, for instance, does not currently depend on Unicode data tables, so this would require adding them. Given that we decided to make case sensitivity store-dependent, it seems reasonable to also leave Unicode normalization store-dependent unless there is a compelling use case.
True, that would require additional complexity, so I would rather not make this required. Since #196 already proposes a safe subset, I feel that normalization or limiting the character set does not need to be enforced at the spec level. From #196:
Simply typing a string in different editors or on different keyboards might, in rare cases, result in two different Unicode representations, even if they look alike. Therefore Unicode normalization would be nice to have if we want to encourage Unicode usage, but I think we're fine with the recommendation above. We could additionally recommend forbidding Cf/Cc characters and also recommend normalization, which seems easy in languages with built-in Unicode table support. cc @ethanrd @DennisHeimbigner @jbms
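To make the "look alike but differ" case concrete, here is the classic precomposed vs. decomposed accent:

```python
import unicodedata

precomposed = "caf\u00e9"  # 'café' with U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # 'café' as 'e' + U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False: same appearance, different code points
print(unicodedata.normalize("NFC", precomposed)
      == unicodedata.normalize("NFC", decomposed))  # True after NFC
```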
As I understand it, the goal of this proposal is to move beyond the current narrow set of allowed characters. I do not think alternate encodings for the same characters are rare. It sounds like the existence of alternate encodings was for legacy and backward-compatibility reasons (see the Precomposed character Wikipedia page, second paragraph). One of the reasons was a clean mapping to/from ISO Latin-1 (see this Unicode email), which, I believe, could make this fairly common. I'm not sure I understand the argument against requiring Unicode normalization because it would require a Unicode library. Wouldn't a dependency on Unicode libraries fall on the Zarr libraries rather than on the applications or underlying storage libraries that are using or being used by the Zarr libraries? Also (I don't know for sure, but) I suspect that the Unicode character category limitations used for Python language identifiers correspond in some way to the lower case (…)
I can see that without normalization, it may be quite difficult and error-prone to manually type node names with non-ASCII characters, since depending on the operating system and input configuration, the representation may differ. In practice, it would probably be necessary either to normalize in the application when writing and reading, or to rely on listing and then do normalization-aware matching on the list results. However, in addition to requiring extra data tables, there are other potential issues:
Yes, I do mean that it imposes the dependency on zarr libraries, not directly on the application using the zarr library. For tensorstore this added dependency would not be that big of a deal. I could imagine that for a zarr implementation intended for an embedded device, it may be a larger issue.
In zarr v2 I don't believe there are currently any restrictions on node names, nor is any normalization performed.
At least the v2 spec specifies keys to be ASCII strings:
👍 (we might even assume a store that is not able to list directories, which makes normalized matching impossible).
Yes, but we'd still recommend using the limited set due to incompatibilities between underlying stores. This would not be an enforced restriction, though. I think there are three options, and I'll try to add arguments for all of them here:
I tend towards 1, making Unicode normalization required by writing and reading normalized keys, but I have no strong opinion on this. @rabernat @Carreau @alimanfoo more opinions/arguments? I'd like to settle this soon.
There is one other issue with normalization that occurred to me: Suppose a zarr implementation supports opening a zarr array by a local filesystem path:
Let's say that … Thus we are back to needing to know the root, even without storage transformers. Or maybe we should treat Unicode normalization as a storage transformer, which means we would need to use a two-part URL:
Actually, I realized there is the same issue with the proposal to escape group member names that start with an underscore.
That would be great IMO!
In practice, I think the reason to have normalization is for name equality testing.
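A sketch of such normalization-aware equality testing against a store listing (the `find_child` helper is hypothetical, and this assumes the store can list names at all):

```python
import unicodedata

def find_child(requested: str, listed_names: list[str]) -> str | None:
    # Match under NFC so a name typed on one platform finds the
    # byte-different but canonically equivalent name written by another.
    want = unicodedata.normalize("NFC", requested)
    for name in listed_names:
        if unicodedata.normalize("NFC", name) == want:
            return name  # return the store's own spelling of the name
    return None
```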
I think the multiple-language support provided by Unicode is important enough that it should have the visibility of being in the core. On the other hand, given all the issues that have come up, I would now lean towards recommending rather than requiring Unicode support. Perhaps just some text after the …
Of course, it's the details beyond this that get complicated. Thoughts? Note: the Python PEP 3131 "Supporting Non-ASCII Identifiers" does a good job explaining the motivation for supporting Unicode and the decisions they made on limiting the allowed characters. The Unicode Standard Annex #31 "Unicode Identifier and Pattern Syntax" provides a recommended default for defining identifiers; it includes some reasoning behind the default, as well as reasons and cautions for altering it.
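Python exposes the PEP 3131 rules directly via `str.isidentifier()`, which illustrates the category-based limits mentioned above:

```python
# str.isidentifier() follows PEP 3131 (XID_Start/XID_Continue), which
# excludes control (Cc), format (Cf), and most punctuation categories.
print("café".isidentifier())      # True:  letters are allowed
print("a\u200bb".isidentifier())  # False: U+200B ZERO WIDTH SPACE is Cf
print("a-b".isidentifier())       # False: '-' is punctuation (Pd)
```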
I think we haven't fully ironed out the "extension" vs "core" part, but I think the more relevant distinction here is "always enabled" vs "opt in via storage transformer". Even if it is a storage transformer, it could still be supported by all or most implementations. Logically, Unicode normalization as a feature does behave as a storage transformer that impacts path resolution, and that has implications for how the array needs to be accessed. We decided to eliminate the concept of an explicit root, in order to allow the zarr v2-like behavior of being able to access an array directly via its path, as long as there are no group-level storage transformers in place. For example, let's say we have:
If … With Unicode normalization, we technically could access the array directly as … As for how to describe Unicode normalization in the spec: for interoperability between implementations, it is critical to precisely specify the normalization form to use.
Hi @jbms - Yes, I'll just mention that I've been following Zarr for some time but I'm not deeply familiar with the details of the spec or implementations (other than netCDF at a non-Zarr level). So I don't really understand how Zarr extensions and storage transformers fit into the spec or into the various implementations. In my last comment, I was suggesting a recommendation in the core because, to me, an extension sounds less visible than a mention in the core. Unicode support seems important enough that visibility is important, whether optional or not. I agree it is important to provide precise details about which normalization scheme to use and, possibly, any limits on the allowed characters. The Unicode Standard Annex #31 I mentioned in my earlier comment provides a recommended default syntax for the definition of identifiers that is probably a good starting point. Anything beyond that is well past my current understanding of Unicode. Reviewing the various Unicode documents has just reaffirmed for me that Unicode can get complicated very quickly.
I just noticed what I believe is a discrepancy in the Node Names section of the v3 spec. The third paragraph says
Then the fourth paragraph starts with
My understanding is that case folding is similar to (but more complicated than) converting each character in a string to lowercase. So it would result in the opposite of the case-sensitive nature of node names described in paragraph three. I think case sensitivity is the intent, and so I suggest that "case-folded" be removed from paragraph four. Here's a W3C reference for case folding (and case mapping): https://www.w3.org/TR/charmod-norm/#definitionCaseFolding
The first paragraph means implementations must be case-sensitive and case-preserving; the second is just a recommendation to users about how to name their nodes. So I don't think there is a discrepancy here.
Perhaps not a discrepancy. However, if node names are case-sensitive, I don't think it makes sense to recommend that users use case folding, which changes the case of certain characters (for instance, case folding would change "ABCdefG" to "abcdefg"; see the sketch below). If node names are indeed case-sensitive, users should be able to use the case that is appropriate for each letter in the name they have decided to use. Also, I don't think users should have to understand NFKC or case folding. Those developing implementations need to understand them and decide how and when to use them, but not users.
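A quick sketch of what the recommended NFKC plus case folding would actually do to a user's chosen name, illustrating the tension with case-sensitive lookups:

```python
import unicodedata

def nfkc_casefold(name: str) -> str:
    # Apply NFKC normalization, then Unicode case folding.
    return unicodedata.normalize("NFKC", name).casefold()

print(nfkc_casefold("ABCdefG"))  # 'abcdefg': the chosen casing is lost
print("ABCdefG" == "abcdefg")    # False under case-sensitive comparison
```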
Currently the core protocol v3.0 draft includes a section on node names which defines the set of characters that may be used to construct the name of a node (array or group).
The set of allowed characters is currently extremely narrow, being only the union of `a-z`, `A-Z`, `0-9`, and `-_.`. There is no support for non-Latin characters, which obviously creates a barrier for many people. I'd like to relax this and allow any Unicode characters. Could we do this, and if we did, what problems would we need to anticipate and address in the core protocol spec?

Some points to bear in mind for this discussion:
Currently in the core protocol, node names (e.g., "bar") are used to form node paths (e.g., "/foo/bar"), which are then used to form storage keys for metadata documents (e.g., "meta/root/foo/bar.array") and data chunks (e.g., "data/foo/bar/0.0"); a sketch of this mapping follows after these points. These storage keys may then be handled by a variety of different store types, including file system stores where storage keys are translated into file paths, cloud object stores where storage keys are translated into object keys, etc.
Different store types will have different abilities to support the full Unicode character set for node names. For example, although most file systems support Unicode file names, there are still reserved characters and words which cannot be used, which differ between operating systems and file system types. However, these constraints might not apply at all to other store types, such as cloud object stores. In general, other store types may have different constraints. Do we need to anticipate any of this, or can we delegate these issues to be dealt with in the different store specs?
One last thought: whatever we decide, the set of allowed characters should probably be defined with respect to some standard character set, e.g., Unicode. That is, we should probably reference the appropriate standard when discussing which characters are allowed.
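A sketch of the key derivation described in the first point above (the helper names are hypothetical; the actual key layout is defined by the spec):

```python
def meta_key(node_path: str) -> str:
    # "/foo/bar" -> "meta/root/foo/bar.array" (for an array node)
    return "meta/root" + node_path + ".array"

def chunk_key(node_path: str, chunk_coords: tuple[int, ...]) -> str:
    # "/foo/bar", (0, 0) -> "data/foo/bar/0.0"
    return "data" + node_path + "/" + ".".join(map(str, chunk_coords))

print(meta_key("/foo/bar"))           # meta/root/foo/bar.array
print(chunk_key("/foo/bar", (0, 0)))  # data/foo/bar/0.0
```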