-
-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for fixed width UTF8 strings - #270 #278
Conversation
This doesn't seem to work with UTF-8 strings that are sent as binary from the REST VOL. I'll try to get a test set up for this case in python. |
It seems that HSDS treats the length-in-characters of the string as its datatype size. Then when a binary request comes in, the length-in-bytes of the same string is seen as being too large for the datatype (if one of the UTF-8 characters is multi-byte). HDF5 considers the |
Handling requests to write fixed-length UTF8 strings in binary instead of JSON is problematic with how numpy stores unicode strings. When a client makes a binary write request, HSDS attempts to read the binary buffer into a numpy array with Encoding the given UTF-8 binary to UTF-32 doesn't preserve the size, so the fixed length utf8 strings will no longer have a uniform length in bytes. This prevents Creating a numpy unicode string datatype with a size that is one fourth the byte-length of the client's UTF-8 bytestring (so that numpy's internal datataype size matches the bytestring's actual size) allows the This doesn't come up when writing the strings as JSON, since moving the data into the correct shape is handled by I'll create a PR with tests to illustrate this issue, though I'm not sure how to resolve it at the moment. |
I've added @mattjala binary request tests and fixed some issues with UTF8 encoding... |
Running these tests with a fresh environment caused them to pass. It seems that one of my dependencies was outdated, and that was changing the specifics of the encoding. Once the attribute binary test is in, this should be good to merge. |
@mattjala - take a look at the revised PR! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Tests pass on my machine as well, and it works with the REST VOL's writes.
Update for UTF8 fixed width strings