add support for fixed width UTF8 strings - #270 #278

Merged
merged 3 commits into from
Nov 3, 2023

jreadey
Member

@jreadey jreadey commented Oct 27, 2023

Update for UTF8 fixed width strings

@mattjala
Contributor

mattjala commented Oct 30, 2023

This doesn't seem to work with UTF-8 strings that are sent as binary from the REST VOL. I'll try to get a test set up for this case in python.

@mattjala
Contributor

It seems that HSDS treats the length-in-characters of the string as its datatype size. Then when a binary request comes in, the length-in-bytes of the same string is seen as being too large for the datatype (if one of the UTF-8 characters is multi-byte). HDF5 considers the size field to be length-in-bytes, so HSDS should probably do the same.
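The mismatch between the two lengths can be seen with plain Python. This is only an illustration with a made-up sample value; `declared_size` is a hypothetical stand-in for a datatype size derived from the character count and is not an HSDS name.

```python
# Hypothetical sample value: "héllo", written with an escape so the
# code point is unambiguous (U+00E9 takes two bytes in UTF-8).
s = "h\u00e9llo"

char_len = len(s)                  # length in characters (code points)
byte_len = len(s.encode("utf-8"))  # length in bytes, the unit HDF5 uses for size

# Assumption for illustration: the datatype size was taken from the
# character count, so the encoded bytes no longer fit.
declared_size = char_len
fits = byte_len <= declared_size

print(char_len, byte_len, fits)
```

Sizing the type by `byte_len` instead would make the binary payload fit, which matches the HDF5 convention described above.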

@mattjala
Contributor

Handling requests that write fixed-length UTF8 strings as binary instead of JSON is problematic because of how numpy stores unicode strings.

When a client makes a binary write request, HSDS attempts to read the binary buffer into a numpy array with np.fromstring(), using a numpy datatype constructed by createDataType(). For a fixed-length UTF8 string datatype, the constructed numpy datatype is <UXX, where XX is the length of the string datatype. Numpy stores these strings as UTF-32, where each character is always four bytes, so it expects the buffer given to np.fromstring() to be (about) four times larger than the string's UTF8 encoding in bytes. The sizes don't match, which prevents the call from succeeding.
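A minimal sketch of the size mismatch, assuming a six-character fixed-length type. It uses np.frombuffer() rather than np.fromstring(), since the binary mode of np.fromstring() is deprecated in current numpy releases. The dtype is built by hand here and only stands in for what createDataType() is described as returning.

```python
import numpy as np

n_chars = 6  # hypothetical fixed string length
dt = np.dtype(f"<U{n_chars}")

# numpy reserves 4 bytes per character for a unicode dtype.
itemsize = dt.itemsize

# The client sends the UTF-8 bytes, which are much shorter than that.
raw = "h\u00e9llo".encode("utf-8")

try:
    arr = np.frombuffer(raw, dtype=dt)
    error = None
except ValueError as exc:
    # The buffer length is not a multiple of the element size.
    error = exc

print(itemsize, len(raw), type(error).__name__)
```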

Re-encoding the given UTF-8 binary as UTF-32 doesn't preserve the size, so the fixed-length UTF-8 strings no longer have a uniform length in bytes. This prevents np.fromstring() from being used to parse the strings into elements of a single fixed-length datatype.
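To illustrate the loss of uniform width, the sketch below transcodes two made-up values that occupy the same number of UTF-8 bytes. The values are hypothetical and chosen only so that one of them contains a multi-byte character.

```python
# Two hypothetical values that are both exactly 6 bytes in UTF-8.
values = ["abcdef", "h\u00e9llo"]

utf8 = [v.encode("utf-8") for v in values]

# Transcode each element to UTF-32 (little-endian, no byte-order mark).
utf32 = [b.decode("utf-8").encode("utf-32-le") for b in utf8]

utf8_sizes = [len(b) for b in utf8]
utf32_sizes = [len(b) for b in utf32]

# Equal widths going in, unequal widths coming out.
print(utf8_sizes, utf32_sizes)
```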

Creating a numpy unicode string datatype with a size that is one fourth the byte length of the client's UTF-8 bytestring (so that numpy's internal datatype size matches the bytestring's actual size) allows the np.fromstring() call to complete, but the resulting numpy array contains malformed UTF-32 strings and throws an error whenever you attempt to access an element.
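A rough sketch of why the quarter-size approach yields malformed strings, using a made-up 8-byte ASCII payload and np.frombuffer() in place of np.fromstring(). Rather than accessing an element, it reinterprets the same bytes as 32-bit integers to show the values numpy would treat as code points. This is an illustration, not the code path HSDS takes.

```python
import numpy as np

raw = b"abcdefgh"  # hypothetical payload: 8 bytes of UTF-8 (ASCII) text

# Size the unicode dtype so its itemsize equals the payload length.
dt = np.dtype(f"<U{len(raw) // 4}")
arr = np.frombuffer(raw, dtype=dt)  # parses without complaint

# View the same bytes as the 32-bit values numpy reads as code points.
code_points = np.frombuffer(raw, dtype="<u4")

# Unicode ends at U+10FFFF; every group of four text bytes lands far above it.
invalid = [int(cp) > 0x10FFFF for cp in code_points]

print(arr.shape, [hex(int(cp)) for cp in code_points], invalid)
```

Each group of four UTF-8 bytes is read as a single UTF-32 code unit, so ordinary text maps to values outside the valid Unicode range.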

This doesn't come up when writing the strings as JSON, since moving the data into the correct shape is handled by jsonToArray in that case.

I'll create a PR with tests to illustrate this issue, though I'm not sure how to resolve it at the moment.

@jreadey
Member Author

jreadey commented Nov 2, 2023

I've added @mattjala's binary request tests and fixed some issues with UTF8 encoding...

@mattjala
Contributor

mattjala commented Nov 2, 2023

Running these tests with a fresh environment caused them to pass. It seems that one of my dependencies was outdated, and that was changing the specifics of the encoding. Once the attribute binary test is in, this should be good to merge.

@jreadey
Member Author

jreadey commented Nov 2, 2023

@mattjala - take a look at the revised PR!

Contributor

@mattjala mattjala left a comment

LGTM. Tests pass on my machine as well, and it works with the REST VOL's writes.

@jreadey jreadey merged commit b0446f1 into master Nov 3, 2023
10 checks passed
@jreadey jreadey deleted the fixed_utf8 branch November 3, 2023 15:28
2 participants