add support for fixed width UTF8 strings - #270 #278

jreadey · 2023-10-27T23:08:36Z

Update for UTF8 fixed width strings

mattjala · 2023-10-30T16:22:17Z

This doesn't seem to work with UTF-8 strings that are sent as binary from the REST VOL. I'll try to get a test set up for this case in python.

mattjala · 2023-10-30T17:49:55Z

It seems that HSDS treats the length-in-characters of the string as its datatype size. Then when a binary request comes in, the length-in-bytes of the same string is seen as being too large for the datatype (if one of the UTF-8 characters is multi-byte). HDF5 considers the size field to be length-in-bytes, so HSDS should probably do the same.
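A minimal sketch of the mismatch, assuming a hypothetical sample string (not one taken from the tests). It only compares the two ways of measuring the same value:

```python
# Hypothetical sample value; "\u00ef" is a single character that
# takes two bytes in UTF-8.
text = "na\u00efve"

char_len = len(text)                  # length in characters (code points)
byte_len = len(text.encode("utf-8"))  # length in bytes, which is what HDF5 stores as the size

print(char_len, byte_len)
```

A datatype sized by `char_len` would be one byte too small for the binary form of this value.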

mattjala · 2023-10-31T22:34:00Z

Handling requests to write fixed-length UTF-8 strings in binary instead of JSON is problematic because of how numpy stores unicode strings.

When a client makes a binary write request, HSDS attempts to read the binary buffer into a numpy array with np.fromstring(), using a numpy datatype that is constructed with createDataType(). In the case of a fixed-length UTF-8 string datatype, the constructed numpy datatype is <UXX, where XX is the length of the string datatype. Numpy stores unicode strings as UTF-32, where each character is always four bytes, so it expects the buffer given to np.fromstring() to be (about) four times larger than the string's UTF-8 encoding actually is, and the call fails.
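The size mismatch can be reproduced outside HSDS. This is only a sketch: it uses np.frombuffer() as a stand-in for np.fromstring() (binary-mode np.fromstring() is deprecated in recent numpy releases), and the dtype and payload are hypothetical values rather than anything produced by createDataType().

```python
import numpy as np

# Hypothetical datatype for a 5-character fixed-length string.
dt = np.dtype("<U5")
itemsize = dt.itemsize  # numpy reserves 4 bytes per character

# Hypothetical client payload: the same 5 characters encoded as UTF-8.
payload = "na\u00efve".encode("utf-8")

try:
    np.frombuffer(payload, dtype=dt)
    parsed = True
except ValueError:
    # The buffer length is not a multiple of the element size.
    parsed = False

print(itemsize, len(payload), parsed)
```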

Encoding the given UTF-8 binary to UTF-32 doesn't preserve the size, so the fixed length utf8 strings will no longer have a uniform length in bytes. This prevents np.fromstring() from being used to parse the strings into elements of a single fixed-length datatype.
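As a rough illustration of the uniformity problem, assume two hypothetical values that both occupy exactly 6 bytes in UTF-8. Re-encoding them as UTF-32 gives elements of different sizes:

```python
# Hypothetical values: both are 6 bytes long in UTF-8, but the second
# one has only 5 characters because "\u00ef" takes two bytes.
values = ["abcdef", "na\u00efve"]

utf8_sizes = [len(v.encode("utf-8")) for v in values]
# "utf-32-le" is used so that no byte-order mark is prepended.
utf32_sizes = [len(v.encode("utf-32-le")) for v in values]

print(utf8_sizes)   # equal sizes
print(utf32_sizes)  # sizes now differ
```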

Creating a numpy unicode string datatype with a size that is one fourth the byte-length of the client's UTF-8 bytestring (so that numpy's internal datatype size matches the bytestring's actual size) allows the np.fromstring() call to complete, but results in a numpy array of malformed UTF-32 strings that throws an error whenever you attempt to access an element of it.
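One way to see why the result is malformed is to look at the code points numpy would read from the raw bytes. The sketch below is an assumption-laden illustration: the 8-byte payload is hypothetical, and it views the bytes as little-endian uint32 values instead of building the unicode array itself.

```python
import numpy as np

# Hypothetical 8-byte UTF-8 payload; a "<U2" datatype would also be 8 bytes.
payload = b"abcdefgh"

# Each group of four UTF-8 bytes is what numpy would treat as one
# UTF-32 code point under the quarter-size datatype.
code_points = np.frombuffer(payload, dtype="<u4")

MAX_UNICODE = 0x10FFFF  # largest valid Unicode code point
all_invalid = bool((code_points > MAX_UNICODE).all())

print([hex(int(cp)) for cp in code_points], all_invalid)
```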

This doesn't come up when writing the strings as JSON, since moving the data into the correct shape is handled by jsonToArray in that case.

I'll create a PR with tests to illustrate this issue, though I'm not sure how to resolve it at the moment.

jreadey · 2023-11-02T03:20:00Z

I've added @mattjala's binary request tests and fixed some issues with UTF-8 encoding...

mattjala · 2023-11-02T21:26:01Z

Running these tests with a fresh environment caused them to pass. It seems that one of my dependencies was outdated, and that was changing the specifics of the encoding. Once the attribute binary test is in, this should be good to merge.

jreadey · 2023-11-02T22:55:27Z

@mattjala - take a look at the revised PR!

mattjala

LGTM. Tests pass on my machine as well, and it works with the REST VOL's writes.

add support for fixed width UTF8 strings - #270

b8ccb83

jreadey assigned mattjala Oct 27, 2023

mattjala mentioned this pull request Nov 1, 2023

Tests for binary transfer of fixed UTF8 string #283

Closed

add support for binary request of utf8 fixed width strings

292aa57

updates for fixed utf8 attribute values

f758904

mattjala approved these changes Nov 3, 2023

mattjala mentioned this pull request Nov 3, 2023

Support UTF-8 string datatype encoding HDFGroup/vol-rest#87

Merged

jreadey merged commit b0446f1 into master Nov 3, 2023
10 checks passed

jreadey deleted the fixed_utf8 branch November 3, 2023 15:28

mattjala mentioned this pull request Nov 6, 2023

Support fixed-length strings with UTF-8 character set #270

Closed
