Proposal for User-Defined Data Types #57

BenBrock · 2024-10-14T19:06:56Z

Problem Description

During our meeting today we discussed two options for supporting user-defined types: a bytes[n] type, which would allow users to create arrays whose elements are arbitrarily sized collections of bytes, and more robust support for user-defined types that would be backed by a third-party library like HDF5 or cffi. The bytes[n] solution, while simple, puts the onus on the user to handle portability across platforms with different endianness, padding, or alignment. True cross-platform support for user-defined types will likely require users to declare the layouts of their custom data types.

We settled on exploring a declarative JSON description for user-defined data types. This JSON description must have a one-to-one correspondence with user-defined data types as supported by libraries like HDF5 or cffi. The idea is that implementations would then be free to store the user-defined data type using the mechanism supported by the binary container.

Strawman Example

{
  "binsparse": {
    "version":      "0.1",
    "format":       "COO",
    "shape":        [428440, 896308],
    "number_of_stored_values":      3782463,
    "data_types":   {
      "values":       "my_cool_struct",
      "indices_0":    "uint32",
      "indices_1":    "uint32"
    },
    "custom_types": {
      "my_cool_struct": [
        "float", "int32", "uint32", "bint8"
      ]
    }
  }
}

Here we have one custom type.

This would correspond to a C struct with four members of type float, int32_t, uint32_t, and uint8_t. The struct might look like this:

typedef struct {
  float v1;
  int32_t v2;
  uint32_t v3;
  uint8_t v4;
} my_cool_struct;

On my system, this struct has a size of 16 bytes. Of course, the padding and alignment of this struct are implementation-defined in C. For the struct to be usable in code, it must be declared, and each member must have a name. I imagine that in HDF5 these names would be provided by the user upon registration of the custom data type.
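To see where the size comes from on a given platform, one can ask the compiler directly. The following is a minimal sketch, assuming a C99 or later compiler and a 4-byte float; the helper names payload_bytes, padding_bytes, and print_layout are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
  float v1;
  int32_t v2;
  uint32_t v3;
  uint8_t v4;
} my_cool_struct;

/* Bytes occupied by the members themselves, ignoring any padding. */
size_t payload_bytes(void) {
  return sizeof(float) + sizeof(int32_t) + sizeof(uint32_t) + sizeof(uint8_t);
}

/* Bytes the compiler added for alignment, between members or at the end. */
size_t padding_bytes(void) {
  return sizeof(my_cool_struct) - payload_bytes();
}

/* Print the layout chosen by the current compiler and ABI. */
void print_layout(void) {
  printf("v1 at %zu, v2 at %zu, v3 at %zu, v4 at %zu\n",
         offsetof(my_cool_struct, v1), offsetof(my_cool_struct, v2),
         offsetof(my_cool_struct, v3), offsetof(my_cool_struct, v4));
  printf("sizeof = %zu, payload = %zu, padding = %zu\n",
         sizeof(my_cool_struct), payload_bytes(), padding_bytes());
}
```

On a platform where the struct is 16 bytes and float is 4 bytes, the members account for 13 bytes and padding_bytes() would return 3. A different ABI could report something else, which is the portability concern raised above.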

Just the HDF5 part of this code, written by the user, would look something like the following:

#include <hdf5.h>

typedef struct {
  float v1;
  int v2;
  unsigned int v3;
  unsigned char v4;
} my_cool_struct;

hid_t create_sensor_datatype() {
    hid_t datatype_id;

    // Create the compound datatype
    datatype_id = H5Tcreate(H5T_COMPOUND, sizeof(my_cool_struct));

    H5Tinsert(datatype_id, "v1", HOFFSET(my_cool_struct, v1), H5T_NATIVE_FLOAT);
    H5Tinsert(datatype_id, "v2", HOFFSET(my_cool_struct, v2), H5T_NATIVE_INT);
    H5Tinsert(datatype_id, "v3", HOFFSET(my_cool_struct, v3), H5T_NATIVE_UINT);
    H5Tinsert(datatype_id, "v4", HOFFSET(my_cool_struct, v4), H5T_NATIVE_UCHAR);

    return datatype_id;
}

I imagine the process for the user would basically be this:

1. The user registers a custom HDF5 type (with something like bsp_register_hdf5_type(my_struct_hid, "my_cool_struct")). They input an HDF5 custom type (the hid_t) as well as the new type's name.
2. Based on the registered type, the backend will create and return to the user a new ID for the bsp_type_t enum, perhaps based on hashing the name.
3. The user reads in a file that uses the newly defined custom data type. This will return the standard bsp_matrix_t, which will itself contain a values array whose type is equal to the newly created bsp_type_t. The implementation can perhaps check that the registered type matches the file's HDF5 type by looking at the types of the corresponding elements.
4. The user has their data. When they look at the values array, they can see that its type corresponds to the newly created type for my_cool_struct. They can safely cast its data pointer to a pointer to my_cool_struct.
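The registration and ID-assignment steps could be sketched roughly as follows. This is only an illustration under several assumptions: register_custom_type is a hypothetical stand-in for something like bsp_register_hdf5_type that takes the struct size instead of an hid_t (so the sketch does not depend on HDF5), the uint32_t ID stands in for a new bsp_type_t value, and CUSTOM_TYPE_BASE is an invented threshold below which IDs are assumed to be reserved for built-in types. The hash is 32-bit FNV-1a, chosen here only because it is short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: IDs below this value are assumed reserved for built-in types. */
#define CUSTOM_TYPE_BASE 0x10000u
#define MAX_CUSTOM_TYPES 64
#define MAX_NAME_LEN 64

typedef struct {
  char name[MAX_NAME_LEN];
  size_t size; /* sizeof the user's struct */
  uint32_t id; /* stand-in for a newly created bsp_type_t value */
} custom_type_entry;

static custom_type_entry registry[MAX_CUSTOM_TYPES];
static size_t registry_count = 0;

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a_32(const char *s) {
  uint32_t h = 2166136261u;
  for (; *s != '\0'; ++s) {
    h ^= (uint8_t)*s;
    h *= 16777619u;
  }
  return h;
}

static const custom_type_entry *find_by_id(uint32_t id) {
  for (size_t i = 0; i < registry_count; ++i) {
    if (registry[i].id == id) return &registry[i];
  }
  return NULL;
}

/* Register a named type. Returns its ID, or 0 on failure. Registering the
   same name again returns the same ID if the size matches, else 0. */
uint32_t register_custom_type(const char *name, size_t size) {
  if (name == NULL || size == 0 || strlen(name) >= MAX_NAME_LEN) return 0;

  for (size_t i = 0; i < registry_count; ++i) {
    if (strcmp(registry[i].name, name) == 0) {
      return registry[i].size == size ? registry[i].id : 0;
    }
  }
  if (registry_count == MAX_CUSTOM_TYPES) return 0;

  /* Derive the ID from the name's hash; probe linearly if two names collide. */
  uint32_t slot = fnv1a_32(name) & 0xFFFFu;
  while (find_by_id(CUSTOM_TYPE_BASE + slot) != NULL) {
    slot = (slot + 1) & 0xFFFFu;
  }

  custom_type_entry *e = &registry[registry_count++];
  strcpy(e->name, name);
  e->size = size;
  e->id = CUSTOM_TYPE_BASE + slot;
  return e->id;
}

/* Size registered for an ID, or 0 if the ID is unknown. */
size_t custom_type_size(uint32_t id) {
  const custom_type_entry *e = find_by_id(id);
  return e != NULL ? e->size : 0;
}
```

A caller would register with register_custom_type("my_cool_struct", sizeof(my_cool_struct)) and later compare custom_type_size(id) against the element size found in the file before casting the values pointer. One caveat of this design: when two names collide, the probed ID depends on registration order, so a hash-derived ID is not guaranteed to be stable across programs.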

Open Questions

I still have a few open questions:

1. Should we name custom data type members in the Binsparse JSON description? e.g., instead of storing an array of strings containing the data types, store a list of (name, type) pairs: [("v1", "float"), ("v2", "int32"), ("v3", "uint32"), ("v4", "bint8")].
2. Should we or could we ever attempt to read in a custom data type without a user-registered custom data type? For example, we could have a function that, given a JSON declaration, creates an HDF5 custom data type hid_t. The big challenge here is picking the offsets: we would need an algorithm whose offsets are reproducible, reliable, and match the user's offsets for this to work. There's a danger of things not working here.
3. I opted for a list of types, which I think is necessary, since a JSON dict is unordered. However, there might be some tweaks we could make to improve the syntax of custom data types.
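To make the offset question concrete, here is a minimal sketch of one candidate rule, natural alignment, where each member is aligned to its own size and the total size is rounded up to the largest member alignment. This is an assumption for illustration, not something the proposal settles on. The names type_size and compute_layout are hypothetical, only the four type names from the strawman example are handled, and bint8 is assumed to occupy one byte, as in the uint8_t member above.

```c
#include <stddef.h>
#include <string.h>

/* Size in bytes for a type name, or 0 if the name is unknown.
   Assumption: only the names from the strawman example are covered. */
static size_t type_size(const char *name) {
  if (strcmp(name, "float") == 0) return 4;
  if (strcmp(name, "int32") == 0) return 4;
  if (strcmp(name, "uint32") == 0) return 4;
  if (strcmp(name, "bint8") == 0) return 1;
  return 0;
}

/* Fill offsets[i] for each of the n member types using natural alignment.
   Returns the total struct size, or 0 if any type name is unknown. */
size_t compute_layout(const char *const *types, size_t n, size_t *offsets) {
  size_t offset = 0;
  size_t max_align = 1;

  for (size_t i = 0; i < n; ++i) {
    size_t size = type_size(types[i]);
    if (size == 0) return 0;

    size_t align = size; /* natural alignment: align to the member's size */
    offset = (offset + align - 1) / align * align; /* round up to alignment */
    offsets[i] = offset;
    offset += size;
    if (align > max_align) max_align = align;
  }

  /* Pad the end so that arrays of the struct keep every element aligned. */
  return (offset + max_align - 1) / max_align * max_align;
}
```

For the list ["float", "int32", "uint32", "bint8"], this rule gives offsets 0, 4, 8, and 12 with a total size of 16, consistent with the 16-byte size reported above. Whether it matches every compiler and ABI a user might build with is exactly the risk described in the second question.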
