Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set name of UDT when serializing #146

Open
eriknw opened this issue Jun 3, 2022 · 12 comments
Open

Set name of UDT when serializing #146

eriknw opened this issue Jun 3, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@eriknw
Copy link
Contributor

eriknw commented Jun 3, 2022

When using GxB_*_serialize, I would like the option to set the name of a user-defined type to be used in the output blob. This is for maximum portability.

A sufficient solution for me would be to have the ability to change the name of a UDT after it's been created. Maybe add GxB_Type_rename? This doesn't require changing the signature of the serialize functions,

p.s. What is GxB_serialize_free mentioned here?

// freed by GxB_serialize_free (which calls the ANSI C11 free if GrB_init was

@DrTimothyAldenDavis
Copy link
Owner

The reference to GxB_serialize_free is a mistake. It should just read "free(blob)", or more precisely, it should be freed by the same free method passed to GxB_Init. I'll fix the comment, and explain in more detail in the user guide.

Changing the name of a user-defined type would be fine for now, but it would destabilize a future version of GraphBLAS. I will be using that name to create GPU and CPU kernels with a jit, and if the name changes, the jitified kernel will break. So GxB_Type_rename is not a good idea.

@eriknw
Copy link
Contributor Author

eriknw commented Jun 3, 2022

Understood, thanks.

There are a few reasons why being able to set the name of the UDT when serializing would be nice for us:

  • To be able to save "instructions" for how to recreate the type in Python so things "just work"
    • For example, the deserialize process won't need to create a UDT with the correct name and type before loading
  • To be able to save a human-readable name
  • To be able to save a different human-readable name than the one used when the type was created

The last one can come up in python-graphblas if users register multiple UDTs that are actually the same underlying type but with different names, in which case we only ever register one with SuiteSparse:GraphBLAS. This was done for technical/usability reasons, so maybe we could work around those and create a new UDT for each name. Even so, there may be other reasons users would want to be able to control the serialization name of UDTs.

In the future, will UDT names need to be valid identifiers? I sure hope not. What would be the cost of changing the name in the future with JIT? Would things need to be rejitted? Why not give it an internal unique identifier for such a purpose (would type_defn work?) and let the name be user-facing and modifiable?

@DrTimothyAldenDavis
Copy link
Owner

The user-defined type names should be a valid identifier that can be used as in the string "wildtype" for this:

typedef struct
{
    float stuff [4][4] ;
    char whatstuff [64] ;
}
wildtype ;

For that type, the name should be the string 'wildtype'. I need a valid identifier to name the type if I am to create a valid kernel that uses the type internally.

So the type can't include any spaces, and can't start with a digit, for example. No semicolons, dots, or whatever.

Constructing my own internal typedef name would be really hard. I need a consistent name to communicate with the user, to return to GxB_Matrix_type for example, and to manage types in JIT'ed kernels and such.

@eriknw
Copy link
Contributor Author

eriknw commented Jun 4, 2022

I see. My bad again for not reading the docs thoroughly.

I guess a more general goal would be to allow users to include extra metadata (perhaps as a string) when serializing an object, and to be able to retrieve this extra metadata when deserializing. This would give me a place--as part of your file format--to include a description of the datatype, which can be used to create the new type in Python if necessary.

@eriknw
Copy link
Contributor Author

eriknw commented Jun 4, 2022

Requiring the name to be a valid identifier really is kind of a pain. In Python, making new UDTs is as easy as:

# Simple struct
>>> gb.dtypes.lookup_dtype({'x': int, 'y': float}).name
"{'x': INT64, 'y': FP64}"

# More complicated (datetime, fixed-length array, and fixed-length string)
>>> gb.dtypes.lookup_dtype({'x': np.datetime64, 'y': np.dtype((int, 3)), 'z': 'S5'}).name
"{'x': dtype('<M8'), 'y': INT64[3], 'z': dtype('S5')}"

This shows default names that we use. Being flexible with dtypes allows us to easily interoperate with other libraries in the Python ecosystem. It would be a burden to require users to give a name that's a valid identifier.

If we are to use the JIT with UDTs in a future version of SuiteSparse:GraphBLAS, I think we'll need to generate our own automatic names to give to SuiteSparse, such as "PythonUDT0", "PythonUDT1", etc. This seems non-ideal and makes it super-hard to rely on your serialization (and I like your serialization!).

@DrTimothyAldenDavis
Copy link
Owner

I'll keep this in mind. The first place where I'll introduce a JIT is for the CUDA kernels, for built-in types, and then user-defined types after that. The typedef, as an entire string, and a valid type identifier, will both be required:

typedef struct
{
    float stuff [4][4] ;
    char whatstuff [64] ;
}
wildtype ;
...
GrB_Type MyWildType ;
GxB_Type_new (&MyWildType, sizeof (wildtype), "wildtype",
"typedef struct { float stuff [4][4] ; char whatstuff [64] ; } wildtype ;") ;

where the "typedef" and trailing identifier "wildtype ;" might not be strictly required. I haven't written this yet but that is my plan. I will need both strings: the typedef (as a proper C typedef not a python typedef), and a valid identifier, so that I can inject them into a file, in C / C++, which I can then compile as a CUDA kernel or as a C-based OpenMP kernel.

@eriknw
Copy link
Contributor Author

eriknw commented Jun 4, 2022

Thanks for the consideration. I'm looking forward to a JIT, although I'm a little paranoid about struct alignment.

imho, it may be reasonable to distinguish between a name, which is flexible, and an identity, which must be a valid identifier. The former is user-facing, the latter compiler-facing. The name can be the identifier by default.

For example, if one is naming an array dtype, it would be natural to name it something like "INT64[8]" or "int64_t[8, 16]". I've worked with teams that take naming semantics very seriously.

@DrTimothyAldenDavis
Copy link
Owner

That might be feasible. Currently, my GxB_Type_new method has 2 parameters: the type name (a valid identified) and the type definition (a string with the typedef in C syntax:

#define GxB_MAX_NAME_LEN 128
GrB_Info GxB_Type_new           // create a new named GraphBLAS type
(
    GrB_Type *type,             // handle of user type to create
    size_t sizeof_ctype,        // size = sizeof (ctype) of the C type
    const char *type_name,      // name of the type (max 128 characters)
    const char *type_defn       // typedef for the type (no max length)
) ;

Internally, I hold this object as follows:

struct GB_Type_opaque       // content of GrB_Type
{
    int64_t magic ;         // for detecting uninitialized objects
    size_t header_size ;    // size of the malloc'd block for this struct, or 0
    size_t size ;           // size of the type
    GB_Type_code code ;     // the type code
    char name [GxB_MAX_NAME_LEN] ;       // name of the type
    char *defn ;            // type definition (currently unused)
} ;

The Type->defn field is currently unused, but I'll use in the future GPU and CPU JITs. The Type->name is fixed in size, to 128 characters max (including the NULL terminator). The Type->name field is returned to the caller via GxB_Type_name, and it will also be used in the JITs. The Type->defn is unbounded in size (I malloc it and keep a copy). I do not yet have a method for returning the Type->defn but I will add that once I start using it. Perhaps GxB_Type_defn could be a method that returns the full definition of the type, as a string.

There are 2 possible solutions for your case. One approach would be to include a comment in the Type->defn, so the string passed in could be

"/* int64_t[8, 16] */ typedef struct { int64_t stuff ; int64_t morestuff [2] ; } myint8_16_type ;"

That would solve 2 problems: the same Type->defn could be parsed (by the python interface) to extract the user-preferred name (which need not be an identifier), and I can still use this Type->defn in my JIT'd kernels. It would not change my API. Would that work? You could even include more python-specific details in the C comments of the typedef string. There would be no limit to its size, since I just scan the string for its string length, and malloc accordingly. My JIT will just treat this string as a black-box, by just handing it over to the compiler after inserting it into the JIT'd C or CUDA kernel.

You could even pass in the following as a single string for my defn parameter to GxB_Type_new, with the exact python definition given to me as a C comment:

/*
gb.dtypes.lookup_dtype({'x': np.datetime64, 'y': np.dtype((int, 3)), 'z': 'S5'}).name
"{'x': dtype('<M8'), 'y': INT64[3], 'z': dtype('S5')}"
*/
typedef struct { datetime64 x ; int64_t y [3] ; double z ; } mypythontype42 ;

The other approach would be to add another parameter to GxB_Type_new, so I would have a type name (any string) a type identifier (a valid C identifier) and a type definition (a string with a valid C typedef). That would be an API-breaking change, however, since I already have GxB_Type_new defined in its current form, for my future JIT.

@DrTimothyAldenDavis
Copy link
Owner

Regarding struct alignment: I will simply be trusting the C compiler to do whatever it normally does for a struct. I can't write a compiler to parse the typedef so I have to trust the C compiler to define it, and also to compile the C functions given to GxB_BinaryOp_new, etc.

@DrTimothyAldenDavis DrTimothyAldenDavis added the enhancement New feature or request label Jun 6, 2022
@eriknw
Copy link
Contributor Author

eriknw commented Jun 6, 2022

Thanks again for your time. Yeah, both of those options probably work for me. Does or will your serialization include the type_defn string? If so, then, yes, both solutions work for me. (Also, does your serialization have a version number?).

If users don't give me a valid identifier to name a UDT, I'll just give it one myself.

@DrTimothyAldenDavis
Copy link
Owner

My serialization does not yet include the type_defn string, but I could add it. The blob does include a version number. I currently don't check it since any of my serialized blobs written by any of my GrB library versions can be read/written by any other GrB library version.

If I were to add the type_defn string, I could do so in an upper-ward compatible way, where (say) SS:GraphBLAS v7.2 would write the blob with the type_defn included. That blob could be read by any of my SS:GraphBLAS versions; they would just ignore the added type_defn. But v7.2.0 could then read back the type_defn.

I do include the type identifier, as an uncompressed string.

@DrTimothyAldenDavis
Copy link
Owner

The GrB_get/set method can set an arbitrary string as the GrB_NAME of a type. It can be the entire type definition if you like. I don't need the name for the JIT, just for get/set. Will this resolve the issue?

I've added both the matrix name and the GrB_ELTYPE_STRING (the GrB_NAME of the type of the matrix) to the serialized blob, in the draft v8.1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants