Problems with HDF5 as a cross-platform binary data container #9
Apologies for my ignorance, but can we get a definition or list of requirements for a "cross-platform binary data container"?
Did you have a chance to take a look at H5CPP? I also recommend the lightning talk and these slides. In my understanding, with the exception of
Hi @gheber and @steven-varga, thanks for taking an interest. I have not taken a look at H5CPP, but had been trying to use the HDF5 C++ bindings, which is perhaps in large part responsible for my woes. What we need is to take an array and read/write it to/from disk in a cross-platform way.

My primary pain point with HDF5 was having to pick an HDF5 type when copying to/from arrays. My feeling was that I'm likely to make a mistake and pick the wrong type for some platforms. For example, if I'm copying an array into a dataset, I need to select an HDF5 type for the new dataset. As far as I can tell, there's no built-in mechanism in HDF5 to get an HDF5 type for a particular C++ type, so I wrote one of my own. However, this feels very error prone: for example, it's unclear which HDF5 native type to pick for

@steven-varga, maybe you can shed some light on how this is handled in H5CPP?
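To make the kind of mapping described above concrete, here is a minimal sketch of a compile-time C++-type-to-HDF5-type trait. The `h5_type_name` function and the returned strings are hypothetical stand-ins (not code from this issue); a real implementation would return `hid_t` constants such as `H5T_NATIVE_INT32` instead of their names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical mapping from a C++ type to the *name* of an HDF5 native
// type. A real implementation would return hid_t values instead.
template <typename T>
std::string h5_type_name() {
    if constexpr (std::is_same_v<T, std::int32_t>)       return "H5T_NATIVE_INT32";
    else if constexpr (std::is_same_v<T, std::uint32_t>) return "H5T_NATIVE_UINT32";
    else if constexpr (std::is_same_v<T, std::int64_t>)  return "H5T_NATIVE_INT64";
    else if constexpr (std::is_same_v<T, std::uint64_t>) return "H5T_NATIVE_UINT64";
    else if constexpr (std::is_same_v<T, float>)         return "H5T_NATIVE_FLOAT";
    else if constexpr (std::is_same_v<T, double>)        return "H5T_NATIVE_DOUBLE";
    // The awkward case: std::size_t aliases *some* unsigned integer
    // type, so the implementer has to choose a stand-in by width.
    else if constexpr (std::is_same_v<T, std::size_t>)
        return sizeof(std::size_t) == 8 ? "H5T_NATIVE_UINT64" : "H5T_NATIVE_UINT32";
    else
        static_assert(!sizeof(T), "no HDF5 type mapping for T");
}
```

On platforms where `std::size_t` is the same type as `std::uint64_t`, the earlier branch already matches it, which is exactly the kind of platform-dependent aliasing that makes this error prone.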
In the official version you have the option of:
Here is a link to examples; unfortunately the website has been down for a while because of an unrelated computer security event. When my schedule allows, I will restore/resume development. Looking at your example: it is a good start, but there is a lot more to it to get it right, and in my experience it should look structurally different from yours. If you are interested in the details: thanks to the HDFGroup, a few years ago I spoke about this at various events, and you may be able to find the links on the HDFGroup C++ mailing lists; less fortunately, the material is on the currently downed website. Please use supported

Most functionality of the HDF5 C API is implemented/supported; it works with MPI, it is suitable for financial/trading systems, and many labs have been using it successfully around the world.
Sorry if I'm late to the party, but why do you think HDF5 doesn't handle the byte order for you? When you create a dataset you do, in fact, have to specify the way you want to store the data, but you can then use H5T_NATIVE_INT, etc. when you call H5Dread() to munge the data into and out of whatever your buffer datatype is. When you do this, HDF5 will handle the type conversion for you, including BE/LE byte swapping.

What HDF5 will not do for you, however, is guess what is going to be an efficient datatype for you. You have to decide if you want a BE or LE datatype, for example. Given the low number of BE systems these days, I'd probably go with LE so you don't waste time munging bytes every time you perform I/O on LE systems. Also, C and C++'s original type system is vague on purpose, so you'll have to figure out what is appropriate for storing, say, long integers, that will work across the systems you support. You can, of course, specify the native type as the dataset type when you create the dataset, but that would potentially make your HDF5 files differ across platforms. HDF5 also isn't going to guess what's a great equivalent for system-dependent types like size_t, either. By design those are system dependent, so like the platform-dependent legacy C/C++ integer types, no container will be perfect for all systems.

This shouldn't be too onerous, though: most systems you'll find in the real world will be LE, so BE will be less of a concern unless you are on SPARC or Power. Most of the legacy integer types are the same across all platforms aside from long, which differs between Windows (LLP64) and everything else (LP64), so you can use H5T_STD_(U|I)64LE, which will work for any system. If you are concerned with storing system-y things like size_t, picking H5T_STD_U64LE would be appropriate, since I don't know of any realistic plans for 128-bit address spaces.
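The BE/LE swapping described above can be illustrated without HDF5 at all: when the file type is little-endian (e.g. H5T_STD_I64LE) and the host is big-endian, the library's type conversion performs the equivalent of the byte reversal below. This is a self-contained sketch, not HDF5 internals; the function names are illustrative.

```cpp
#include <cstdint>
#include <cstring>

// Detect host endianness at run time (C++20's std::endian gives a
// compile-time answer, but this works everywhere).
bool host_is_little_endian() {
    const std::uint16_t probe = 0x0102;
    std::uint8_t first;
    std::memcpy(&first, &probe, 1);
    return first == 0x02;
}

// Reverse the bytes of a 64-bit value: the core of what HDF5's type
// conversion does when H5Dread()/H5Dwrite() pairs a little-endian file
// type with a big-endian memory type, or vice versa.
std::uint64_t byteswap64(std::uint64_t v) {
    std::uint8_t b[8];
    std::memcpy(b, &v, 8);
    std::uint8_t r[8] = {b[7], b[6], b[5], b[4], b[3], b[2], b[1], b[0]};
    std::uint64_t out;
    std::memcpy(&out, r, 8);
    return out;
}
```

The point of the comment above is that you never call anything like this yourself: choosing H5T_STD_I64LE as the file type and H5T_NATIVE_LONG (or similar) as the memory type is enough, and the swap happens only on BE hosts.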
I'm creating this to record a few issues I'm encountering using HDF5 in C++. It's possible some of these issues are fixed in the Python bindings or by NetCDF, but they seem to remain as issues in C/C++.
- A significant amount of work is (likely) going to be required to make HDF5 deal with endianness properly in all cases. HDF5 has two kinds of types: standard types, which are fully defined with a certain bit width, endianness, etc., and native types, which are platform defined. `STD_I32LE`, a two's complement standard 32-bit signed integer in little-endian format, is one such standard type, and `NATIVE_INT` is a native type (corresponding to a C/C++ `int`). When we store an array in an HDF5 dataset, we must pick a native type that corresponds to the in-memory representation and a standard type that corresponds to what we want stored on disk. When we read, we just give a native type corresponding to the read buffer, since the in-file format is already set in stone. For integer types this works well, as for variants of `int` we can easily test whether we have an `int32_t`, `uint32_t`, etc. However, for some standard types like `std::size_t`, `std::wchar_t`, and even `char`, which are typically equivalent to, but distinct from, some integer type, we (the implementer) have to pick. I've picked some defaults that work well on Intel processors, and likely on most modern systems, but personally I would like HDF5 to handle the endianness issue for me here.
- No (or at least poor) support for storing UTF-8 text in datasets.
- No support for bit arrays.
- For user-defined types, it certainly seems like some work would be required on the user's part, as there is an MPI-like specification for user-defined data types.
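The ambiguity around `char`, `std::wchar_t`, and `std::size_t` mentioned in the first item can be seen directly: each is a distinct type even when its width matches a fixed-width integer, so a dispatch on `int32_t`/`uint32_t`-style aliases misses them. A small sketch (the widths noted in comments are typical, not guaranteed):

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Plain char is a distinct type from signed char and unsigned char, so
// matching only on the intN_t aliases never classifies it.
constexpr bool char_matches_fixed_width =
    std::is_same_v<char, std::int8_t> || std::is_same_v<char, std::uint8_t>;
static_assert(!char_matches_fixed_width,
              "plain char is never one of the intN_t aliases");

// wchar_t's width is platform-dependent (2 bytes on Windows, usually 4
// elsewhere), so an on-disk HDF5 type must be chosen by hand.
constexpr std::size_t wchar_width = sizeof(wchar_t);

// size_t is guaranteed to be *some* unsigned integer type, but which
// one varies, so the implementer must also pick an on-disk width here.
static_assert(std::is_unsigned_v<std::size_t>);
constexpr std::size_t size_t_width = sizeof(std::size_t);
```

This is why the defaults have to be chosen per platform: the language gives no single fixed-width integer that these types are guaranteed to be.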