HDF5 2.0 Planning

Dana Robinson edited this page Nov 18, 2024 · 9 revisions

These are the topics/issues we're going to focus on in the next version of HDF5. For full transparency, we're going to try to track publicly-visible features (i.e., not refactoring) in GitHub so that everyone can see how the release is progressing. HDF5 2.0 issues we plan to address are organized into high-level parent issues (we're in the beta for this GitHub feature) with the specific issues as children. The parent issues are easy to spot since they start with "2.0:" and there are links to them in this document.

NOTE: This list is ambitious, so not everything here is realistically going to make it in.

Major 2.0.0 changes

The biggest changes/features we'll be making to the library in HDF5 2.0 are:

Move to semantic versioning (https://semver.org/)

People have been asking for this for a long time and we get many complaints about our existing scheme. All future HDF5 versions will be major.minor.patch.
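As a rough illustration of what major.minor.patch gives downstream code, here is a minimal Python sketch of a compatibility check. The function names and version strings are hypothetical and not part of HDF5. The only assumption is the semver.org rule that a new major number signals an incompatible API change, while minor and patch bumps stay backward compatible.

```python
def parse_version(text):
    """Split a 'major.minor.patch' string into a tuple of three ints."""
    parts = text.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected major.minor.patch, got {text!r}")
    return tuple(int(p) for p in parts)


def is_compatible(built_against, running):
    """Return True if 'running' can stand in for 'built_against'.

    Same major version, and (minor, patch) equal or newer.
    """
    built = parse_version(built_against)
    run = parse_version(running)
    return run[0] == built[0] and run[1:] >= built[1:]


if __name__ == "__main__":
    # Illustrative version numbers only.
    print(is_compatible("2.0.0", "2.1.3"))
    print(is_compatible("2.1.0", "3.0.0"))
```

With a scheme like this, a packager can decide from the version string alone whether an upgrade needs a rebuild or code changes.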

Update library defaults to provide better performance with cloud-optimized HDF5 and modern I/O hardware

We'll have more information about this in the winter, but we would like to revisit all the library defaults (cache sizes, etc.), do a few rounds of performance testing, and see if they still make sense in 2025.

Complex number support

We've been wanting to expand the type system for a long time. We added IEEE float16 support in the last release and now we'd like to add complex numbers.
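This page doesn't say how the new complex types will be laid out. For context, a common workaround today is to store each value as a (real, imaginary) pair of floats, which is the compound-type convention h5py uses. Here is a minimal Python sketch of that pairing using only the standard library. The helper names are hypothetical, and the interleaved little-endian float64 layout is an assumption for illustration, not the planned HDF5 2.0 format.

```python
import struct


def pack_complex(values):
    """Pack complex numbers as interleaved little-endian float64 (real, imag) pairs."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))
    return struct.pack(f"<{len(flat)}d", *flat)


def unpack_complex(buf):
    """Inverse of pack_complex: rebuild complex numbers from 16-byte pairs."""
    if len(buf) % 16 != 0:
        raise ValueError("buffer length must be a multiple of 16 bytes")
    count = len(buf) // 8
    flat = struct.unpack(f"<{count}d", buf)
    return [complex(flat[i], flat[i + 1]) for i in range(0, count, 2)]


if __name__ == "__main__":
    data = [1 + 2j, -0.5 + 0.25j]
    assert unpack_complex(pack_complex(data)) == data
```

A native complex datatype would let tools recognize such data directly instead of relying on each application's naming convention for the two fields.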

Remove Autotools support

Maintaining two build systems is unnecessary overhead. Keeping the two systems in sync and reinventing the wheel when we have to perform more complicated testing takes non-trivial engineering resources. Starting with the 2.0.0 release, all Autotools files will be removed and we will no longer support Autotools builds.

Remove the C++ wrappers

These have been (mostly) unmaintained for some time and we have neither the plans nor the resources to bring them up to speed with modern C++.

Remove the HDF5 <--> GIF tools

These tools have unfixed CVE issues, are not actively maintained, and are an odd fit for the library.

Other 2.0.0 changes

The heading for each of the following topics points to the GitHub parent issue for all the specific child issues.

Since we're dropping Autotools support, it's imperative that CMake works well. We'll make a pass over both build systems to ensure that CMake does everything the Autotools do, simplify the build system code, revamp the install docs, and work to fix all the open GitHub issues. We're also hoping to make the CMake-built compiler wrappers (e.g., h5cc) behave more like their Autotools-built counterparts.

Over the past few years, we've dramatically expanded our CI and we'll continue to do that for HDF5 2.0.0. We now report to my.dash and you can see the output of our GitHub CI under the GHDaily heading, as well as test results from many HPC systems. Improvements over the next few months will include testing our develop branch with the development trunk of both OpenMPI and MPICH, testing HDF5 with the HighFive C++ wrapper, and adding missing configurations.

HDF5 is heavily used on HPC systems, so we'll continue to fix bugs as they arise.

There are several bugs in the CMake code that deals with building the compression filters that we'd like to fix. We'll also be improving the hdf5_plugins repo.

CVE reports, fuzzer findings, and sanitizer errors are potential security problems, so we prioritize fixing them. We are currently CVE-free and hope to have all oss-fuzz issues closed by the release date. We're also hoping to have the library be sanitizer-clean by the time we release, and then add CI checks to ensure it stays that way.

These are also high-priority issues. We hope to have all of these fixed by the release date. The highest priority of these is the memory backed files copy issue.

Cloud-optimized HDF5 is very important to us, and we try to get these tweaks and bugfixes out as soon as we can. HDF5 2.0.0 should support s3 URLs and AWS environment variables in the read-only S3 VFD.
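To make the s3 URL and environment variable idea concrete, here is a minimal Python sketch of splitting an s3:// URL into bucket and object key, and collecting credentials from the environment. This is not the VFD's implementation. The function names and the example URL are hypothetical. The three variable names are the standard AWS ones, but which variables the VFD will actually honor is an assumption here.

```python
import os
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(url)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not an s3://bucket/key URL: {url!r}")
    return parsed.netloc, key


def aws_settings(env=None):
    """Collect common AWS settings from a mapping (defaults to os.environ).

    Missing variables map to None.
    """
    env = os.environ if env is None else env
    names = ("AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    # Hypothetical bucket and object names.
    print(split_s3_url("s3://example-bucket/data/file.h5"))
```

Reading settings from the environment means the same script can run against different accounts or regions without code changes.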

We have a few problems with floating-point data that we'd like to fix in 2.0. Most importantly, Nvidia's nvhpc fails some of our long double tests, and we have long-standing problems converting IBM's POWER long doubles. It'd also be nice to get predefined datatypes set up for FP8, and both flavors of FP4, which are all important in machine learning.
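For readers unfamiliar with these narrow formats, here is a minimal Python sketch of decoding a small sign/exponent/mantissa bit pattern into a Python float. The function name is hypothetical and this is not HDF5 code. It assumes an IEEE-style bias and subnormals, and it deliberately refuses the all-ones exponent, because the variants disagree there (some use it for infinity/NaN, others for ordinary finite values).

```python
def decode_small_float(bits, exp_bits, man_bits):
    """Decode an IEEE-style small float given as an unsigned integer bit pattern.

    Layout assumed: 1 sign bit, then exp_bits exponent bits, then man_bits
    mantissa bits. The all-ones exponent is rejected because its meaning
    depends on the specific format variant.
    """
    total_bits = 1 + exp_bits + man_bits
    if not 0 <= bits < (1 << total_bits):
        raise ValueError("bit pattern does not fit the given format")

    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1

    if exponent == (1 << exp_bits) - 1:
        raise ValueError("all-ones exponent: meaning depends on the format variant")
    if exponent == 0:
        # Subnormal: no implicit leading 1.
        return sign * (mantissa / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)


if __name__ == "__main__":
    print(decode_small_float(0x3C, exp_bits=5, man_bits=2))  # 8-bit, 5 exponent bits
    print(decode_small_float(0x38, exp_bits=4, man_bits=3))  # 8-bit, 4 exponent bits
```

Predefined datatypes would spare applications from hand-writing this kind of conversion and from agreeing on the bit layout out of band.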

Windows has long been a second-class citizen when it comes to HDF5, and it'd be nice to make that less so in HDF5 2.0. The most important issue to fix is our spotty ability to handle Unicode file names on Windows. Getting CI set up to ensure any fixes don't break in the future will be challenging, but we're hoping to do that in this release. We're also working on better support for MinGW, especially with MSYS2. Finally, code for a VFD based on Win32 API calls was donated to us years ago, and it'd be nice to modernize that code, get it to pass our full CI, and get it into the library.

HDF5 documentation has many rough edges. This release should see some improvements to the reference manual and user guide. In particular, we'd like to add a section to the UG that covers cloud-optimized HDF5 and an upgrading guide for people looking to move to a newer version of the library. As mentioned earlier, we also plan to rework our install docs.

There are several issues that don't seem to fit into the above categories:

  • We'll look into moving to Fortran 2008, as that might help us cross-compile when also building the Fortran wrappers.
  • For Java, we'll add wrapper functions for the direct chunk I/O functions. We'll also try to get the Java code into Maven.