Imaris IMS Reader: support for LZ4 compression and performance improvements #4217
Add jna dependency
Add native libs to formats-gpl jar for ant build
Thanks for writing this up, @marcobitplane. It would have been good to discuss this first, as unfortunately there are several issues that prevent us from including this new reader:
A few possible paths forward:
We certainly appreciate the effort to improve performance and allow for LZ4 support, but in the current state this change isn't something we can accept.
Thank you for taking the time to look into this and give us feedback! Our primary goal is to allow Bio-Formats and Fiji users to open any IMS file, of which LZ4-compressed files represent an important part, and we'd be happy to rework ImarisImsReader to meet the requirements of Bio-Formats.

We decided to add the compiled libraries to the repository thinking it would simplify integration with Bio-Formats, but the source code with build instructions is available at https://github.com/imaris/ImarisReader. The library is written in C++, but the source code can be added to your build system if you would like to build it together with the rest of Bio-Formats. This should also allow building the library for any additional platform you wish to support, as mentioned in your third point.

I didn't realize Bio-Formats already had a tool in place for extracting and loading native libraries; I'm happy to look into it. To clarify the current extraction and loading strategy: the Java interface and the loading of the native library are written with JNA, so it seemed natural to use JNA to extract from the jar as well. In the current implementation we first check whether jna.tmpdir is already defined and use it if so; otherwise we use the platform-dependent default locations that JNA selects through its extractFromResourcePath() method. If Bio-Formats has a different recommendation for a temporary directory, we can easily adjust the code to match it. Regarding setting jna.library.path, this is JNA's recommended way to make the target library available to the Java program, but again we can adjust that to Bio-Formats' requirements.

To your last point, I would also be happy to help sort through differences with the existing reader should they appear. Thank you again for your feedback; hopefully we can resolve all the issues that were raised.
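To illustrate the extraction strategy described above, here is a minimal stand-alone sketch of how the temporary directory could be resolved: honor `jna.tmpdir` when the embedding application has set it, otherwise fall back to the JVM's default temporary directory. The class and method names are hypothetical; JNA's own `extractFromResourcePath()` applies a similar (more refined) rule internally.

```java
import java.io.File;

public class NativeLibDir {

    /** Returns the directory the native library should be extracted to:
     *  jna.tmpdir when defined, otherwise the platform temp directory. */
    public static File resolveExtractionDir() {
        String jnaTmp = System.getProperty("jna.tmpdir");
        if (jnaTmp != null && !jnaTmp.isEmpty()) {
            return new File(jnaTmp);
        }
        // Default: the JVM's temporary directory. JNA refines this further
        // (e.g. per-user subdirectories), but that detail is omitted here.
        return new File(System.getProperty("java.io.tmpdir"));
    }

    public static void main(String[] args) {
        System.out.println(resolveExtractionDir());
    }
}
```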
Sorry for jumping in late, all. (This PR came in while I was blissfully away...)
Our apologies. This is something we should have been clearer about in the documentation. A PR has been opened to propose a new section on native blobs:
As you likely know, the history here revolves around the SlideBook reader, whose DLLs were removed in #2289.
The current Bio-Formats maintainers will not be able to support the addition of a C++ build.
Ouch!
One direction we can take is making use of jHDF instead of our C++ ImarisReader library. jHDF is a pure Java implementation to read HDF5 files and already supports LZ4 compression: https://github.com/jamesmudd/jhdf. Would this be compatible with the requirements of Bio-Formats? If so, I'd be happy to rework this PR.
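For context, a sketch of what reading an IMS image dataset through jHDF could look like. The HDF5 path layout shown here follows the published Imaris 5.5 file format, but treat the exact strings and the helper name as assumptions, not code from this PR; the jHDF calls are shown in a comment since the library is an optional dependency.

```java
public class ImsPaths {

    /** Builds the HDF5 path of an IMS image dataset for the given
     *  resolution level, time point and channel (assumed layout). */
    public static String imageDataPath(int level, int t, int c) {
        return "/DataSet/ResolutionLevel " + level
             + "/TimePoint " + t + "/Channel " + c + "/Data";
    }

    public static void main(String[] args) {
        // With jHDF (io.jhdf) on the classpath, reading would look roughly like:
        //
        //   try (HdfFile hdf = new HdfFile(Paths.get("example.ims"))) {
        //     Dataset data = hdf.getDatasetByPath(imageDataPath(0, 0, 0));
        //     Object raw = data.getData(); // jHDF decompresses LZ4 chunks itself
        //   }
        System.out.println(imageDataPath(0, 0, 0));
    }
}
```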
Just pinging @mkitti (and also @tpietzsch) because, as far as I remember, he worked quite a lot on HDF5 in Java for Fiji, and it would be nice if the HDF5 library used in Fiji were compatible with the one chosen for Bio-Formats.
### Existing HDF5 dependency in Bioformats

To be clear, there is already an HDF5 C++ library incorporated as a dependency of bioformats here: https://github.com/ome/bioformats/blob/develop/components%2Fformats-bsd%2Fpom.xml#L97-L101

It is the build from Bernd Rinn:
That API is documented here:
The underlying C library there is HDF5 1.10.9. The 1.10 branch is no longer supported by The HDF Group. This API does not include direct chunk reading or writing, however.

### JavaCPP HDF5

To update to the 1.14 branch, my recommendation is that we use the JavaCPP project:
To facilitate a transition from JHDF5 (ETH CISD/SIS), I have built janelia-jhdf5: https://github.com/JaneliaSciComp/janelia-jhdf5

This exposes almost all of the JHDF5 API using the JavaCPP HDF5 build. The JavaCPP project is built for packaging C++ packages for Java. Currently, the HDF5 build on Windows needs some adjustment, but this should otherwise work: https://github.com/bytedeco/javacpp-presets/blob/master/hdf5%2Fcppbuild.sh

We could also use JavaCPP to package LZ4 and the HDF5 plugins. Furthermore, we already incorporate JavaCPP into the SciJava POM: I'm inviting @ctrueden to comment from the SciJava perspective.

### Recommendation

Use the JavaCPP distribution of HDF5 and/or use JavaCPP to package the Imaris C++ code for Java. Then incorporate that here as a dependency.

### Summary
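As a concrete illustration of the recommendation, depending on the JavaCPP HDF5 preset from Maven could look like the fragment below. The version string is illustrative only; check the javacpp-presets releases for the current HDF5/JavaCPP pairing.

```xml
<!-- Hypothetical snippet: pull the JavaCPP HDF5 preset (all platforms).
     Version shown is an assumption, not taken from this PR. -->
<dependency>
  <groupId>org.bytedeco</groupId>
  <artifactId>hdf5-platform</artifactId>
  <version>1.14.3-1.5.10</version>
</dependency>
```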
To be very clear, the objection is not to native code as such, but specifically to the inclusion of a binary blob here in this repository. Also problematic is the lack of source code to build that binary blob. As I noted above, bioformats already has dependencies that include binary blobs containing HDF5 code; those dependencies also include build recipes for the packaged blobs. The use of James Mudd's jHDF (the name collisions here are unfortunate) is interesting and may be easier to incorporate at the moment as a "pure" Java library. Having experienced development with the ETH SIS JHDF5, I'm a bit wary of depending on code that is not supported by The HDF Group for long-term projects such as this. My preference for the JavaCPP build and distribution of HDF5 is that it is a direct packaging of The HDF Group's C and C++ interfaces along with automated Java bindings to the entire interface.
Thank you very much for adding these points to the discussion. I proposed James Mudd's jHDF because it currently seems to me the most straightforward way to implement a new IMS reader: it already supports direct chunk reading, LZ4 compression and multiple platforms, and in my early testing it performs similarly to our ImarisReader library. However, I will ultimately try to follow whichever direction is recommended by Bio-Formats, be it jHDF, JavaCPP HDF5 or another.
I agree that jHDF looks intriguing. Because it is pure Java, we can certainly start managing it in pom-scijava in addition to the JavaCPP bindings. Bio-Formats, however, is not a SciJava-based project (it does not consume the pom-scijava BOM), so what we do in pom-scijava will have little impact here. That said, I strongly agree with @mkitti that including binary blobs in Bio-Formats is not ideal; in particular, it would be very unfortunate if the Bio-Formats build could not reproduce them from source. Many Linux distributions, for example, forbid packages that cannot be fully and automatically rebuilt from source code, and in my view for good reason.
@ctrueden I do wonder if you think this might be a better fit for https://scif.io/ given the dependency structure and the stage of life of bioformats? |
@mkitti That is a big can of worms, which would be better discussed as a topic on the Image.sc Forum. Briefly:
Bio-Formats currently depends on two separate libraries for handling HDF5-based file formats:
Both libraries are routinely upgraded[1][2][3][4], but maintaining two dependencies with comparable functionality has a real cost. A long-term goal has been to address this discrepancy and unify all Bio-Formats readers on one of the two libraries. To answer specifically the question raised in #4217 (comment): at present, we would consider contributions that do any of the following:
In the current state of the project, we would not consider the addition of a new reader for the Imaris HDF format and/or the introduction of a new dependency for handling HDF5 to the core Bio-Formats repository. As suggested both in the discussion above and in a recent post on the image.sc forum[6], such a large effort could live as a third-party extension managed and distributed separately. Footnotes
Thank you all for the clarifications! If I understand correctly, the addition of LZ4 support to the codecs (ome/ome-codecs#41) would allow the existing NetCDF-based IMS reader to open LZ4-compressed datasets. This would be fantastic. At that point I will just try to add caching to ImarisHDFReader, which should not be a problem.
ETH CISD/SIS JHDF5 is likely at its end-of-life stage. The last four commits on their git repository were initiated by me in 2022. The patch set against upstream HDF5 would need to be significantly reworked for 1.14 or the upcoming 1.16, and it is not clear those changes are still needed given more recent updates to upstream HDF5.
One question regarding how to use the new LZ4 codec with the existing IMS reader: should netCDF-Java return the raw (compressed) chunks, which would then be decompressed by the codec? If so, is it possible to get the raw chunks with the current API? An alternative: "As of netCDF-Java version 5.5.1, a ucar.nc2.filter package is included, that provides a suite of implemented filters and compressors, as well as a mechanism for user-supplied filters" (https://docs.unidata.ucar.edu/netcdf-java/5.5/userguide/reading_zarr.html). I think LZ4 support could be provided nicely with this mechanism, if netCDF-Java can be updated from 5.3.3 to at least 5.5.1. Would it be possible for Bio-Formats to upgrade netCDF-Java?
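Conceptually, the user-supplied-filter mechanism boils down to a registry mapping an HDF5 filter id to a decoder that turns raw chunk bytes into uncompressed bytes; 32004 is the id registered with The HDF Group for the LZ4 filter. The interface and registry below are hypothetical stand-ins for illustration, not the actual `ucar.nc2.filter` classes.

```java
import java.util.HashMap;
import java.util.Map;

public class FilterRegistry {

    /** Minimal stand-in for a chunk filter: decodes raw chunk bytes. */
    public interface ChunkFilter {
        byte[] decode(byte[] raw);
    }

    /** HDF5 filter id registered for LZ4. */
    public static final int LZ4_FILTER_ID = 32004;

    private final Map<Integer, ChunkFilter> filters = new HashMap<>();

    public void register(int id, ChunkFilter f) {
        filters.put(id, f);
    }

    /** Decodes one chunk with the filter registered for the given id. */
    public byte[] decodeChunk(int filterId, byte[] raw) {
        ChunkFilter f = filters.get(filterId);
        if (f == null) {
            throw new IllegalStateException("No filter registered for id " + filterId);
        }
        return f.decode(raw);
    }
}
```

A real implementation would register an LZ4 decoder (e.g. backed by lz4-java) under id 32004; the identity filter used in testing below is only a placeholder.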
#4245 proposes to upgrade NetCDF-Java to 5.6.0; we should know tomorrow if that causes any problems. |
I saw the NetCDF-Java upgrade has been merged, I will close this PR and open a new one based on the new version of the library. Thank you everyone again for the help! |
Hello,
this pull request introduces a new file reader (ImarisImsReader.java) for the Imaris IMS file format. Our goal is twofold:
ImarisImsReader supports LZ4 compression
ImarisImsReader reads data efficiently by avoiding multiple reads of 3D chunks through an internal caching mechanism
ImarisImsReader makes use of the new open-source bpImarisReader library (https://github.com/imaris/ImarisReader) to read data and metadata from IMS files, which guarantees that LZ4-compressed files can be read.

bpImarisReader is a native library written in C++ and therefore requires a Java interface. The interface is provided by the jImarisReader class added to ImarisImsReader.java and is written using JNA (https://github.com/java-native-access/jna); for this reason, jna-5.14.0 has been added as a dependency of the formats-gpl component in its pom.xml file.

The compiled versions of bpImarisReader for Windows and macOS also need to be included with Bio-Formats for the new reader to work: we added a new folder "native" at components/formats-gpl/src/loci/formats/native containing the native libs, so that the build process includes them in the formats-gpl jar (for the Ant build, we also needed to modify the build.properties file for the native libs to be included in the jar). ImarisImsReader automatically extracts the appropriate library to a temporary directory and loads it when needed.
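To make the platform selection concrete, here is a small sketch of the os.name-to-resource mapping implied above. The folder layout and library file names are assumptions for illustration; since only Windows and macOS builds are bundled, other platforms map to null.

```java
public class NativeResource {

    /** Returns the jar resource path of the bpImarisReader build for the
     *  given os.name value, or null if no build is bundled.
     *  (Resource paths are hypothetical, not taken from the PR.) */
    public static String resourceFor(String osName) {
        String os = osName.toLowerCase();
        if (os.contains("win")) {
            return "native/bpImarisReader.dll";
        }
        if (os.contains("mac")) {
            return "native/libbpImarisReader.dylib";
        }
        return null; // no bundled build for this platform
    }
}
```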
Regarding performance, Bio-Formats requests data one x-y plane at a time, for all channels sequentially. This does not match how data is stored in IMS files (3D chunks) and causes the same data to be read multiple times, with a significant performance drop. To address this we implemented a caching mechanism that reads a stack of planes (as many planes as the chunk z-size) from all channels into a buffer, which only needs to be updated after all the data in it has been read. If the buffer would exceed 1 GB, the reader falls back to reading only the requested plane. In addition to caching, avoiding loops over and copies of the data wherever possible also offered a significant performance gain. The exact improvement depends on the dataset: in our testing, for 3D datasets the new reader can be over an order of magnitude faster than the existing implementation.
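The caching idea above can be sketched as follows. All names, the single-window eviction policy, and the omitted 1 GB fallback are simplifications for illustration, not the PR's actual implementation: a plane request loads a chunk-aligned z-stack for its channel, and subsequent requests within the same z-window (any channel) reuse cached stacks instead of re-reading chunks.

```java
import java.util.HashMap;
import java.util.Map;

public class PlaneCache {

    /** Loads chunkZ consecutive planes of one channel starting at z0. */
    public interface StackLoader {
        byte[][] loadStack(int channel, int z0, int chunkZ);
    }

    private final StackLoader loader;
    private final int chunkZ;
    private final Map<Integer, byte[][]> stacks = new HashMap<>(); // channel -> stack
    private int currentZ0 = -1; // z-origin of the cached window

    public PlaneCache(StackLoader loader, int chunkZ) {
        this.loader = loader;
        this.chunkZ = chunkZ;
    }

    /** Returns plane z of the given channel, refilling the cache on a miss. */
    public byte[] getPlane(int channel, int z) {
        int z0 = (z / chunkZ) * chunkZ;   // chunk-aligned window start
        if (z0 != currentZ0) {            // new window: evict all channels
            stacks.clear();
            currentZ0 = z0;
        }
        byte[][] stack = stacks.get(channel);
        if (stack == null) {              // first touch of this channel/window
            stack = loader.loadStack(channel, z0, chunkZ);
            stacks.put(channel, stack);
        }
        return stack[z - z0];
    }
}
```

With a chunk z-size of 16, reading planes 0..15 of each channel then triggers one chunked read per channel instead of sixteen.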