Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Imaris IMS Reader: support for LZ4 compression and performance improvements #4217

Closed
wants to merge 6 commits into from

Conversation

marcobitplane
Copy link

Hello,

this pull request introduces a new file reader (ImarisImsReader.java) for Imaris IMS file format. Our goal is twofold:

  • ImarisImsReader supports LZ4 compression

  • ImarisImsReader reads data efficiently by avoiding multiple reads of 3D chunks through an internal caching mechanism

ImarisImsReader makes use of the new open source bpImarisReader library (https://github.com/imaris/ImarisReader) to read data and metadata from IMS files. This guarantees that LZ4 compressed files can be read. bpImarisReader is a native library written in C++, therefore it requires a Java interface. The interface is provided by the jImarisReader class added to ImarisImsReader.java and it is written using JNA (https://github.com/java-native-access/jna). For this reason, jna-5.14.0 has been added as a dependency of the formats-gpl component in its pom.xml file. The compiled versions of bpImarisReader for Windows and macOS also need to be included with Bio-Formats for the new reader to work: we added a new folder “native” at components/formats-gpl/src/loci/formats/native containing the native libs, so that the build process will include them in the formats-gpl jar (for the Ant build, we also needed to modify the build.properties file for the native libs to be included in the jar). ImarisImsReader automatically extracts the appropriate library to a temporary directory and loads it when needed.

Regarding performance, Bio-Formats requests data from a single x-y plane at a time and for all channels sequentially. This is not very compatible with how data is stored in IMS files (3D chunks) and results in the same data being read multiple times, causing a significant performance drop. To address this issue we implemented a caching mechanism that reads a stack of planes (as many planes as the chunk z-size) from all channels into a buffer, which only needs to be updated after all data in it has been read. If the size of the buffer would exceed 1GB, the reader falls back to reading the requested plane only. In addition to caching, avoiding loops over and copies of the data whenever possible also offered a significant performance gain. The exact performance improvement will depend on the details of the dataset: in our testing, for 3D datasets the new reader can be over an order of magnitude faster than the existing implementation.

@dgault
Copy link
Member

dgault commented Aug 14, 2024

Thanks for writing this up, @marcobitplane.

It would have been good to discuss this first, as unfortunately there are several issues that prevent us from including this new reader:

  • Native libraries are directly added to the repository, with no clear indication of how they were compiled.
  • Native libraries are loaded without reusing existing tools (i.e. native-lib-loader) that can unpack and load native libraries from a jar. The jna.library.path and jna.tmpdir system properties are instead modified within the reader constructor. Resetting system properties in a reader is not something we can accept, as it can have significant unintended consequences for applications that use Bio-Formats.
  • Some common platforms are not supported by the added native libraries; ImarisImsReader cannot be instantiated at all in these cases, as an exception is thrown:
Exception in thread "main" java.io.IOException: Native library (loci/formats/native/libbpImarisReader.so) not found in resource path (/home/melissa/bioformats/artifacts/bioformats_package.jar:.)
  at com.sun.jna.Native.extractFromResourcePath(Native.java:1141)
  at com.sun.jna.Native.extractFromResourcePath(Native.java:1076)
  at loci.formats.in.ImarisImsReader.<init>(ImarisImsReader.java:95)
  • Inserting ImarisImsReader ahead of the existing ImarisHDFReader in readers.txt means that for platforms where a compatible native library was provided, all files will be read with ImarisImsReader by default. This means that we need to ensure that ImarisImsReader reports the exact same dimensions, image data, etc. across all of our test data; if any differences, our regression tests would fail.

A few possible paths forward:

  • Define ImarisImsReader as an external reader. ImarisImsReader can then stay where it is in readers.txt, but with [type=external] appended. The reader itself and any supporting libraries would not be included in base Bio-Formats, but are something you would maintain and distribute separately. ZarrReader (https://github.com/ome/ZarrReader) is an example of how this works.
  • Focus on improving the existing ImarisHDFReader, in particular adding caching in openBytes. We'd be happy to consider something along those lines, and certainly understand that performance could be improved in that reader.
  • Substantially rework ImarisImsReader and the architecture surrounding use of native libraries, as indicated in the points above.

We certainly appreciate the effort to improve performance and allow for LZ4 support, but in the current state this change isn't something we can accept.

@marcobitplane
Copy link
Author

Thank you for taking the time to look into this and giving us feedback!

Our primary goal is to allow Bio-Formats and Fiji users to open any IMS file, of which LZ4 compressed files represent an important part. We’d be happy to rework ImarisImsReader to meet the requirements of Bio-Formats.

We decided to add the compiled libraries to the repository thinking it would simplify integration with Bio-Formats, but the source code with build instructions is available at https://github.com/imaris/ImarisReader. The library is written in C++, but the source code can be added to your build system if you would like to build it together with the rest of Bio-Formats. This should also allow building the library for any additional platform you wish to support, as mentioned in your third point.

I didn’t realize Bio-Formats already had a tool in place for extracting and loading native libraries, I’m happy to look into it. To clarify the current extraction and loading strategy: the Java interface and loading of the native library is written with jna and it seemed natural to use jna also to extract from the jar. In the current implementation, we first check if the jna.tmpdir is already defined and use that if so. If not, we use the default platform-dependent locations automatically specified by jna through their “extractFromResourcePath()” method. If Bio-Formats has a different recommendation for a temporary directory we can easily adjust the code to match that. Regarding setting jna.library.path, this is jna’s recommended way to make the target library available to the Java program, but again we can adjust that according to Bio-Formats requirements.

To your last point, I would also be happy to help sort through differences with the existing reader should they appear.

Thank you again for your feedback, hopefully we can resolve all the issues that were raised.

@joshmoore
Copy link
Member

Sorry for jumping in late, all. (This PR came in while I was blissfully away...)

We decided to add the compiled libraries to the repository thinking it would simplify integration with Bio-Formats...

Our apologies. This is something we should have been clearer about in the documentation. A PR has been opened to propose a new section on native blobs:

👉🏽 https://ome-contributing--21.org.readthedocs.build/en/21/third-party-policy.html#proprietary-native-code-blobs

As you likely know, the history of this revolved around the slidebook reader with DLLs being removed in #2289.

but the source code with build instructions is available at https://github.com/imaris/ImarisReader. The library is written in C++, but the source code can be added to your build system if you would like to build it together with the rest of Bio-Formats. This should also allow building the library for any additional platform you wish to support, as mentioned in your third point.

The current Bio-Formats maintainers will not be able to support the addition of a C++ build.

@ImarisRyder
Copy link

Ouch!
Not great. We would really like the reader to come with Bioformats natively. Is there no way for us to achieve that?

@marcobitplane
Copy link
Author

One direction we can take is making use of jHDF instead of our C++ ImarisReader library. jHDF is a pure Java implementation to read HDF5 files and already supports LZ4 compression: https://github.com/jamesmudd/jhdf. Would this be compatible with the requirements of Bio-Formats? If so, I’d be happy to rework this PR.

@NicoKiaru
Copy link
Contributor

Just pinging @mkitti (and also @tpietzsch) because as far as I remember he worked quite a lot on hdf5 in Java for Fiji and it'd be nice if the hdf5 library used in Fiji is compatible with the one chosen for bio-formats

@mkitti
Copy link
Member

mkitti commented Sep 20, 2024

Existing HDF5 dependency in Bioformats

To be clear, there is already a HDF5 C++ library incorporated as a dependency of bioformats here:

https://github.com/ome/bioformats/blob/develop/components%2Fformats-bsd%2Fpom.xml#L97-L101

It is the build from Bernd Rinn:
https://sissource.ethz.ch/sispub/jhdf5

That API is documented here:
https://openbis.ch/javadoc/jhdf5/19.04.1/overview-summary.html
https://unlimited.ethz.ch/display/JHDF/Documentation+Page

The underlying C library there is HDF5 1.10.9. The 1.10 branch is no longer supported by The HDF Group.

This API does not include direct chunk reading or writing, however.

JavaCPP HDF5

To update to the 1.14 branch, my recommendation is that we use the JavaCPP project:
https://bytedeco.org/javacpp-presets/hdf5/apidocs/

<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>hdf5</artifactId>
    <version>1.14.3-1.5.10</version>
</dependency>

To facilitate a transition from JHDF5 (ETH CISD/SIS), I have built janelia-jhdf5: https://github.com/JaneliaSciComp/janelia-jhdf5

This exposes almost all of the JHDF5 API using the JavaCPP HDF5 build.

The JavaCPP project is built for packaging C++ packages for Java.

Currently, the HDF5 build on Windows needs some adjustment, but this should otherwise work:

https://github.com/bytedeco/javacpp-presets/blob/master/hdf5%2Fcppbuild.sh

We could also use JavaCPP to package LZ4 and the HDF5 plugins.

Furthermore, we already incorporate JavaCPP into the SciJava POM:
https://github.com/scijava/pom-scijava/blob/master/pom.xml

I'm inviting @ctrueden to comment from the SciJava perspective.

Recommendation

Use the JavaCPP distribution of HDF5 and/or use JavaCPP to package the Imaris C++ code for Java. Then incorporate that here as a dependency.

Summary

  1. We have a HDF5 library in Bioformats already in the form of the ETH CISD/SIS JHDF5
  2. I recommend using JavaCPP HDF5.
  3. I have code to transition from ETH CISD/SIS JHDF5 to JavaCPP HDF5.
  4. JavaCPP would be the best way to package the Imaris binaries directly

@mkitti
Copy link
Member

mkitti commented Sep 20, 2024

One direction we can take is making use of jHDF instead of our C++ ImarisReader library. jHDF is a pure Java implementation to read HDF5 files and already supports LZ4 compression: https://github.com/jamesmudd/jhdf. Would this be compatible with the requirements of Bio-Formats? If so, I’d be happy to rework this PR.

To be very clear, the objection is not to any native code at all, but specifically to the inclusion of a binary blob here in this repository. Also, problematic is the lack of source code to build that binary blob.

As I noted above, bioformats already has dependencies, which include binaries blobs containing HDF5 code. Those dependencies also include build recipes for the packaged binary blobs.

The use of James Mudd's jHDF5 (the name collisions here are unfortunate) is interesting and may be easier to incorporate at the moment as a "pure" Java library. Having experienced development with the ETH SIS JHDF5, I'm a bit wary of depending on code that is not supported by The HDF Group for long-term projects such as this.

My preference here for the JavaCPP build and distribution of HDF5 is that it is a direct packaging of The HDF Group's C and C++ interfaces along with automated Java bindings to the entire interface.

@marcobitplane
Copy link
Author

Thank you very much for adding these points to the discussion.

I proposed James Mudd's jHDF because currently it seems to me the most straightforward way to implement a new ims reader since it already supports direct chunk reading, lz4 compression and multiple platforms, and in my early testing has a similar performance as our ImarisReader library.

However, I will ultimately try to follow whichever direction is recommended by Bio-Formats, be it jHDF, JavaCPP HDF5 or other.

@ctrueden
Copy link
Member

ctrueden commented Sep 24, 2024

I agree that jHDF looks intriguing. Because it is pure Java, we can certainly start managing it in pom-scijava in addition to the JavaCPP bindings. Bio-Formats however is not a SciJava-based project (it does not consume the pom-scijava BOM), and so what we do in pom-scijava will have little impact here.

That said, I strongly agree with @mkitti that inclusion of binary blobs into Bio-Formats is not ideal, and in particular it would be very unfortunate if the Bio-Formats build could not reproduce them from source. Many distributions of Linux for example forbid packages that cannot be fully and automatically rebuilt from source code, and in my view for good reason.

@mkitti
Copy link
Member

mkitti commented Sep 24, 2024

@ctrueden I do wonder if you think this might be a better fit for https://scif.io/ given the dependency structure and the stage of life of bioformats?

@ctrueden
Copy link
Member

I do wonder if you think this might be a better fit for https://scif.io/ given the dependency structure and the stage of life of bioformats?

@mkitti That is a big can of worms, which would be better discussed as a topic on the Image.sc Forum. Briefly:

  • Bio-Formats is alive and well, actively maintained and released, and I see no reason not to accept community-contributed format additions here.
  • SCIFIO has a Bio-Formats integration layer which makes (nearly) all Bio-Formats formats accessible via the SCIFIO API, so adding it here will also make it available for SCIFIO/SciJava users.
  • The SCIFIO API is n-dimensional while Bio-Formats is still essentially 5D, which makes it slightly awkward to shoehorn n-dimensional file formats into Bio-Formats; but:
  • Both Bio-Formats and SCIFIO are fundamentally built around 2D/3D planes, not n-dimensional blocks, which also makes it awkward and not optimally performant to shoehorn block-based formats into either; and:
  • What's really needed IMHO is a revamped SCIFIO written in a cross-language-friendly paradigm such as Kotlin Multiplatform or GraalVM Native Image or perhaps Rust with language bindings, and a concerted effort to update all existing Bio-Formats and SCIFIO formats to that revamped platform, for the benefit of a wider breadth of scientific technology stacks. But:
  • It would be a substantial effort, and we do not currently have funding to develop SCIFIO. To do it right would require several folks to heavily focus on it to create a quality product as quickly as possible, or else the cat-herding problem will not be overcome.

@sbesson
Copy link
Member

sbesson commented Sep 25, 2024

Bio-Formats currently depends on two separate libraries for handling HDF5 based file formats:

  • NetCDF Java (currently edu.ucar:cdm-core:5.3.3) is the historical library and is used by ImarisHDFReader, MINCReader and VeecoReader
  • ETH CISD/SIS JHDF5 (currently cisd:jhdf5:19.04.1) was introduced in 2015 and is used by BDVReader, CellH5Reader and CellH5Writer

Both libraries are routinely upgraded 1234 but maintaining two dependencies with comparable functionalities has a real cost. A long-term goal has been to address this discrepancy and unify all Bio-Formats readers on one of the two libraries.

Answering specifically the question raised in #4217 (comment), at present, we would consider contributions that would do any of the following:

  • update the existing Imaris HDF reader to add support for LZ4 compression 5
  • update the existing Imaris HDF reader to use the JHDF5 if this was useful
  • update the existing Imaris HDF reader to add caching and reduce the number of reads

In the current state of the project, we would not consider the addition of a new reader for the Imaris HDF format and/or the introduction of a new dependency for handling HDF5 to the core Bio-Formats repository. As suggested both in the discussion above and in a recent post of the image.sc forum6, such a large effort could live as a third-party extension managed and distributed separately.

Footnotes

  1. https://github.com/ome/bioformats/pull/3358

  2. https://github.com/ome/bioformats/pull/3788

  3. https://github.com/ome/bioformats/pull/3338

  4. https://github.com/ome/bioformats/pull/3745

  5. see https://github.com/ome/ome-codecs/pull/41 for an ongoing contribution to add LZ4 support to the codecs which may be relevant

  6. https://forum.image.sc/t/bio-formats-update/101734

@marcobitplane
Copy link
Author

Thank you all for the clarifications!

If I understand correctly, the addition of LZ4 support to the codecs (ome/ome-codecs#41) would allow the existing NetCDF based ims reader to open LZ4 compressed datasets. This would be fantastic.

At that point I will just try to add caching to ImarisHDFReader, which should not be a problem.

@mkitti
Copy link
Member

mkitti commented Sep 26, 2024

ETH CISD/SIS JHDF5 is likely at end-of-life stage. The last four commits on their git repository were initiated by me in 2022. The patch set against upstream HDF5 would need to be significantly reworked for 1.14 or the upcoming 1.16, and it is not clear those changes are still needed given more recent updates to upstream HDF5.

@marcobitplane
Copy link
Author

One question regarding how to use the new LZ4 codec with the existing ims reader: should netCDF-Java return the raw (compressed) chunks that would then be decompressed by the codec? If so, is it possible to get the raw chunks with the current api?

An alternative: "As of netCDF-Java version 5.5.1, a ucar.nc2.filter package is included, that provides a suite of implemented filters and compressors, as well as a mechanism for user-supplied filters" (https://docs.unidata.ucar.edu/netcdf-java/5.5/userguide/reading_zarr.html). I think LZ4 support could be provided nicely with this mechanism, if netCDF can be updated from 5.3.3 to at least 5.5.1. Would it be possible for Bio-Formats to upgrade netCDF-Java?

@melissalinkert
Copy link
Member

#4245 proposes to upgrade NetCDF-Java to 5.6.0; we should know tomorrow if that causes any problems.

@marcobitplane
Copy link
Author

I saw the NetCDF-Java upgrade has been merged, I will close this PR and open a new one based on the new version of the library. Thank you everyone again for the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants