Refactoring? #266

Open
hmaarrfk opened this issue Aug 12, 2024 · 14 comments

Comments

@hmaarrfk

Hey Christoph,

Thank you for championing TIFF in Python all these years! I really don't think the Python scientific ecosystem would be the same without all your efforts.

We are doing some work on optimizing reading and writing of TIFF files; often it comes down to the layout of the metadata and how the data is written.

I'm wondering if some refactors would be welcome.
There are two places where I think we can help (slowly):

  1. Refactoring the write function:
    # TODO: refactor this function
    • The use of del is usually a telltale sign that you are trying to ensure that temporary variables are not accidentally accessed, which makes this a good place to start refactoring.
  2. Refactoring the FileHandle usage: https://github.com/cgohlke/tifffile/blob/master/tifffile/tifffile.py#L14570
    • We actually found that opening files with specific Linux flags helps our performance. We adjusted the write_array function of our FileHandle. However, lines like:
      fh = FileHandle(file, mode=mode, name=name, offset=offset, size=size)
      make it really hard to make use of these optimizations (see the sketch after this list). It would be great if we could experiment with FileHandle optimizations ahead of working to submit them upstream.
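
For illustration, a minimal sketch of the kind of customization we have in mind. The subclass name and fast path are hypothetical; only tifffile.FileHandle and its write_array method come from the library:

```python
import numpy
from tifffile import FileHandle


class TunedFileHandle(FileHandle):
    """Hypothetical FileHandle with a custom write_array fast path."""

    def write_array(self, data: numpy.ndarray):
        # A tuned path (aligned buffers, O_DIRECT on Linux, ...) would go
        # here; this sketch just falls back to the stock behavior.
        return super().write_array(data)


# Because tifffile hard-codes `fh = FileHandle(...)` internally, there is
# currently no clean way to inject a subclass like this.
```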

Let me know if you have any appetite for this kind of (collaborative) work.

Best,

Mark

@cgohlke (Owner) commented Aug 12, 2024

Hi @hmaarrfk,

We are doing some work on optimizing reading and writing of TIFF files; often it comes down to the layout of the metadata and how the data is written.

The TIFF format was designed such that it is easy to optimize for specific cases. It is really impossible to efficiently support all TIFF options and TIFF-based variants, and to handle possible corruptions. What layouts/metadata are you trying to optimize? Maybe it is easier to implement from scratch.

Refactoring the write function:

I don't think I have the time to refactor the TiffWriter.write method. It was made to dump numpy arrays contiguously to TIFF. Everything else was squeezed into the method later and somehow works. Unfortunately, that makes it impossible to write other interesting layouts such as COG (neither can libtiff, I think). If I had the time, I would try to support the COG layout.

Refactoring the FileHandle
We actually found that opening files with specific Linux flags helps our performance. We adjusted the write_array function of our FileHandle

I would be interested to see those optimizations. I experimented with Windows native file handles, which were significantly faster than the C runtime. The ScanImage reader takes that approach, but it is optimized to read the specific data layout in ScanImage files. The reasons for using FileHandle are in the class docstring. FileHandle is part of the public API and I use it in my other libraries too. For sure FileHandle slows things down and there is room for optimizations.

@hmaarrfk (Author)

I don't think I have the time to refactor the TiffWriter.write method.

To be clear, I wouldn't expect you to do this on your own. I might need assistance early on to enable GitHub Actions (or some other CI of your choice) to ensure we are testing the same use cases (I believe your test suite is quite comprehensive). We would be adding test cases as we refactor things to ensure that the behavior stays consistent.

If you have other projects on GitHub that use FileHandle, it would be interesting to add them to the test suite (or potentially add CI there too).

To start, I might just need you to "enable GitHub Actions" if you aren't opposed to it, and I can start to make PRs. You might need to accept a dummy PR from me to avoid giving me too many rights (GitHub sometimes disallows automatically running Actions on PRs from new contributors to a project).

What layouts/metadata are you trying to optimize? Maybe it is easier to implement from scratch.

Generally pretty simple:

  1. Multi-page TIFFs where each page is contiguous in memory. Currently there is an np.ascontiguousarray call that takes all the pages prior to writing and makes the entire thing contiguous.
  2. Appending pages to a TIFF. (A sketch of both patterns follows this list.)
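
A sketch of both patterns using tifffile's public API (shapes and file names are illustrative):

```python
import numpy
import tifffile

frames = numpy.zeros((64, 512, 512), dtype=numpy.uint16)  # illustrative stack

# 1. Multi-page TIFF with pages written in contiguous mode.
with tifffile.TiffWriter('stack.tif') as tif:
    for frame in frames:
        tif.write(frame, contiguous=True)

# 2. Appending a page by reopening an existing file.
tifffile.imwrite('stack.tif', frames[0], append=True)
```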

Metadata is pretty limited to "ImageJ's baseline functionality" for now, but eventually we'll likely opt to support OME-TIFF (many of our users use ImageJ, so the extra step of using Bio-Formats is somewhat annoying; maybe one day I'll pull out my Java skills and try to help out there).

The truth is that on a "standard SSD with standard tifffile" we can get 1-1.5 GB/s. With our optimizations we can reach 3 GB/s (mostly limited by PCIe Gen 3 x4), which effectively saturates our SSDs.

Maybe it is easier to implement from scratch.

I think this would be a shame. Your work has been super useful to me and many others, and I think we can generally help each other build better data management flows.

@cgohlke (Owner) commented Aug 12, 2024

on a "standard SSD with standard tifffile" we can get get 1-1.5GB/s. With our optimizations we can reach 3 (mostly limited by PCIe Gen3x4) which effectively saturates our SSDs.

On my Windows system, I get up to 3.3 GB/s writing a 2 MB image 20000 times in contiguous mode (that is on an encrypted volume, with virus scanning disabled). A potential 2x improvement would be nice.
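
Roughly, that benchmark looks like this (the image shape and file name are illustrative reconstructions, not the exact script):

```python
import time

import numpy
import tifffile

# Write a ~2 MB page 20000 times in contiguous mode and report throughput.
data = numpy.zeros((1024, 1024, 2), dtype=numpy.uint8)  # ~2 MB per page
start = time.perf_counter()
with tifffile.TiffWriter('bench.tif') as tif:
    for _ in range(20000):
        tif.write(data, contiguous=True)
elapsed = time.perf_counter() - start
print(f'{data.nbytes * 20000 / elapsed / 1e9:.2f} GB/s')
```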

Multi-page TIFFs where each page is contiguous in memory.

That should work well with TiffWriter.write. I'm not sure what's missing that makes refactoring necessary?

Currently there is an np.ascontiguousarray call that takes all the pages prior to writing and makes the entire thing contiguous.

Do you mean the ascontiguousarray call in the FileHandle.write_array method? That is to work around a bug in numpy. Is that fixed in numpy 2?

Appending pages to a TIFF.

Do you mean reopening the file and appending pages? That is generally a bad idea performance-wise, and metadata may get out of sync.

If you have other projects on GitHub that use FileHandle, it would be interesting to add them to the test suite

Mostly czifile (which is tested with test_tifffile) and some private scripts. CziFile has been effectively abandoned for many years. It should be OK to refactor FileHandle, even with breaking changes.

To start, I might just need you to "enable GitHub Actions" if you aren't opposed to it, and I can start to make PRs. You might need to accept a dummy PR from me to avoid giving me too many rights (GitHub sometimes disallows automatically running Actions on PRs from new contributors to a project).

I do not use GitHub to develop my projects; it's used for publishing the code and for support. Also, testing is local on Windows, since most test files are not public and the tests read and write on the order of 100 GB from disk.

Can you develop on a fork? The tests should be able to skip all files and features that are inaccessible on CI.

@hmaarrfk (Author)

On my Windows system, I get up to 3.3 GB/s

I will hand it to Windows: getting 3.3 GB/s is actually quite hard on Ubuntu while sticking to the default settings.

PS: Writing over the same image many times might trigger some shortcuts in your SSD, like writing to a temporary cache.

Can you develop on a fork? The tests should be able to skip all files and features that are inaccessible on CI.

I can definitely do that, but the first few commits will likely be to set up the CI; these tend to be on the order of 500 LOC to get Windows/macOS/Linux up and running.

What would the workflow be:

  1. Use my fork.
  2. You observe and comment on the PR to my own fork.
  3. Send you a patch by email?
  4. Get it merged here when you are OK with it?

I'm OK with any workflow of your choosing; this is your project, after all.

Currently there is an np.ascontiguousarray call that takes all the pages prior to writing and makes the entire thing contiguous.

Do you mean the ascontiguousarray call in the FileHandle.write_array method? That is to work around a bug in numpy. Is that fixed in numpy 2?

Even if it's fixed in numpy 2, I don't think I'm quite ready to drop support for 1.26 yet (I am still on it).

It seems like you may be open to a few contributions that help the overall flow. Let me try to find some time in the coming weeks/months to start on a few low-hanging fruits.

@cgohlke (Owner) commented Aug 12, 2024

I will hand it to Windows: getting 3.3 GB/s is actually quite hard on Ubuntu while sticking to the default settings.

PS: Writing over the same image many times might trigger some shortcuts in your SSD, like writing to a temporary cache.

That might be the case. Or the encryption took a shortcut? Some time ago I had a Python extension using the Windows Driver Kit to reset the Windows file system cache for specific files.

It seems like you may be open to a few contributions that help the overall flow. Let me try to find some time in the coming weeks/months to start on a few low-hanging fruits.

Yes, thank you. Just ping me or reference this issue if you have something to review.

@cgohlke (Owner) commented Aug 12, 2024

For a 50 GB file, I got write speeds around 2 GB/s with tifffile. The results are indistinguishable from writing bytes to a Python binary file handle (see the reference loop below). I don't see what could be done to improve that in tifffile by means of refactoring, other than switching to an OS file handle meant specifically for streaming.
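
The reference loop, reformatted as runnable code (the file name is illustrative):

```python
# Write 50 GB as 25000 chunks of 2 MB through a plain binary file handle.
buf = b'0' * (1024 * 1024 * 2)
with open('reference.bin', 'wb') as fh:
    for _ in range(25000):
        fh.write(buf)
```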

@hmaarrfk (Author)

There are a few tricks that I've been promoting for large file access (you can search for my username across GitHub):

  1. Ensuring your buffer and the destination offset are both aligned to 4096.
  2. Bypassing the kernel cache altogether with O_DIRECT on Linux.
  3. Keeping the chunk size above roughly 1 MB in Python, so that per-call overhead doesn't eat up the performance.

Essentially, unaligned accesses kill hardware transfer patterns, so your peripheral needs to go through a bounce buffer allocated in the kernel. For this you really want to make sure your metadata is laid out well, ideally in as few 4096-byte pages of the file as possible, so that it can be read back quickly later.

O_DIRECT is admittedly a hack, but I tried the madvise stuff and it didn't have the desired effect. I'm pretty convinced that the tiny ARM processors on the SSD controller are faster than the big, over-designed kernel algorithms for the regular access patterns of big image data (large contiguous writes and reads).
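
A minimal Linux-only sketch of tricks 1 and 2 (the file name is illustrative, and a real implementation should query the device's logical block size):

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size

# O_DIRECT bypasses the kernel page cache, but requires the buffer
# address, file offset, and transfer size to all be BLOCK-aligned.
fd = os.open('scratch.bin', os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    # Anonymous mmap memory is page-aligned, which satisfies O_DIRECT.
    buf = mmap.mmap(-1, 256 * BLOCK)  # 1 MiB chunk, per trick 3
    os.write(fd, buf)
finally:
    os.close(fd)
```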

There are likely a few small changes we can make to make things thread-safe by taking advantage of system calls that take the file position as an argument.
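
Presumably that means positional I/O such as pread/pwrite; a tiny sketch:

```python
import os

# os.pwrite takes an explicit offset, so concurrent threads do not race
# on the shared file position the way seek() followed by write() does.
fd = os.open('scratch.bin', os.O_WRONLY | os.O_CREAT, 0o644)
os.pwrite(fd, b'\x00' * 4096, 8 * 4096)  # write one page at offset 32768
os.close(fd)
```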

All in all, tifffile has been rock solid for us. A few small refactors might just let us expose more features as we start to enable compression (which can easily break trick 1).

@hmaarrfk (Author)

Also note that without enterprise drives, your SSD can lie to you about sustained speeds: for the first few GB on the smaller consumer-grade drives, and for up to 100-200 GB on the larger ones. It makes benchmarking really annoying.

@hmaarrfk (Author)

This tool doesn't run on Linux, but I do think these are numbers that we are capable of approaching:
https://crystalmark.info/en/software/crystaldiskmark/

You can also see some benchmarks for consumer SSDs:
https://www.tomshardware.com/reviews/samsung-990-pro-ssd-review/2
which show that they sustain 7 GB/s for about 50 seconds, so maybe 250-400 GB of "faster" storage?

@cgohlke (Owner) commented Aug 13, 2024

The writes in the Python reference code should be aligned, and they are not significantly faster than tifffile. Everything seems to depend on the state of the file system cache. Bypassing the CRT and the file cache seems like the only way to significantly improve things for tifffile, assuming chunk sizes and storage media are already optimal.

There is an align parameter to control the position of the first segment in the file, but it makes no difference in this case for writing; it was added for efficient memory mapping with aligned reads/writes. You might be better off using a pre-created, empty TIFF file created by tifffile.memmap, together with a file handle optimized for streaming.
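
A sketch of that approach (shape, dtype, and file name are illustrative):

```python
import numpy
import tifffile

# Pre-create an empty TIFF and fill it through a memory map; the pixel
# data could instead be written by any custom, streaming-optimized file
# handle once the offsets are fixed.
mm = tifffile.memmap('pre.tif', shape=(64, 512, 512), dtype=numpy.uint16)
mm[0] = 42   # fill pages in place
mm.flush()
```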

TIFF is really not a high-performance large-file format. The specification recommends 8 KB chunks and requires only 2-byte alignment. It's a bad match for the large overhead of Python/numpy/zarr. TIFF readers should make no assumptions about the positions of any structure in the file (except for the file header).

Are you suggesting aligning many offsets in the TIFF file (page structures, large tag values, large strips/tiles) to 4096 instead of 2? Some of those should be easy to add, but at least for strips/tiles that will probably not work well with tifffile reading.

There are some optimizations to read chunked but contiguous image data in larger batches. I never really confirmed how effective they are.

Regarding metadata: often "global" metadata are available in the first page (like ImageJ or OME-XML strings in the first ImageDescription tag), which should make it efficient to read. However, often those metadata are written last, appended to the end of the file. Really bad for performance. Then there are TIFF files where metadata are distributed among millions of pages in the file...

which show that they sustain 7 GB/s for about 50 seconds, so maybe 250-400 GB of "faster" storage?

I was testing on the 990 Pro 4 TB and was not able to get anything close to the 6-7 GB/s sequential writes (reported by Samsung software) with Python.

@hmaarrfk (Author)

I guess I'm in a uniquely privileged position, in that I can write the TIFF files the way I want and read them the way I want.

I may be wrong, but I really do think our application hits above 2 GB/s with your library as a backend, so I don't think it's the metadata handling.

Perhaps I can write a few small benchmarks to share.

PS: For the 990 Pro, you need to make sure you are on a semi-modern motherboard with a PCIe Gen 4 slot. The laptop I have on me is too old to get past Gen 3 speeds.

@hmaarrfk (Author)

I'm happy to do a video call too if you want to discuss things in real time.

@cgohlke (Owner) commented Aug 13, 2024

I tested a ctypes implementation writing a sector-aligned buffer with Windows' low-level WriteFile function and various options (FILE_FLAG_WRITE_THROUGH, FILE_FLAG_NO_BUFFERING) and did not get anything conclusively better. I assume it's the encryption or some other layer interfering. Unfortunately, I do not have any other disk in my system fast enough to test this.

Re 990 Pro: it's on PCIe Gen 4, and the Samsung software measures 6.6 GB/s sequential write speed (maybe it tests on unencrypted over-provisioning space?).

@hmaarrfk (Author)

I think there are two things that might be happening:

  1. For unaligned writes, a temporary buffer may be fast enough on modern hardware to let you reach 2 GB/s. The only thing this wastes is DRAM bandwidth or CPU cache, but that won't be measurable on single-threaded workloads.
  2. I think that 2 GB/s is about the maximum I measure on Linux; beyond that I must go multi-threaded. My current strategy is to use multiple files (4) with Python thread pools, which lets me saturate PCIe Gen 3 x4 drives (see the sketch below).
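
A sketch of that multi-file strategy (frame counts and file names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy
import tifffile

frames = numpy.zeros((64, 512, 512), dtype=numpy.uint16)  # illustrative
NFILES = 4  # one writer thread per output file


def write_shard(i):
    # Each thread writes an interleaved subset of frames to its own file.
    tifffile.imwrite(f'shard{i}.tif', frames[i::NFILES])


with ThreadPoolExecutor(max_workers=NFILES) as pool:
    list(pool.map(write_shard, range(NFILES)))
```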

I think with my testing on Linux and yours on Windows we may be able to get some serious performance out of TIFF. As a start, 2 GB/s is nothing to sneeze at!

I'll try to create a benchmarking project. Thrashing my SSDs isn't my favorite activity, so I've dragged my feet on this for a while.
