Simplify .sbx loading & improve space efficiency; explicitly close memmapped files #1329
Conversation
Looks good; noted two concerns, but they're not blockers, just things I was thinking about. I think this is a better fix to the issues I bumped into when trying to get the package out (conda-forge runs our tests during the package release process).
I haven't played with tracemalloc before; feeling that I should.
As far as I'm concerned this is ready to merge; is there anything else you want to do on your end or should I merge it now?
tracemalloc.start()
data_2d_sliced = sbx_utils.sbxread(file_2d, subindices=(slice(None), slice(None, None, 2)))
curr_mem, peak_mem = tracemalloc.get_traced_memory()
assert peak_mem / curr_mem < 1.1, 'Too much memory allocated when loading'
I hope this won't lead to issues on test systems without a lot of RAM
I think it should be fine (relative to other tests) since the same entire file is loaded elsewhere? Or did you have a different concern than just running out of RAM?
I'm worried about this check triggering in CI in a way that may not be that helpful, but I'm not deeply worried; we can adapt later if we start actually seeing it in circumstances where the test becomes a burden.
caiman/utils/sbx_utils.py (outdated)
@@ -131,6 +131,7 @@ def sbx_to_tif(filename: str, fileout: Optional[str] = None, subindices: Optiona
     # Open tif as a memmap and copy to it
     memmap_tif = tifffile.memmap(fileout, shape=save_shape, dtype='uint16', photometric='MINISBLACK')
     _sbxread_helper(filename, subindices=subindices, channel=channel, out=memmap_tif, plane=plane, chunk_size=chunk_size)
+    memmap_tif._mmap.close()
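For context, here is a minimal sketch of the write pattern in the hunk above, with a made-up output name and shape rather than the real sbx_to_tif arguments; the private `._mmap.close()` call on the added line is what the comments below discuss.

```python
import tifffile

# Sketch only: file name and shape are hypothetical.
# Given shape/dtype for a file that doesn't exist yet, tifffile.memmap creates
# a memory-mappable TIFF and returns an np.memmap over its image data.
out = tifffile.memmap('example_out.tif', shape=(100, 512, 796), dtype='uint16',
                      photometric='MINISBLACK')
out[:] = 0      # stand-in for _sbxread_helper filling the memmap with frames
out.flush()     # write dirty pages back to the file
del out         # drop the np.memmap reference; the diff instead closes the
                # underlying mapping explicitly via the private ._mmap
```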
Hoping we're not setting ourselves up for problems down the line by using a private API
Yeah, agreed. I avoided it for the sbx file memmap by calling np.memmap with a file object instead of a path, and then closing the file when finished, but tifffile.memmap doesn't allow that.
Actually, it looks like the code for tifffile.memmap in the case where the file doesn't already exist is pretty simple. Let me try just replicating that instead, but with the file object trick.
So I realized that closing a file object passed to np.memmap doesn't actually help close the mmap stored by the np.memmap, as these objects have independent lifetimes. I think it just seemed like this was working because I was doing this for a read-only memmap and not trying to delete the file afterwards. The only way to actually close the mmap without accessing a private property, as long as it isn't shared between multiple np.memmaps, seems to be just a `del`.
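A minimal sketch of that lifetime issue, assuming a made-up file name, dtype, and shape (this is not the actual sbx_utils code):

```python
import numpy as np

# Closing the file object passed to np.memmap does not close the mapping;
# the np.memmap holds its own mmap internally.
f = open('example.sbx', 'rb')
mm = np.memmap(f, dtype='uint16', mode='r', shape=(1000, 512, 796))
f.close()                  # the mapping created by np.memmap is still open

frames = np.array(mm[::2]) # copy out what we need (a real copy, not a view)
del mm                     # dropping the last reference releases the mapping
                           # without touching the private ._mmap attribute
```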
I hope we're reliably getting copies of what was in the memmapped file rather than views into it; if there's any code path that leaves it as a view, then the `del` won't remove the last reference to the memmap and the file will remain open. Otherwise this seems like a sensible approach. In theory, what the Windows-style file locking is doing is probably more correct/safe, but it's a pain in the butt to deal with; it can avoid some pretty subtle bugs that could come up.
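For illustration, one way to check the copy-vs-view concern is `np.shares_memory` (again with a hypothetical file name and shape, not code from this PR):

```python
import numpy as np

# Basic slicing of a memmap returns a view that keeps the mapping alive;
# wrapping it in np.array() forces an independent in-memory copy.
mm = np.memmap('example.sbx', dtype='uint16', mode='r', shape=(1000, 512, 796))
view = mm[:, :256]                    # still backed by the file
copy = np.array(mm[:, :256])          # independent buffer
assert np.shares_memory(view, mm)
assert not np.shares_memory(copy, mm)
del view, mm                          # now nothing references the mapping
```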
Things are probably in a good place now; if you're happy with it and it passes tests, I can merge; let me know when you're ready. I'll probably follow this up with a diff that replaces use of the "N" variable with something else; single-letter capitalized variables are probably not good names (although if you want to slip that adjustment into this diff before you give me the green light to merge, that'd be fine).
This is a good point. Actually, it's just as important if we stuck with the […]. Looking at how the memmaps are used in […]. Also got rid of those […].
Once it passes CI I'll merge; on the off chance we run into hiccups the next time conda-forge does builds for a package release, I may need to either reach out to you or make last-minute changes (if things are broken at that point); if I go with the latter, I'll poke you.
Description
This PR addresses the problem that, even when loading just a portion of an .sbx file (anything other than a subset of frames), such as a single plane, the whole file would be loaded into memory first and then indexed. This could cause an out-of-memory error if the file is large enough, or at least slow things down. This is less of an issue when loading or converting in chunks, but it can still slow down the operation.

The solution here is to use np.memmap instead of np.fromfile and index into that, so only the desired parts of the file are loaded. I also carefully checked the operations on the loaded data to eliminate any extraneous copies (particularly in `np.invert`). Doing this also means there's no disadvantage to using the same strategy when loading an arbitrary set of frames rather than a slice, so it simplifies the code. I confirmed using `tracemalloc` that the peak memory usage is only marginally higher than that needed to store the loaded part of the data, and added a test for this.

I also had bug fixes for Windows and older tifffile versions as part of this branch, but I realized when preparing the PR that similar changes had already been merged to dev. I kept the tifffile code from dev the same after checking that it works fine for version 2023.4.12. For the other issue, I saw that the fix was basically a workaround that avoids closing .tif files or reusing the same filename in the test suite. I had determined that the actual issue was not closing the .tif file handle within the memmap after finishing writing to it, and after fixing this, the original tests work fine on Windows, so I kept my version. However, it might make sense to have a little extra leniency for file locking even if this version should theoretically work.
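As a rough sketch of the change in loading strategy (the file name, dtype, shape, and plane indexing below are hypothetical, not the actual sbx_utils implementation):

```python
import numpy as np

shape = (2000, 2, 512, 796)   # hypothetical (frames, planes, rows, cols) layout

# Before: np.fromfile reads the entire file into RAM before any indexing.
whole = np.fromfile('example.sbx', dtype='uint16').reshape(shape)
plane0_old = whole[:, 0].copy()

# After: np.memmap only pages in the parts that are actually accessed.
mm = np.memmap('example.sbx', dtype='uint16', mode='r', shape=shape)
plane0_new = np.array(mm[:, 0])          # copies just the requested plane
np.invert(plane0_new, out=plane0_new)    # in-place, avoiding an extra copy
del mm                                   # release the mapping when done
```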
Type of change
Has your PR been tested?

`caimanmanager test` passes on Windows (Python 3.9) and Linux (Python 3.11).