
Simplify .sbx loading & improve space efficiency; explicitly close memmapped files #1329

Merged 7 commits into flatironinstitute:dev on Apr 12, 2024

Conversation

@ethanbb (Contributor) commented Apr 11, 2024

Description

This PR addresses the problem that when loading just a portion of an .sbx file other than a subset of frames (such as a single plane), the whole file would be loaded into memory first and then indexed. This could cause an out-of-memory error if the file is large enough, or at least slow things down. This is less of an issue when loading or converting in chunks, but it could still slow down the operation.

The solution here is to use np.memmap instead of np.fromfile and index that, so only the desired parts of the file are loaded. Also, I carefully checked the operations on the loaded data to eliminate any extraneous copies (particularly in np.invert). Doing this also means that there's no disadvantage to using the same strategy when loading an arbitrary set of frames rather than a slice, so it simplifies the code. I confirmed using tracemalloc that the peak memory usage is only marginally higher than that needed to store the loaded part of the data and added a test for this.
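For illustration, here is a minimal sketch of the memmap-based loading strategy described above (the names data_file, header_size, n_frames, n_channels, height, width, and frame_inds are hypothetical placeholders, not the actual sbx_utils internals):

import numpy as np

# np.fromfile would read the whole file; np.memmap only pages in what we index
sbx_mmap = np.memmap(data_file, dtype='uint16', mode='r', offset=header_size,
                     shape=(n_frames, n_channels, height, width))
subset = sbx_mmap[frame_inds]   # advanced indexing copies just these frames
np.invert(subset, out=subset)   # in-place invert avoids an extra temporary array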

I also had bug fixes for Windows and older tifffile versions as part of this branch, but I realized when preparing the PR that similar changes had already been merged to dev. I kept the tifffile code from dev unchanged after checking that it works fine for version 2023.4.12. For the other issue, I saw that the fix on dev was essentially a workaround that avoids closing .tif files or reusing the same filename in the test suite. I had determined that the actual issue was that the .tif file handle inside the memmap was never closed after writing finished, and after fixing this, the original tests work fine on Windows, so I kept my version. However, it might make sense to keep a little extra leniency for file locking even if this version should theoretically work.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Has your PR been tested?

caimanmanager test passes on Windows (Python 3.9) and Linux (Python 3.11).

@pgunn (Member) left a comment


Looks good; I noted two concerns, but they're not blockers, just things I was thinking about. I think this is a better fix for the issues I bumped into when trying to get the package out (conda-forge runs our tests during the package release process).

I haven't played with tracemalloc before; feeling that I should.

As far as I'm concerned this is ready to merge; is there anything else you want to do on your end or should I merge it now?

import tracemalloc

tracemalloc.start()
data_2d_sliced = sbx_utils.sbxread(file_2d, subindices=(slice(None), slice(None, None, 2)))
curr_mem, peak_mem = tracemalloc.get_traced_memory()
# peak usage during the load should stay within 10% of the memory held by the result
assert peak_mem / curr_mem < 1.1, 'Too much memory allocated when loading'
@pgunn (Member):

I hope this won't lead to issues on test systems without a lot of RAM

@ethanbb (Contributor Author):

I think it should be fine (relative to other tests), since the same file is loaded in its entirety elsewhere? Or did you have a different concern than just running out of RAM?

@pgunn (Member) commented Apr 11, 2024

I'm worried about a positive (the assertion firing) that may not be that helpful when it's running in CI, but I'm not deeply worried; we can adapt later if we start actually seeing this in circumstances where the test becomes a burden.

@@ -131,6 +131,7 @@ def sbx_to_tif(filename: str, fileout: Optional[str] = None, subindices: Optiona
     # Open tif as a memmap and copy to it
     memmap_tif = tifffile.memmap(fileout, shape=save_shape, dtype='uint16', photometric='MINISBLACK')
     _sbxread_helper(filename, subindices=subindices, channel=channel, out=memmap_tif, plane=plane, chunk_size=chunk_size)
+    memmap_tif._mmap.close()
@pgunn (Member):

Hoping we're not setting ourselves up for problems down the line by using a private API

@ethanbb (Contributor Author):

Yeah, agreed. I avoided it for the sbx file memmap by calling np.memmap with a file object instead of a path, and then closing the file when finished, but tifffile.memmap doesn't allow that.

Actually, it looks like the code for tifffile.memmap in the case where the file doesn't already exist is pretty simple. Let me try just replicating that instead, but with the file object trick.
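For reference, a minimal sketch of the file-object trick mentioned above, with hypothetical names (data_file, header_size, shape, frame_inds); as the next comment explains, closing the file object turns out not to close the mapping itself:

import numpy as np

f = open(data_file, 'rb')
sbx_mmap = np.memmap(f, dtype='uint16', mode='r', offset=header_size, shape=shape)
data = sbx_mmap[frame_inds].copy()  # copy the needed frames out of the map
f.close()  # closes this file object, but the underlying mmap lives on independently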

@ethanbb (Contributor Author) commented Apr 12, 2024

So I realized that closing a file object passed to np.memmap doesn't actually help close the mmap stored by the np.memmap, as these objects have independent lifetimes. I think it just seemed like this was working because I was doing this for a read-only memmap and not trying to delete the file afterwards.

The only way to actually close the mmap without accessing a private property, as long as it isn't shared between multiple np.memmaps, seems to be just a del statement. I switched to this and it seems to work.
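A minimal sketch of the del-based cleanup, again with hypothetical names; this relies on sbx_mmap holding the only reference to the underlying mmap:

import numpy as np

sbx_mmap = np.memmap(data_file, dtype='uint16', mode='r', offset=header_size,
                     shape=(n_frames, height, width))
data = sbx_mmap[np.ix_(frame_inds, row_inds, col_inds)]  # advanced indexing returns a copy
del sbx_mmap  # drops the last reference, releasing the mmap and its file handle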

@pgunn (Member) commented Apr 12, 2024

I hope we're reliably getting copies from what was in the memmapped file rather than views into it; if there's any code path that has it remain as a view then the del won't remove the last reference to the memmap and the file will remain open.

Otherwise it seems a sensible approach. In theory, what the Windows-style file locking is doing is probably more correct/safe, but it's a pain in the butt to deal with; it can avoid some pretty subtle bugs that could come up.

@pgunn (Member) commented Apr 12, 2024

Things are probably in a good place now; if you're happy with it and it passes tests, I can merge; let me know when you're ready

I'll probably follow this up with a diff that will replace use of the "N" variable with something else; single letter caps variables are probably not good names (although if you want to slip that adjustment into this diff before you give me the green light to merge, that'd be fine)

@ethanbb (Contributor Author) commented Apr 12, 2024

> I hope we're reliably getting copies from what was in the memmapped file rather than views into it; if there's any code path that has it remain as a view then the del won't remove the last reference to the memmap and the file will remain open.

This is a good point. Actually, it would be just as important if we had stuck with the _mmap.close() strategy, as calling close() while _mmap has more than one reference can crash Python.

Looking at how the memmaps are used in _sbxread_helper: out should not present a problem, since it's only written to, not read. sbx_mmap is currently definitely only copied, never viewed, since it's only ever indexed using advanced indices produced by np.ix_. But I added a comment to be careful not to change that in the future.
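As a quick illustration of the view/copy distinction (not code from the PR): basic slicing of a NumPy array returns a view that shares memory, while advanced indexing such as np.ix_ returns an independent copy:

import numpy as np

a = np.arange(12).reshape(3, 4)
view = a[0:2]                      # basic slice: a view sharing memory with a
copy = a[np.ix_([0, 1], [0, 2])]   # advanced indexing: an independent copy
assert np.shares_memory(view, a)
assert not np.shares_memory(copy, a)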

Also got rid of those Ns! Ready to merge now I believe.

@pgunn (Member) commented Apr 12, 2024

Once it passes CI I'll merge; on the off chance we run into hiccups the next time conda-forge does builds for a package release, I may need to either reach out to you or make last-minute changes (if things are broken then); if I go with the latter, I'll poke you.

@pgunn merged commit 69772e3 into flatironinstitute:dev on Apr 12, 2024
3 checks passed