
Zarr sink Improvements #1713

Open · wants to merge 4 commits into master
Conversation

@annehaley (Collaborator) commented Nov 1, 2024

Resolves #1674. Resolves #1698.

@annehaley marked this pull request as ready for review November 5, 2024 18:29
@manthey (Member) commented Nov 7, 2024

I have a multiprocess code path that behaves poorly:

In the main process, create the sink and add a tile.
Then create a subprocess. The subprocess doesn't know that there are already tiles present, so its new_axes list is non-empty, and something uses a lot of memory as a tile is added.

I think the solution is that if new_axes (https://github.com/girder/large_image/pull/1713/files#diff-c79d7356628b5b310aaeb154520b413576727848e175b84f16af2b0eaadb98deR735) is not empty, we set the updateMetadata flag.

My test case was to use the examples/algorithm_progression.py example with --multiprocessing and to hack in a sink.addTile call with a 1-pixel record at the maximum sweep locations. That is,

diff --git a/examples/algorithm_progression.py b/examples/algorithm_progression.py
index 52271cfb..2a70c0dc 100755
--- a/examples/algorithm_progression.py
+++ b/examples/algorithm_progression.py
@@ -42,7 +42,7 @@ class SweepAlgorithm:
 
         self.combos = list(itertools.product(*[p['range'] for p in input_params.values()]))
 
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         msg = 'Not implemented'
         raise Exception(msg)
 
@@ -118,7 +118,10 @@ class SweepAlgorithm:
 
     def run(self):
         starttime = time.time()
-        sink = self.getOverallSink()
+        source = large_image.open(self.input_filename)
+        maxValues = {'x': source.sizeX, 'y': source.sizeY, 's': source.metadata['bandCount']}
+        maxValues.update({p['axis']: len(p['range']) for p in self.param_order.values()})
+        sink = self.getOverallSink(maxValues)
 
         print(f'Beginning {len(self.combos)} runs on {self.max_workers} workers...')
         num_done = 0
@@ -149,7 +152,7 @@ class SweepAlgorithm:
 
 
 class SweepAlgorithmMulti(SweepAlgorithm):
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         os.makedirs(os.path.splitext(self.output_filename)[0], exist_ok=True)
         algorithm_name = self.algorithm.__name__.replace('_', ' ').title()
         self.yaml_dict = {
@@ -270,10 +273,14 @@ class SweepAlgorithmMultiZarr(SweepAlgorithmMulti):
 
 
 class SweepAlgorithmZarr(SweepAlgorithm):
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         import large_image_source_zarr
 
-        return large_image_source_zarr.new()
+        sink = large_image_source_zarr.new()
+        if maxValues:
+            sink.addTile(np.zeros((1, 1, maxValues.get('s', 1))),
+                         **{k: v - 1 for k, v in maxValues.items() if k != 's'})
+        return sink
 
     def writeOverallSink(self, sink):
         sink.write(self.output_filename, lossy=self.lossy)

and then run python examples/algorithm_progression.py ppc --param=hue_value,hue,0,1,100,open --param=hue_width,width,0.10,0.25,4 -w 36 --multiprocessing build/tox/externaldata/sample_Easy1.png /tmp/sweep.zarr.zip
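The trick in the patch above is that writing a 1×1 tile at position v - 1 along each non-sample axis forces the sink's backing array out to its final extent before any worker forks. A toy sketch of that shape arithmetic (pure Python, independent of large_image; `seeded_extents` is a hypothetical helper):

```python
def seeded_extents(max_values):
    """Illustrative only: the array extents a sink reaches after a 1x1 seed
    tile is written at the maximum coordinate of every non-sample axis,
    mirroring the hacked-in sink.addTile call above."""
    # A 1-pixel tile placed at coordinate v - 1 forces the axis to length v.
    positions = {k: v - 1 for k, v in max_values.items() if k != 's'}
    extents = {k: pos + 1 for k, pos in positions.items()}
    # The sample axis length comes from the seed tile's last dimension.
    extents['s'] = max_values.get('s', 1)
    return extents
```

With the seed in place, every forked worker inherits a store that already has its final shape, so no per-tile reallocation is needed.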

@manthey (Member) commented Nov 7, 2024

Hmm... If we did that, then maybe we'd also need to call self._validateZarr() in the _initNew method if we didn't create the file, but then things fail if we aren't setting data before doing the multiprocessing fork.

@annehaley (Collaborator, Author) commented Dec 2, 2024

I took some time to observe the behavior of your example use case under multiple Python versions and to plot the output from memory-profiler to compare them. The resulting plot is shown below, with timestep (sampled every 0.1 seconds) along X and memory usage along Y. The longest blue line is the main process; all other lines are subprocesses.

For all of these Python versions, the only spike in memory I experienced was in the main process after the multiprocessing stage, during the conversion step of the write function. Interestingly, 3.11 and 3.12 have an extra subprocess that hangs around and occupies a negligible but non-zero amount of memory after the multiprocessing stage (shown as a flat orange line).

[image: memory-profiler plot comparing Python versions]

In my experience, 3.10 performed the best and 3.11 the worst. Am I remembering correctly that you said you tried this with 3.11? If my 3.11 graph does not reflect what you experienced, can you send me a full description of your environment so I can do my best to replicate it? Otherwise, I'm not sure we can do much to further improve memory usage; this may just be a matter of Python and other library versions.

EDIT: Here's the command I ran in each environment:

mprof run --multiprocess examples/algorithm_progression.py ppc --param=hue_value,hue,0,1,100,open --param=hue_width,width,0.10,0.25,4 --multiprocessing  /home/anne/data/large_image/Easy1.png /home/anne/data/large_image/generated/sweep.zarr.zip
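For anyone reproducing this comparison, the per-process peaks can be pulled straight out of the .dat file that mprof writes. A small sketch that assumes memory_profiler's line format (`MEM <MiB> <ts>` for the main process, `CHLD<i> <MiB> <ts>` for children under --multiprocess — verify this against your memory_profiler version):

```python
def peak_memory(dat_text):
    """Peak resident memory (MiB) per process from mprof .dat contents.

    Assumes memory_profiler's output format: 'MEM <MiB> <ts>' lines for the
    main process and 'CHLD<i> <MiB> <ts>' lines for child processes; other
    lines (e.g. 'CMDLINE ...') are ignored.
    """
    peaks = {}
    for line in dat_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and (parts[0] == 'MEM' or parts[0].startswith('CHLD')):
            key = 'main' if parts[0] == 'MEM' else parts[0]
            peaks[key] = max(peaks.get(key, 0.0), float(parts[1]))
    return peaks
```

This gives a quick numeric summary (e.g. "main peaked at 42 MiB, CHLD0 at 9 MiB") to compare across Python versions without eyeballing the plot.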

Successfully merging this pull request may close these issues.

- Set frame values in multiple processes
- Modify Zarr sink addTile to allow creation of new axes