Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nbdev_test parallel() process re-use results in hard to debug behaviour #1287

Open
xl0 opened this issue Feb 6, 2023 · 10 comments
Open

nbdev_test parallel() process re-use results in hard to debug behaviour #1287

xl0 opened this issue Feb 6, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@xl0
Copy link
Contributor

xl0 commented Feb 6, 2023

Run nbdev_test --n_workers 1 --file_re "0[07].*" --do_print in fastai repo (after installing the dev dependencies)
Error I'm getting:

xl0@vespa:~/work/code/fastai$ nbdev_test --n_workers 1  --file_re "0[70].*" --do_print
Starting /ssd/xl0/work/code/fastai/nbs/00_torch_core.ipynb
- Completed /ssd/xl0/work/code/fastai/nbs/00_torch_core.ipynb
Starting /ssd/xl0/work/code/fastai/nbs/07_vision.core.ipynb
AttributeError in /ssd/xl0/work/code/fastai/nbs/07_vision.core.ipynb:
===========================================================================

While Executing Cell #41:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 1
----> 1 test_fig_exists(ax)

File ~/mambaforge/envs/test/lib/python3.9/site-packages/fastcore/test.py:101, in test_fig_exists(ax)
     99 def test_fig_exists(ax):
    100     "Test there is a figure displayed in `ax`"
--> 101     assert ax and len(ax.figure.canvas.tostring_argb())

File ~/mambaforge/envs/test/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:438, in FigureCanvasAgg.tostring_argb(self)
    431 def tostring_argb(self):
    432     """
    433     Get the image as ARGB `bytes`.
    434 
    435     `draw` must be called at least once before this function will work and
    436     to update the renderer for any subsequent changes to the Figure.
    437     """
--> 438     return self.renderer.tostring_argb()

AttributeError: 'FigureCanvasAgg' object has no attribute 'renderer'

- Completed /ssd/xl0/work/code/fastai/nbs/07_vision.core.ipynb

nbdev Tests Failed On The Following Notebooks:
==================================================
        07_vision.core.ipynb
(test) xl0@vespa:~/work/code/fastai$ 

--file_re "0[07].*" meaning, test notebooks 00 and 07.
nbdev_test will use fastcore ProcessPoolExecutor with 1 worker, which means the worker process will be re-used between notebooks.

I don't completely understand the source of the error - both notebooks 00 and 07 make use of test_fig_exists(ax), but only one of them fails, but pretty certain it comes from some internal state being shared between notebooks by the re-used process: No failure occurs if notebooks are run separately, or if --n_workers 2 and each notebook gets a fresh worker process.

While the issue could possibly be fixed in fastai, I believe that the behaviour is itself a bug - a test framework should not leak state between unrelated notebooks. This behaviour results in unexpected and frustrating failures. I earlier ran into a similar issue using JAX with nbdev, when two notebooks would initialize JAX with different parameters, but since the process is being re-used, the second initialization in the same process is ignored.

@xl0 xl0 added the bug Something isn't working label Feb 6, 2023
@deven367
Copy link
Contributor

deven367 commented Feb 6, 2023

@xl0 you need to use nbdev_test --n_workers 0 to run it on serially on a single thread.

@xl0
Copy link
Contributor Author

xl0 commented Feb 6, 2023

@deven367, The only difference between 0 and 1 workers is, with 0 the notebooks under test will be executed in the context of the nbdev_test process. With n_workers=1, one ProcessPoolExecutor worker process will be created, and the notebooks will be executed in this process one by one.

The resulting failure is the same in both cases.

@seeM
Copy link
Contributor

seeM commented Feb 12, 2023

I agree that this is hard to debug and ideally we wouldn't leak state.

It's been a while but IIRC these are two possible fixes:

@xl0
Copy link
Contributor Author

xl0 commented Feb 12, 2023

I will take a look at number 2 once we've sorted out the mess with excessive diffs.

@xl0
Copy link
Contributor Author

xl0 commented Feb 12, 2023

I'm not super familiar with parallel programming, especially in Python, so here's a simple solution that I can think of.

A partial function f is passed to parallel(), and it will spawn/fork a simple multiprocessing.Process, pass the args to it, and wait for its completion and return the result.

Since in this case f does not actually perform any computation, we can use a thread pool instead of a process pool.

Does this make sense @seeM , @jph00 ?

@jph00
Copy link
Contributor

jph00 commented Feb 13, 2023

IMO the suggestion to use maxtasksperchild is probably the right fix.

@xl0
Copy link
Contributor Author

xl0 commented Feb 14, 2023

So this was a bit of a rabbit hole.

I introduced class ThreadPool(multiprocessing.pool.Pool) that more or less mirrors Fastcore ThreadPoolExecutor.
Problem is, pool.Pool creates daemon processes, and those can't have children on their own, and DataLoaders don't work. And there is no straightforward way to override this either.

The solution is https://stackoverflow.com/questions/6974695/python-process-pool-non-daemonic : You have to subclass the appropriate context class (the thing you get from get_context(...) ) and feed it with a subclassed Process class 🤦

xl0/fastcore@6035197

With parallel() supporting ProcessPool, the change to nbdev is trivial:
xl0@93a7478

@jph00 , @seeM , Does this look OK to you?

I ran nbdev_test for fastcore, nbdev, and fastai, and all 3 now pass without issue.

One extra thought. If someone is trying to debug their failed tests and passes --n_workers=0, nbdev_test will process all notebooks in the parent process and will revert to the buggy behaviour. Should we disallow using 0 workers for more than 1 target notebook?

@jph00
Copy link
Contributor

jph00 commented Feb 15, 2023

I think this looks reasonable - it's more complicated than I'd hoped, but it sounds like there might not be a simpler option? Although I see something on that SO page saying perhaps nested pools works on on py38+?

I don't think we should disallow using 0 workers for more than 1 target notebook -- people should be able to choose to do what they want.

@xl0
Copy link
Contributor Author

xl0 commented Feb 15, 2023

@jph00 , I think the SO reply refers to concurrent.futures.ProcessPoolExecutor, which does not support max tasks per child.

Please have a look at AnswerDotAI/fastcore#515 and #1296

@drscotthawley
Copy link
Contributor

drscotthawley commented Jan 5, 2024

Hey folks, what's the status on this? Not sure what/how to fix this, despite reading above.

I'm seeing both nbdev_prepare and nbdev_test for fresh installs of nbdev v 2.3.13 & 2.3.14 (from GitHub source) crashing on Apple Silicon (homebrew, Python 3.11.6, brand-new Python venv, package installs via pip):

$ nbdev_test
/AppleInternal/Library/BuildRoots/495c257e-668e-11ee-93ce-926038f30c31/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:77: failed assertion `Failed to stich function with error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)" UserInfo={NSLocalizedDescription=Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)}'
Traceback (most recent call last):
  File "/Users/shawley/envs/aeiou/bin/nbdev_test", line 8, in <module>
    sys.exit(nbdev_test())
             ^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/nbdev/test.py", line 90, in nbdev_test
    results = parallel(test_nb, files, skip_flags=skip_flags, force_flags=force_flags, n_workers=n_workers,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/parallel.py", line 117, in parallel
    return L(r)
           ^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/foundation.py", line 98, in __call__
    return super().__call__(x, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/foundation.py", line 106, in __init__
    items = listify(items, *rest, use_list=use_list, match=match)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/basics.py", line 66, in listify
    elif is_iter(o): res = list(o)
                           ^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 620, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Note: nbdev_export works fine. Only seeing the crash with nbdev_test & nbdev_prepare.

Issue is only on Mac, no problems with Linux. So for now, I'm switching to Linux.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants