-
Notifications
You must be signed in to change notification settings - Fork 498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nbdev_test parallel() process re-use results in hard to debug behaviour #1287
Comments
@xl0 you need to use |
@deven367, The only difference between 0 and 1 workers is, with 0 the notebooks under test will be executed in the context of the nbdev_test process. With The resulting failure is the same in both cases. |
I agree that this is hard to debug and ideally we wouldn't leak state. It's been a while but IIRC these are two possible fixes:
|
I will take a look at number 2 once we've sorted out the mess with excessive diffs. |
I'm not super familiar with parallel programming, especially in Python, so here's a simple solution that I can think of. A partial function Since in this case |
IMO the suggestion to use maxtasksperchild is probably the right fix. |
So this was a bit of a rabbit hole. I introduced The solution is https://stackoverflow.com/questions/6974695/python-process-pool-non-daemonic : You have to subclass the appropriate context class (the thing you get from With parallel() supporting ProcessPool, the change to nbdev is trivial: @jph00 , @seeM , Does this look OK to you? I ran nbdev_test for fastcore, nbdev, and fastai, and all 3 now pass without issue. One extra thought. If someone is trying to debug their failed tests and passes |
I think this looks reasonable - it's more complicated than I'd hoped, but it sounds like there might not be a simpler option? Although I see something on that SO page saying perhaps nested pools works on on py38+? I don't think we should disallow using 0 workers for more than 1 target notebook -- people should be able to choose to do what they want. |
@jph00 , I think the SO reply refers to Please have a look at AnswerDotAI/fastcore#515 and #1296 |
Hey folks, what's the status on this? Not sure what/how to fix this, despite reading above. I'm seeing both $ nbdev_test
/AppleInternal/Library/BuildRoots/495c257e-668e-11ee-93ce-926038f30c31/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:77: failed assertion `Failed to stich function with error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)" UserInfo={NSLocalizedDescription=Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)}'
Traceback (most recent call last):
File "/Users/shawley/envs/aeiou/bin/nbdev_test", line 8, in <module>
sys.exit(nbdev_test())
^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/nbdev/test.py", line 90, in nbdev_test
results = parallel(test_nb, files, skip_flags=skip_flags, force_flags=force_flags, n_workers=n_workers,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/parallel.py", line 117, in parallel
return L(r)
^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/foundation.py", line 98, in __call__
return super().__call__(x, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/foundation.py", line 106, in __init__
items = listify(items, *rest, use_list=use_list, match=match)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/basics.py", line 66, in listify
elif is_iter(o): res = list(o)
^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 620, in _chain_from_iterable_of_lists
for element in iterable:
File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. Note: Issue is only on Mac, no problems with Linux. So for now, I'm switching to Linux. |
Run
nbdev_test --n_workers 1 --file_re "0[07].*" --do_print
infastai
repo (after installing the dev dependencies)Error I'm getting:
--file_re "0[07].*"
meaning, test notebooks 00 and 07.nbdev_test will use
fastcore
ProcessPoolExecutor
with 1 worker, which means the worker process will be re-used between notebooks.I don't completely understand the source of the error - both notebooks 00 and 07 make use of
test_fig_exists(ax)
, but only one of them fails, but pretty certain it comes from some internal state being shared between notebooks by the re-used process: No failure occurs if notebooks are run separately, or if--n_workers 2
and each notebook gets a fresh worker process.While the issue could possibly be fixed in
fastai
, I believe that the behaviour is itself a bug - a test framework should not leak state between unrelated notebooks. This behaviour results in unexpected and frustrating failures. I earlier ran into a similar issue using JAX with nbdev, when two notebooks would initialize JAX with different parameters, but since the process is being re-used, the second initialization in the same process is ignored.The text was updated successfully, but these errors were encountered: