
test-all #283

Merged: 9 commits merged into dask:master on Jul 15, 2020
Conversation

TomAugspurger (Member)

No description provided.

@TomAugspurger (Member, Author) commented Jun 30, 2020

@jcrist when running tests locally, I see log messages from the Go proxy like:

[I 2020-06-30 08:06:41.011 DaskGateway] Stopping cluster 4f2a02960b31401aa03ce699dcdcd98f...
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tls://127.0.0.1:49667
distributed.scheduler - INFO -   Scheduler at:     tls://127.0.0.1:49676
distributed.scheduler - INFO -   dashboard at:           127.0.0.1:49666
distributed.scheduler - INFO -     gateway at:           127.0.0.1:49677
distributed.scheduler - INFO - Scheduler closing all comms
[W 2020-06-30 08:06:41.511 Proxy] Unexpected failure fetching routing table, retrying in 1.0s: Get http://127.0.0.1:49659/api/v1/routes: dial tcp 127.0.0.1:49659: connect: connection refused

Are those expected as part of the cluster being shut down?

b9209a6 and fd8db22 are trying to clean up some strange behavior in __del__. Apparently you can get AttributeErrors in __del__: https://stackoverflow.com/a/18058854/1889400
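
Below is a minimal, hypothetical sketch of the kind of guard being described (the class and method names are illustrative, not the actual b9209a6/fd8db22 changes): during interpreter shutdown, or on a partially initialized object, attributes and module globals may already be gone, so __del__ has to tolerate AttributeError.

# Hypothetical sketch, not the actual dask-gateway code.
class ClusterHandle:
    def __init__(self, address):
        self.address = address
        self._closed = False

    def close(self):
        self._closed = True

    def __del__(self):
        try:
            if not self._closed:
                self.close()
        except AttributeError:
            # self._closed (or a module global used by close()) may no longer
            # exist while the interpreter is tearing things down.
            pass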

I'm able to reproduce the other class of failures locally, but haven't been able to debug it yet:

self = <asyncio.unix_events._UnixDefaultEventLoopPolicy object at 0x7f0bcf0cfc50>

    def get_event_loop(self):
        """Get the event loop.

        This may be None or an instance of EventLoop.
        """
        if (self._local._loop is None and
                not self._local._set_called and
                isinstance(threading.current_thread(), threading._MainThread)):
            self.set_event_loop(self.new_event_loop())
        if self._local._loop is None:
            raise RuntimeError('There is no current event loop in thread %r.'
>                              % threading.current_thread().name)
E           RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-8_0'.
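
For context, this matches asyncio's documented behavior: get_event_loop() only creates a loop implicitly on the main thread, so a call from a ThreadPoolExecutor worker raises unless a loop has been set for that thread first. A minimal standalone reproduction (not dask-gateway code):

import asyncio
from concurrent.futures import ThreadPoolExecutor

def get_loop_in_worker_thread():
    try:
        return asyncio.get_event_loop()
    except RuntimeError:
        # The usual workaround: create and register a loop for this thread.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        return loop

with ThreadPoolExecutor() as pool:
    # Without the except branch above, this raises
    # "RuntimeError: There is no current event loop in thread ...".
    print(pool.submit(get_loop_in_worker_thread).result())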

@TomAugspurger (Member, Author)

I think pytest-dev/pytest-asyncio#168 is the upstream issue tracking the RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-8_0' error. I've pinned to pytest-asyncio==0.12.0 for now.

Planning to merge when this passes. I'll open an issue to track unpinning pytest-asyncio.

@TomAugspurger (Member, Author)

Some tests are failing because of a change in distributed. I'm bisecting now, but have to step away for a while.

@TomAugspurger (Member, Author)

Somehow dask/distributed#3928 broke tests/test_db_backend.py::test_gateway_resume_clusters_after_shutdown, which really doesn't make sense...

@TomAugspurger (Member, Author)

Looking at the worker logs of the test:

Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/nanny.py", line 728, in _run
    worker = Worker(**worker_kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/worker.py", line 489, in __init__
    os.makedirs(local_directory, exist_ok=True)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''

So it's an issue with not being able to create the directory where the dask-worker process is started. Does anyone know why os.getcwd() would return an empty string?
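
For what it's worth, the FileNotFoundError above can be reproduced in isolation: os.makedirs raises it whenever it is handed an empty path string.

import os

try:
    os.makedirs("", exist_ok=True)
except FileNotFoundError as exc:
    print(exc)  # [Errno 2] No such file or directory: ''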

@scottyhq commented Jul 14, 2020

@jcrist @TomAugspurger could this default empty string be the issue? Currently we're not able to launch gateway workers running distributed 2.20 (see the linked issue).

@TomAugspurger (Member, Author)

Nice catch @scottyhq! Locally I see this:

dask-worker 192.168.7.20:8786 --local-directory=""
local directory:
making directory
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297,
in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/nanny.py", line 728, in _run
    worker = Worker(**worker_kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/worker.py", line 491, in __init__
    os.makedirs(local_directory)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
distributed.nanny - INFO - Worker process 91138 exited with status 1
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x1159e8b10>>, <Task finished coro=<Nanny._on_exit() done, defined at /Users/taugspurger/sandbox/distributed/distributed/nanny.py:440> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/Users/taugspurger/sandbox/distributed/distributed/nanny.py", line 443, in _on_exit
    await self.scheduler.unregister(address=self.worker_address)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 861, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 660, in send_recv
    raise exc.with_traceback(tb)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 513, in handle_comm
    result = await result
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.7.20:53886'
distributed.dask_worker - INFO - End worker

I think the easiest short-term solution is for distributed to interpret all falsey values (like "") the same as None, and look the directory up from the config / fall back to the cwd. I'll make a PR to distributed.
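
A rough sketch of that fallback (not the actual distributed implementation; the helper name is illustrative and the "temporary-directory" config key is assumed here):

import os
import dask

def resolve_local_directory(local_directory=None):
    # Treat "" and None the same: prefer an explicit value, then the dask
    # config, then the current working directory.
    if not local_directory:
        local_directory = dask.config.get("temporary-directory", default=None) or os.getcwd()
    os.makedirs(local_directory, exist_ok=True)
    return local_directory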

@TomAugspurger (Member, Author)

test_successful_cluster failed and I'm not sure why; it passes locally. I restarted that job in case it's a spurious failure.

@gforsyth (Contributor) left a comment

This isn't the first time there have been issues with pytest-asyncio, but I'm surprised that it broke like this again so soon after a major-ish release.

I'll defer to @jcrist on this, but this looks good -- working CI > broken CI.

@@ -574,7 +574,7 @@ async def start_worker(
     nthreads=1,
     memory_limit="auto",
     scheduler_address=None,
-    local_directory="",
+    local_directory=None,
Member:
Did distributed change how the local_directory parameter is handled?

Member:
Looks like this was dask/distributed#3441

Member (Author):
And for reference, dask/distributed#3964 is achieving the same goal (fall back to config / cwd), but by changing the implementation rather than the default. Dunno what the long-term plan is.

@jcrist (Member) commented Jul 15, 2020

LGTM, thanks @TomAugspurger!

@jcrist jcrist merged commit 7becded into dask:master Jul 15, 2020
@TomAugspurger TomAugspurger deleted the test-ci branch July 15, 2020 14:25