
test-all #283

Merged: 9 commits merged into dask:master on Jul 15, 2020
Conversation

TomAugspurger (Member)

No description provided.

@TomAugspurger (Member, Author) commented Jun 30, 2020

@jcrist when running tests locally, I see log messages from the Go proxy like:

[I 2020-06-30 08:06:41.011 DaskGateway] Stopping cluster 4f2a02960b31401aa03ce699dcdcd98f...
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tls://127.0.0.1:49667
distributed.scheduler - INFO -   Scheduler at:     tls://127.0.0.1:49676
distributed.scheduler - INFO -   dashboard at:           127.0.0.1:49666
distributed.scheduler - INFO -     gateway at:           127.0.0.1:49677
distributed.scheduler - INFO - Scheduler closing all comms
[W 2020-06-30 08:06:41.511 Proxy] Unexpected failure fetching routing table, retrying in 1.0s: Get http://127.0.0.1:49659/api/v1/routes: dial tcp 127.0.0.1:49659: connect: connection refused

Are those expected as part of the cluster being shut down?

b9209a6 and fd8db22 are trying to clean up some strange behavior in __del__. Apparently you can get AttributeErrors in __del__: https://stackoverflow.com/a/18058854/1889400
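
Below is a minimal, hypothetical sketch of the kind of guard being described (the class and method names are illustrative, not the actual b9209a6/fd8db22 changes): during interpreter shutdown, or on a partially initialized object, attributes and module globals may already be gone, so __del__ has to tolerate AttributeError.

# Hypothetical sketch, not the actual dask-gateway code.
class ClusterHandle:
    def __init__(self, address):
        self.address = address
        self._closed = False

    def close(self):
        self._closed = True

    def __del__(self):
        try:
            if not self._closed:
                self.close()
        except AttributeError:
            # self._closed (or a module global used by close()) may no longer
            # exist while the interpreter is tearing things down.
            pass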

I'm able to reproduce the other class of failures locally, but haven't been able to debug it yet:

self = <asyncio.unix_events._UnixDefaultEventLoopPolicy object at 0x7f0bcf0cfc50>

    def get_event_loop(self):
        """Get the event loop.

        This may be None or an instance of EventLoop.
        """
        if (self._local._loop is None and
                not self._local._set_called and
                isinstance(threading.current_thread(), threading._MainThread)):
            self.set_event_loop(self.new_event_loop())
        if self._local._loop is None:
            raise RuntimeError('There is no current event loop in thread %r.'
>                              % threading.current_thread().name)
E           RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-8_0'.
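
For context, this matches asyncio's documented behavior: get_event_loop() only creates a loop implicitly on the main thread, so a call from a ThreadPoolExecutor worker raises unless a loop has been set for that thread first. A minimal standalone reproduction (not dask-gateway code):

import asyncio
from concurrent.futures import ThreadPoolExecutor

def get_loop_in_worker_thread():
    try:
        return asyncio.get_event_loop()
    except RuntimeError:
        # The usual workaround: create and register a loop for this thread.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        return loop

with ThreadPoolExecutor() as pool:
    # Without the except branch above, this raises
    # "RuntimeError: There is no current event loop in thread ...".
    print(pool.submit(get_loop_in_worker_thread).result())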

@TomAugspurger (Member, Author)

I think pytest-dev/pytest-asyncio#168 is the upstream issue tracking the RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-8_0' error. I've pinned to pytest-asyncio==0.12.0 for now.

Planning to merge when this passes. I'll open an issue to track unpinning pytest-asyncio.

@TomAugspurger (Member, Author)

Some tests are failing because of a change in distributed. I'm bisecting now, but have to step away for a while.

@TomAugspurger (Member, Author)

Somehow dask/distributed#3928 broke tests/test_db_backend.py::test_gateway_resume_clusters_after_shutdown, which really doesn't make sense...

@TomAugspurger (Member, Author)

Looking at the worker logs of the test:

Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/nanny.py", line 728, in _run
    worker = Worker(**worker_kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/worker.py", line 489, in __init__
    os.makedirs(local_directory, exist_ok=True)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''

So it's an issue with not being able to create the directory where the dask-worker process is started. Does anyone know why os.getcwd() would return an empty string?
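
For what it's worth, the FileNotFoundError above can be reproduced in isolation: os.makedirs raises it whenever it is handed an empty path string.

import os

try:
    os.makedirs("", exist_ok=True)
except FileNotFoundError as exc:
    print(exc)  # [Errno 2] No such file or directory: ''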

@scottyhq commented Jul 14, 2020

@jcrist @TomAugspurger could this default empty string be the issue? Currently we're not able to launch gateway workers running distributed 2.20 (see the linked issue).

@TomAugspurger (Member, Author)

Nice catch @scottyhq! Locally I see this:

dask-worker 192.168.7.20:8786 --local-directory=""
local directory:
making directory
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297,
in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/nanny.py", line 728, in _run
    worker = Worker(**worker_kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/worker.py", line 491, in __init__
    os.makedirs(local_directory)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
distributed.nanny - INFO - Worker process 91138 exited with status 1
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x1159e8b10>>, <Task finished coro=<Nanny._on_exit() done, defined at /Users/taugspurger/sandbox/distributed/distributed/nanny.py:440> exception=TypeError('addresses should be strings or tuples, got None')>)
Traceback (most recent call last):
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/Users/taugspurger/sandbox/distributed/distributed/nanny.py", line 443, in _on_exit
    await self.scheduler.unregister(address=self.worker_address)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 861, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 660, in send_recv
    raise exc.with_traceback(tb)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 513, in handle_comm
    result = await result
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 2208, in remove_worker
    address = self.coerce_address(address)
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 4946, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.7.20:53886'
distributed.dask_worker - INFO - End worker

I think the easiest short-term solution is for distributed to interpret all falsey values (like "") the same as None, and look the directory up from the config / fall back to the cwd. I'll make a PR to distributed.
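
A rough sketch of that fallback (not the actual distributed implementation; the helper name is illustrative and the "temporary-directory" config key is assumed here):

import os
import dask

def resolve_local_directory(local_directory=None):
    # Treat "" and None the same: prefer an explicit value, then the dask
    # config, then the current working directory.
    if not local_directory:
        local_directory = dask.config.get("temporary-directory", default=None) or os.getcwd()
    os.makedirs(local_directory, exist_ok=True)
    return local_directory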

@TomAugspurger (Member, Author)

test_successful_cluster failed and I'm not sure why; it passes locally. I restarted that job in case it's a spurious failure.

@gforsyth (Contributor) left a comment

This isn't the first time there have been issues with pytest-asyncio, but I'm surprised that it broke like this again so soon after a major-ish release.

I'll defer to @jcrist on this, but this looks good -- working CI > broken CI.

@@ -574,7 +574,7 @@ async def start_worker(
     nthreads=1,
     memory_limit="auto",
     scheduler_address=None,
-    local_directory="",
+    local_directory=None,
Member:
Did distributed change how the local_directory parameter is handled?

Member:
Looks like this was dask/distributed#3441

Member (Author):
And for reference, dask/distributed#3964 is achieving the same goal (fall back to config / cwd), but by changing the implementation rather than the default. Dunno what the long-term plan is.

@jcrist (Member) commented Jul 15, 2020

LGTM, thanks @TomAugspurger!

@jcrist jcrist merged commit 7becded into dask:master Jul 15, 2020
@TomAugspurger TomAugspurger deleted the test-ci branch July 15, 2020 14:25