dask autolaunching issue with conda package #531
The only possibility is that you have a dask scheduler running for some reason that you are not aware of. I am pretty sure that the particular mongodb container will not be the issue because it does not map port 8787 at all. Have you tried the
In a meeting a short time ago @wangyinz gave some suggestions that helped solve this problem. The cause and solution are explained well here. The problem was created by suggestions I found online that were apparently wrong. I did this:
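The exact lines were dropped from this copy of the thread; presumably it was the commonly suggested implicit-cluster pattern, sketched here with the standard dask.distributed API (the specific lines are an assumption):

```python
from dask.distributed import LocalCluster, Client

# Plausible reconstruction of the "wrong" advice: spin up a scheduler and
# workers in-process, then attach a client to them. With another scheduler
# already bound to port 8787, this collides and throws.
cluster = LocalCluster()
client = Client(cluster)
```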
The second line was what was throwing the error. It appears to be unnecessary. The source referenced above shows what seems to be the right way to handle this. That is, you more or less create the scheduler and worker as separate processes, much as we do in the container, and the processing job then just connects to the instance of "LocalCluster" created externally (a sketch is below). There seems to be a lot of misinformation on this topic on the web. A lot of it is created, actually, by implicit launches of dask that can occur in some contexts, like creating and using a bag. I am going to mark this issue closed.
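A sketch of that externally launched pattern, assuming dask-scheduler and dask-worker were started as separate processes and using the standard dask default scheduler address:

```python
from dask.distributed import Client

# Connect to a scheduler launched as a separate process, e.g.
#   dask-scheduler                          (in one terminal)
#   dask-worker tcp://127.0.0.1:8786        (in another)
# Nothing new is launched from inside the processing job.
client = Client("tcp://127.0.0.1:8786")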
I acted prematurely in closing this issue. The potential solution I noted above did not fix this problem; it actually made it somewhat worse. What happened, however, I think suggests the fundamental problem that is causing this behavior. What I observed is this: if I launch dask-scheduler and a dask-worker in separate windows, which makes them run as an external abstraction of a cluster definition, and then do our stock incantation to access a database:
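The snippet itself is missing here; presumably it was the usual startup sequence (the database name is a placeholder, and get_database is assumed to be the accessor):

```python
from mspasspy.client import Client

# Assumed stock incantation: construct the top-level MsPASS client and
# ask it for a Database handle. "mydb" is a placeholder database name.
client = Client()
db = client.get_database("mydb")
```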
It immediately issues the complaint above about port 8787 already being in use. In this case that is expected, since I launched dask independently. What this shows, however, is that instantiating an instance of Database causes dask to be invoked somewhere. When I look at the client.py file I see a block of import lines near the top that pull in the dask (and pyspark) modules.
Why are those necessary? I haven't dug into the monstrosity of the modules referenced there, but from the behavior it looks like they don't just load code but do some kind of initialization. I don't know how, or even if, you could do that in a python module, but that seems to be what this is doing. dask is ALWAYS referenced in the Database class in the DataFrame sections, but none of that is referenced in the constructor. If someone in the group can't enlighten me on this, I'll need to step into the code with an interactive debugger to try to sort this out. I realize I lack a fundamental understanding of what happens in an "import" command.

This behavior, by the way, is troublesome for our near-term plans, as it could break the cloud implementation with "coiled" that depends on the conda package. The concern is that something we have here will always break if an instance of the abstraction of a dask distributed cluster (i.e. LocalCluster, SLURMCluster, etc.) is used with our Database class. That is a pessimistic view. I think it is more likely only an issue with LocalCluster, as port contention only happens if you are running the components of mspass on a single host. When we run on a cluster with the container, none of this can happen because each "role" is isolated to a container. When running on a desktop, however, that isolation does not occur. That, at least, is my working hypothesis to explain this behavior.
Those import lines are irrelevant. They are used by the code to figure out whether we have dask or pyspark installed, so that the default API of the corresponding one can be used. This is now obsolete inside the database module because that detection logic has since been moved out of it.
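The pattern is roughly a guarded import; a sketch, with illustrative flag names rather than the actual mspasspy identifiers:

```python
# Sketch of the availability-detection pattern described above: the imports
# only establish which scheduler backend is installed.
try:
    import dask.bag  # noqa: F401
    _has_dask = True
except ImportError:
    _has_dask = False

try:
    import pyspark  # noqa: F401
    _has_spark = True
except ImportError:
    _has_spark = False
```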
With that being said, I think you found exactly the problem, which is inside the constructor of our client: mspass/python/mspasspy/client.py, lines 206 to 235 (at 33e74d8). As you can see at line 229, if a dask scheduler is not detected, the constructor actually tries to create one. That actually makes sense, but it means users are not expected to create their own scheduler.
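The detection-and-launch logic is roughly of this shape; this is a sketch with made-up helper names, not the literal client.py code:

```python
import socket

from dask.distributed import Client, LocalCluster

def _scheduler_running(host: str, port: int) -> bool:
    # Crude check for a listener on the default dask scheduler port
    # (8786; the dashboard discussed in this thread is 8787).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

# Paraphrase of the constructor behavior described above: connect to an
# existing scheduler when one is found, otherwise launch a LocalCluster.
if _scheduler_running("localhost", 8786):
    dask_client = Client("tcp://localhost:8786")
else:
    dask_client = Client(LocalCluster())
```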
I think two things are needed to address this issue, and @wangyinz is the only one in the group who can effectively accomplish these:
1. Improve the docstring for mspasspy.client.Client to explain this scheduler-launching behavior.
2. Write a User Manual section on running MsPASS with dask when using the conda package.
One small correction/addition to the previous comment. The docstring for the Database constructor actually mentions that it should be used only for serial jobs, so we have at least an oblique reference to this problem in our documentation. It still screams for the docstring for mspasspy.client.Client to be improved and for a User Manual section to be written on this topic.
Also, I confirm that the approach does address the problem when using the conda package. BUT there is a big caveat: the Database constructor fails UNLESS the environment variable MSPASS_HOME is defined AND points at a directory containing the default mspass.yaml schema file in the data/yaml subdirectory. I do not know if that can be handled in the package definition. A lot of packages have some kind of data directory to hold various required data files. If that cannot be done automatically, it absolutely will require clear directions in the User Manual.
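The workaround amounts to defining the variable before anything in mspasspy.db is constructed; a sketch, where the path is machine-specific and purely illustrative:

```python
import os

# Hypothetical workaround: point MSPASS_HOME at a directory that contains
# data/yaml/mspass.yaml before constructing Database. The path below is a
# placeholder for wherever the mspass source tree lives.
os.environ["MSPASS_HOME"] = "/home/user/src/mspass"
```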
I am a little confused. I thought we don't need MSPASS_HOME because of the fallback in the schema class: mspass/python/mspasspy/db/schema.py, lines 23 to 26 (at 33e74d8).
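Those lines implement a fallback of roughly this shape; this is a sketch, and the exact paths and argument names are assumptions:

```python
import os

def resolve_schema_file(schema_file=None):
    # Sketch of the fallback described above: prefer MSPASS_HOME when it is
    # set, otherwise resolve the default schema relative to this module's
    # own location, so the working directory does not matter.
    if schema_file is None:
        if "MSPASS_HOME" in os.environ:
            schema_file = os.path.join(
                os.environ["MSPASS_HOME"], "data", "yaml", "mspass.yaml"
            )
        else:
            schema_file = os.path.abspath(
                os.path.join(
                    os.path.dirname(__file__), "..", "data", "yaml", "mspass.yaml"
                )
            )
    return schema_file
```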
I guess the error you saw may be from somewhere other than this schema class. Do you have a backtrace of it? We only need to add the same handling in whatever code is throwing the error.
That works when running with the container but not for use with the conda package. In that environment pwd can be anything. Then again, maybe I misunderstand what __file__ resolves to in that code. All I know is it failed, but when I defined MSPASS_HOME the same code worked.
The __file__ variable resolves to the location of the module file itself, not the current working directory, so pwd should not matter there.
hmmm.... weird, I just checked and it seems the conda install does include the data dir at the correct location... |
I just tried with a clean conda install and the following code works fine:
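(The snippet itself was lost in this copy of the thread; presumably it was the minimal startup sequence, sketched here with constructor defaults assumed.)

```python
from mspasspy.client import Client

# Minimal test on a clean conda install: construct the client with
# defaults. Per the discussion above, this also starts a local dask
# scheduler when none is detected.
client = Client()
```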
I also tried calling the Database constructor directly, and that works as well. BTW, I checked and the above code correctly started a local dask cluster, so it works as expected.
After your experience I think I understand what is happening here. I have the conda package superimposed on a local build I installed via a pip install run in the top level of the mspass repository. Indeed, when I look I see that the local library is overriding the conda package. This is a good lesson, because I can see two uses for the conda package:
The solution to this issue is thus in the documentation. I suggest we proceed in two stages:
This issue also may be the solution to resolving the problem we encountered building the arm64 version of the conda package. Recall the obspy dependency of mspass presented a problem for the package and seems to require some other solution, like using pip to install obspy on arm64 machines. Installing obspy via pip is effectively the same issue that caused me to open this issue: I was using conflicting packages installed with pip and conda. This is, in fact, a case in point about the biggest single problem with python that makes the container such an important solution for MsPASS: the nearly inevitable package conflicts that happen when mixing package managers.
You all can kick me, but much of the confusion in the later part of this post was created by a blunder on my part: I had activated the wrong anaconda environment, which is what caused the MSPASS_HOME problem. The original reason for posting this issue, however, remains important. That is, the MsPASS client can launch a second instance of dask in some situations. I don't think this issue should be closed until we fill the documentation gaps noted above. That means, at a minimum, the docstring improvement and User Manual section noted earlier.
hmmm.... I am still a bit confused. Even if you are using the mspasspy installed by pip, it should still have the data directory alongside the module, so the MSPASS_HOME fallback should not be needed.
Well, hmmm back. I'm unable to recreate the error I had that I thought required MSPASS_HOME. For now, presume I did something else wrong and jumped to the wrong conclusion. The problem of dask being relaunched is still there, however, and is something to document for running serial jobs with the conda package. Sorry to make you chase this.
I'm running into some odd behavior trying to use dask with the new "LocalCluster" functionality in the new conda package. The "bottom line" is that when I create an instance of LocalCluster in a jupyter notebook running locally (not in the docker container), I get an error message complaining that port 8787 is already in use.
After that message I can still connect to the standard dask dashboard on port 34081, but when I try to do any dask operation it fails with a long exception chain ending in a connection error.
That message clearly shows that my notebook instance is trying to connect to the scheduler on the wrong port. Thus, even though I could connect to the diagnostic dashboard, dask will not run in this context.
I do not know where this is coming from. Note:
I was running with the new conda package and using the technique I posted in the wiki a couple of weeks ago. That is, I was running the container only intending to use it to run MongoDB. The launch line was this:
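The launch line itself is missing from this copy; based on the role setting described below and the fact that the container maps only MongoDB's port (not 8787), it was presumably something like this, with the image name a placeholder:

```bash
# Hypothetical reconstruction: run the MsPASS container as a MongoDB-only
# service. Port 27017 is mapped but 8787 is not, which is why the container
# should not be able to grab the dask dashboard port.
docker run -d -p 27017:27017 -e MSPASS_ROLE=db mspass/mspass
```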
It was definitely working for running Mongo: everything I ran that used MongoDB worked fine. The problem surfaced only when I tried to use dask. I concluded the docker container was the "smoking gun" for this problem because as soon as I stopped the container I could instantiate an instance of LocalCluster without getting the error shown above.
I looked through the startup script and I cannot see how this is happening. When "MSPASS_ROLE" is set to "db" as above, nothing I can see references dask. Hence, it is possible that the observation that the gun is smoking doesn't mean the docker container is the killer, and this is happening some other way. Do any of you have any idea how we can sort this out?