Multiple workers-per-allocation not working #443
Comments
Hi :) Note that multi-node MPI is not really supported at the moment. That being said, if you do use --workers-per-alloc, it should enqueue a Slurm allocation with 2 nodes and then create a worker on each of those nodes. It seems that in your situation, the second step (creating the workers) has failed. I'm not really sure why; I will try to investigate the error message that you have posted.
Hey @Kobzol, just for clarification: do you mean that running MPI is not supported across multiple nodes (i.e. worker communication), or is it not supported on multiple cores of a single node either? I have been trying to use it for this and came across some problems (like cancelling jobs not cancelling the run of the underlying script-wrapped executable), but I'm not sure if it would make sense to report them if you don't support the use case yet.
I was talking about multi-node MPI specifically. While there's nothing stopping you from using multi-node MPI, currently you can run each task on a single node only, so HyperQueue cannot guarantee that you will get multiple nodes available for the task at the same time. There is support for multi-node tasks, but it's currently heavily experimental and WIP. Using MPI on a single node should be completely fine; you can just say how many cores your task needs. Regarding the issue that tasks may not be killed properly when a task is cancelled or a worker is killed, we are aware of it (#431). In general, please report any issues that you find (unless they are a strict duplicate of some existing issue) :)
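For example, a single-node task that needs a whole node's worth of cores could be submitted roughly like this (a minimal sketch; the exact hq submit flag spelling and the wrapper script name are assumptions, not taken from this thread):
# reserve 32 cores on one worker (i.e. one node) for this task;
# run_calculation.sh is a hypothetical wrapper that launches MPI within the node
$ hq submit --cpus=32 ./run_calculation.sh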
Hello @Kobzol, thank you for the explanation. I wouldn't need to use this command normally, but at the Swiss supercomputing centre CSCS the priority for using 1 node or 10 is the same, so launching 50 jobs on 10 nodes is a much more efficient use of computational resources. Each of these jobs, and consequently the calculations, will only run on one node; I was simply trying to bundle them together. In this case the workers are created and hence the nodes are occupied, and the jobs are also shown as running, but it is the code I use to run the calculation that outputs the error I posted and somehow fails to recognise the way HQ allocates resources. From what I understood, the code shouldn't even know whether there are one or multiple nodes, since HQ is handling the connection with the cluster. Does this specific use case make sense, or is it something that HQ is not designed to achieve?
Is there any specific motivation for this? Unless you overload Slurm with thousands of allocations, or you hit some Slurm permission limit for the number of allocations in the queue, it should be fine to have 50 allocations, each with a single node. To HyperQueue, it won't be any different than e.g. having just 5 allocations, each with 10 nodes.
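For illustration, the two setups could be requested from the automatic allocator roughly like this (a sketch; the exact hq alloc flags are assumed here, only --workers-per-alloc is quoted from this thread):
# many small allocations: each Slurm job asks for a single node and starts one worker
$ hq alloc add slurm --time-limit 2h --workers-per-alloc 1
# fewer, bigger allocations: each Slurm job asks for 10 nodes and starts 10 workers
$ hq alloc add slurm --time-limit 2h --workers-per-alloc 10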
The use case seems perfectly fine :) Could you maybe share the code that you're running (i.e. the script that you're submitting to HyperQueue)? By any chance, does your code invoke srun?
The reason for running those jobs together is that they don't scale well with the number of CPU cores, hence running 4 or 5 of those calculations bundled together on one node is a more efficient use of resources. Now, I could run 50 calculations on 10 nodes using 10 separate allocations, but then I would be running 10 Slurm jobs, which would be prioritised lower than a single Slurm job running on 10 nodes. Hence the motivation to bundle jobs not only on one node but across nodes. Yes, the code indeed uses srun.
And I use this command to submit the script above -
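A generic sketch of this kind of bundled setup (hypothetical file names; the hq and pw.x flags are assumptions, not the author's actual script or submit command):
#!/bin/bash
# run_pw.sh (hypothetical wrapper): run one Quantum ESPRESSO calculation on 32 cores;
# the nested srun call here mirrors what the rest of the thread ends up investigating
srun -n 32 pw.x -in "$1" > "${1%.in}.out"

# submit one HyperQueue job per input file, each requesting 32 cores (flag spelling assumed)
$ for inp in calc_*.in; do hq submit --cpus=32 ./run_pw.sh "$inp"; done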
I see, now I understand. That is indeed a valid use case for multi-node allocations. The fact that your script uses srun is most likely the cause of the problem here. Now, I suppose that it would be possible to make these two work together. I'm not that familiar with SLURM, but it seems to me that you execute your program 32 times with srun.
Yes, this does sound like a probable cause.
Actually no, there is a single instance running on 32 cores using a single invocation of the
This would be super helpful. I would also try to find if we could use some other command like
So we normally run our calculations using
For simple use cases, it shouldn't be required to alter the script that is submitted to HQ when switching e.g. from PBS/Slurm. However, the usage of srun complicates things here.
This description does not correspond to your script, though. If you run srun -n <N> <program> inside a Slurm allocation, it executes the program N times. I tried it on a Slurm cluster:
$ srun -n4 hostname
r250n04
r250n04
r250n03
r250n03
Using -n4 executed hostname four times, spread over two nodes. So what your script basically does is that it runs 32 copies of your program.
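To make the srun semantics concrete (a generic illustration, not the author's actual command): -n sets the number of task copies, while -c sets the number of cores each copy gets.
# starts 32 separate copies of the program (one per task)
$ srun -n 32 pw.x -in input.in
# starts a single copy with 32 cores reserved for it
$ srun -n 1 -c 32 pw.x -in input.in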
I experimented with this a bit more on a Slurm cluster. Here's what I found:
I created #448 which should hopefully help with this issue. @tsthakur Can I send you a modified version of HQ with this patch applied, so that you could test if it helps? What's the CPU architecture of your cluster, x86?
You are probably right; I am not sure how pw.x works internally, but it is possible that there are 32 copies running, each bound to a single core. I will get back to you on that soon.
Yes, for me the code does not launch at all.
Yes please, that would be very helpful. The architecture is x64.
Hi @Kobzol. So on the CSCS cluster, where I am running, they ask to use srun. On other clusters, people would normally use mpirun. So for now, we are stuck with using srun.
I see. If you're not scared of running random binaries :) I built a version of HQ which uses
So I tried the new binary, and I still get the same error
And I still see that nothing is running on the occupied nodes.
Does this also happen if you run your own
Yes. I tried running with
I see. It looks like the Slurm on your cluster is set up in a way that even tasks started with
Yes, that seems a likely scenario. Is there a way to confirm this? Like by using error logs or something. I see that
I suppose that the warning that you get (srun: Job step creation temporarily disabled, retrying (Requested nodes are busy)) already points in that direction.
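As a general Slurm-level aside (not something proposed in this thread): the job steps inside the allocation can be inspected directly, and recent Slurm versions have an srun flag that lets a new step share resources already held by another step.
# accounting view of the allocation and its job steps (job id taken from the error message in the issue)
$ sacct -j 748536 --format=JobID,JobName,State,NNodes
# in recent Slurm versions, --overlap lets a nested step share already-allocated resources
$ srun --overlap -n 32 pw.x -in input.in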
I did encounter the same issue on the same machine (I am from the same group as @tsthakur and @ramirezfranciscof). I worked around it by using
@tsthakur if you still use
What version of HyperQueue are you using? For the past two years, HQ has deployed workers in Slurm multi-node allocations using
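(For reference, the installed version can be checked from the binary itself; assuming the standard flag:)
$ hq --version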
I use
If you configure
If you're encountering a problem that is unrelated to multiple workers per allocation, please open a new issue to avoid piggybacking on this one :) Thanks!
So from what I understood, --workers-per-alloc <worker_count> is used to run multiple workers on multiple nodes, but it doesn't seem to behave that way. For example, if I want to run 5 calculations on one node, I use an automatic allocation with 1 worker per allocation, which then launches 5 jobs (1 task per job) on 1 node. This works as expected. Now, if I launch an allocation with 2 workers per allocation, I was expecting that my 10 calculations would run on 2 nodes, with each node having its own worker. But what happens is that the 2 workers occupy 2 nodes, yet none of the 10 calculations launches, and each reports the following error:
srun: Job 748536 step creation temporarily disabled, retrying (Requested nodes are busy)
Please note that the nodes are not in fact busy: there is one 'big' Slurm job running on these 2 nodes, but there is no process running on either node if I check with top or htop. I may be completely misunderstanding the purpose of the --workers-per-alloc <worker_count> option. In which case, would it be possible to do what I am trying to do in some other way? Or is it a use case that is not planned to be supported?
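A few generic commands for checking what the allocation and the workers are actually doing (assuming current HyperQueue and standard Slurm CLI naming; these are not commands quoted from this thread):
# list the HyperQueue workers and the nodes they are running on
$ hq worker list
# list HyperQueue jobs and their states
$ hq job list
# confirm that the Slurm allocation itself is running and see which nodes it holds
$ squeue -u $USER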