add gpu awareness to queue_info #825
lib/ood_core/job/queue_info.rb
Outdated
@@ -42,4 +46,8 @@ def to_h
        [name, send(name)]
      end.to_h
    end

    def gpu?
      tres.keys.any? { |name| name.to_s.match?(/gpu/i) }
It's not clear if this is going to apply to everyone. Of course, it seems natural to name a GPU TRES .*gpu.*, but I'm sure we'll run into someone who doesn't. I don't know how to handle that case. Maybe we'll need to allow for an environment variable configuration here?
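One way the environment variable idea could look, as a rough sketch only. The variable name `EXAMPLE_GPU_TRES_PATTERN` and the helper `gpu_tres_pattern` are hypothetical and not part of ood_core; the fallback is the `/gpu/i` pattern from the diff above.

```ruby
# Hypothetical sketch: let a site override the GPU TRES pattern.
# EXAMPLE_GPU_TRES_PATTERN is an invented name, not an ood_core setting.
def gpu_tres_pattern(env = ENV)
  custom = env['EXAMPLE_GPU_TRES_PATTERN'].to_s
  # Fall back to the pattern from the diff when nothing is configured.
  custom.empty? ? /gpu/i : Regexp.new(custom)
end

def gpu_tres?(tres, env = ENV)
  pattern = gpu_tres_pattern(env)
  tres.keys.any? { |name| name.to_s.match?(pattern) }
end
```

A site with an unusual TRES name could then set the variable to its own pattern, while everyone else keeps the default.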
For SLURM this would apply, as there are built-in TRES/GRES plugins that have a gpu prefix. I have no clue about other schedulers.
However, the regex might need to be adjusted to avoid matching a non-GPU TRES that has gpu in the name. The format for Slurm is gres/gpu:<name>=<number>, but you can also have just gres/gpu=<number>.
Some examples:
$ squeue -O tres-alloc:100 -t R | grep gpu
cpu=4,mem=62G,node=1,billing=4,gres/gpu=2,gres/gpu:v100-quad=2
cpu=48,mem=363G,node=1,billing=48,gres/gpu=2,gres/gpu:v100-32g=2,gres/pfsdir=0,gres/pfsdir:scratch=0
So that regex should probably be %r{gres/gpu(:|=)}
Ah, this is matching keys, so maybe %r{^gres/gpu($|:)}. This ensures that if a site had a GRES named something like gres/gpu-thing, it wouldn't be treated as a GPU job, since that GRES might be something different.
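A quick sketch of how the anchored pattern behaves on hash keys. The constant and method names here are made up for illustration, and the sample hashes are built from the TRES examples earlier in this thread.

```ruby
# Illustrative names only; the pattern is the one suggested in this thread.
GPU_TRES_KEY = %r{^gres/gpu($|:)}

def gpu_tres?(tres)
  tres.keys.any? { |name| name.to_s.match?(GPU_TRES_KEY) }
end

# Plain and typed GPU GRES keys both count as GPU.
with_gpu = { 'cpu' => 4, 'gres/gpu' => 2, 'gres/gpu:v100-quad' => 2 }
# A GRES that merely starts with "gpu" is not treated as a GPU.
lookalike = { 'cpu' => 4, 'gres/gpu-thing' => 1 }

puts gpu_tres?(with_gpu)
puts gpu_tres?(lookalike)
```

Note that `^` and `$` are line anchors in Ruby; `\A` and `\z` would be stricter if a key could ever contain a newline.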
Awesome! Yea, we're splitting here into keys and values, so for example these two entries
gres/gpu=2,gres/gpu:v100-32g=2
get split and extracted into the hash
{
'gres/gpu': 2,
'gres/gpu:v100-32g': 2
}
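For reference, the split described above could be sketched roughly like this. `parse_tres` is a hypothetical helper, not the actual ood_core code, and it leaves values as strings (an assumption made here because values like `62G` are not plain integers).

```ruby
# Hypothetical helper: turn a Slurm TRES string into a hash of key => value.
def parse_tres(tres_string)
  tres_string.to_s.split(',').map do |pair|
    # Split on the first "=" only, so keys such as "gres/gpu:v100-32g" stay intact.
    key, value = pair.split('=', 2)
    [key, value]
  end.to_h
end

tres = parse_tres('cpu=4,mem=62G,node=1,billing=4,gres/gpu=2,gres/gpu:v100-quad=2')
puts tres.keys.inspect
```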
> However the regex might need to be adjusted to avoid matching a non-GPU TRES that has gpu in the name.

Yea, I think this is what I was worried about, so I can update the regex.
> For SLURM this would apply, as there are built-in TRES/GRES plugins that have a gpu prefix. I have no clue about other schedulers.

Yea, same. A lot of this stuff will be Slurm only until someone can provide a patch for other schedulers.
I do have another question about the model: is the Slurm plugin guaranteed to have the GPU model in the name as well?
Taking this for example, is every Slurm site guaranteed to list out all the GPU models, like this queue having 2 v100-32g?
gres/gpu=2,gres/gpu:v100-32g=2
> Taking this for example, is every Slurm site guaranteed to list out all the GPU models, like this queue having 2 v100-32g?

I do not believe that's guaranteed. This is the "type" in the GRES: https://slurm.schedmd.com/gres.conf.html#OPT_Type. It is documented as optional. Even if a site does specify the type, I'm not 100% certain it would show up in TRES unless the site also includes it in the accounting TRES: https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageTRES
I'm not 100% certain whether accounting TRES configs affect job TRES availability. Either way, the type of GPU is optional, so that's not guaranteed.
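Since the type is optional, any code that reads the model out of the keys probably has to tolerate getting nothing back. A minimal sketch, where `gpu_types` is a hypothetical helper and not something in ood_core:

```ruby
# Hypothetical helper: collect GPU types from TRES keys such as "gres/gpu:v100-32g".
# Returns an empty array when a site only reports the untyped "gres/gpu".
def gpu_types(tres)
  tres.keys.filter_map do |name|
    match = name.to_s.match(%r{^gres/gpu:(.+)$})
    match && match[1]
  end
end

puts gpu_types({ 'gres/gpu' => 2, 'gres/gpu:v100-32g' => 2 }).inspect
puts gpu_types({ 'gres/gpu' => 2 }).inspect
```

Callers would then treat an empty result as "GPU present, model unknown" rather than "no GPU".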
> I do not believe that's guaranteed.

OK, cool, thanks for the info.
add gpu awareness to queue_info so upper layers like OOD can make decisions on it.