add gpu awareness to queue_info #825

Merged
merged 2 commits into from
Mar 29, 2024

Conversation

johrstrom
Contributor

Add GPU awareness to queue_info so upper layers like OOD can make decisions based on it.

@@ -42,4 +46,8 @@ def to_h
         [name, send(name)]
       end.to_h
     end
+
+    def gpu?
+      tres.keys.any? { |name| name.to_s.match?(/gpu/i) }
Contributor Author

It's not clear if this is going to apply to everyone. Of course, it seems natural to name a GPU TRES .*gpu.*, but I'm sure we'll run into someone who doesn't. I don't know how to handle that case. Maybe we'll need to allow for an environment variable configuration here?
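
As a rough illustration of the environment variable idea, the pattern could be read from the environment with a fallback to the default. This is only a sketch: the variable name `GPU_TRES_PATTERN`, the constant `DEFAULT_GPU_PATTERN`, and the `gpu_pattern` helper are hypothetical and not part of this PR.

```ruby
# Hypothetical sketch: GPU_TRES_PATTERN and DEFAULT_GPU_PATTERN are
# illustrative names, not settings that exist in the adapter.
DEFAULT_GPU_PATTERN = /gpu/i

# Returns the site-provided pattern if one is set and valid,
# otherwise the default pattern.
def gpu_pattern(env = ENV)
  custom = env['GPU_TRES_PATTERN'].to_s
  return DEFAULT_GPU_PATTERN if custom.empty?

  Regexp.new(custom, Regexp::IGNORECASE)
rescue RegexpError
  # An invalid site pattern falls back to the default instead of raising.
  DEFAULT_GPU_PATTERN
end

# Same check as in the diff, but with a configurable pattern.
def gpu?(tres, env = ENV)
  pattern = gpu_pattern(env)
  tres.keys.any? { |name| name.to_s.match?(pattern) }
end
```

Passing `env` as an argument keeps the check easy to exercise with a plain hash in place of `ENV`.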

Contributor

For Slurm this would apply, as there are built-in TRES/GRES plugins that have a gpu prefix. I have no clue about other schedulers.

However, the regex might need to be adjusted to avoid matching a non-GPU TRES that has gpu in the name. The format for Slurm is gres/gpu:<name>=<number>, but you can also have just gres/gpu=<number>.

$ squeue -O tres-alloc:100 -t R | grep gpu

Some examples:

cpu=4,mem=62G,node=1,billing=4,gres/gpu=2,gres/gpu:v100-quad=2
cpu=48,mem=363G,node=1,billing=48,gres/gpu=2,gres/gpu:v100-32g=2,gres/pfsdir=0,gres/pfsdir:scratch=0

So that regex should probably be %r{gres/gpu(:|=)}

Contributor

Ah, these are keys, so maybe %r{^gres/gpu($|:)}. This ensures that if a site had a GRES named something like gres/gpu-thing, it wouldn't be treated as a GPU job, since that GRES might be something different.
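
To make the difference between the two patterns concrete, here is a small sketch that runs both against a handful of TRES keys. The keys are taken from the examples in this thread, plus the `gres/gpu-thing` case mentioned above; the constant and method names (`LOOSE`, `STRICT`, `matching_keys`) are made up for illustration.

```ruby
# Candidate patterns from the discussion.
LOOSE  = /gpu/i              # original: matches "gpu" anywhere in the key
STRICT = %r{^gres/gpu($|:)}  # proposed: "gres/gpu" alone or followed by ":<type>"

# Sample TRES keys; "gres/gpu-thing" is the hypothetical non-GPU GRES
# from the review comment.
KEYS = ['cpu', 'mem', 'gres/gpu', 'gres/gpu:v100-32g', 'gres/gpu-thing', 'gres/pfsdir'].freeze

# Returns the keys that the given pattern treats as GPU resources.
def matching_keys(pattern, keys = KEYS)
  keys.select { |key| key.to_s.match?(pattern) }
end

puts "loose:  #{matching_keys(LOOSE).inspect}"
puts "strict: #{matching_keys(STRICT).inspect}"
```

Only the loose pattern picks up `gres/gpu-thing`. Note that in Ruby `^` and `$` anchor at line boundaries; for single-line keys like these that is equivalent to `\A` and `\z`, which anchor the whole string.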

Contributor Author

Awesome! Yea, we're splitting here into keys and values, so for example these two

gres/gpu=2,gres/gpu:v100-32g=2

get split and extracted into the hash

{
  'gres/gpu': 2,
  'gres/gpu:v100-32g': 2
}
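
For readers unfamiliar with the split, a minimal sketch of it might look like the following. `parse_tres` is a hypothetical helper written for this illustration, not the adapter's actual code, and it keeps values as strings because some of them (such as `mem=363G`) are not plain integers.

```ruby
# Hypothetical helper: turns a Slurm TRES string such as
# "cpu=4,mem=62G,gres/gpu=2" into a hash keyed by symbol.
def parse_tres(tres_string)
  tres_string.to_s.split(',').each_with_object({}) do |pair, hash|
    # Split on the first "=" only, so the value is kept intact.
    key, value = pair.split('=', 2)
    next if key.nil? || key.empty?

    hash[key.strip.to_sym] = value.to_s.strip
  end
end

tres = parse_tres('cpu=48,mem=363G,node=1,billing=48,gres/gpu=2,gres/gpu:v100-32g=2,gres/pfsdir=0')
puts tres.inspect

# The keys-based check discussed in this thread then works on the result.
puts tres.keys.any? { |name| name.to_s.match?(%r{^gres/gpu($|:)}) }
```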

> However, the regex might need to be adjusted to avoid matching a non-GPU TRES that has gpu in the name.

Yea I think this is what I was worried about, so I can update the regex.

Contributor Author

> For Slurm this would apply, as there are built-in TRES/GRES plugins that have a gpu prefix. I have no clue about other schedulers.

Yea, same. A lot of this stuff will be Slurm only until someone can provide a patch for other schedulers.

Contributor Author

I do have another question about the model - is the Slurm plugin guaranteed to have the GPU model in the name as well?

Taking this for example, is every Slurm site guaranteed to list out all the GPU models, like this queue having 2 v100-32g?

gres/gpu=2,gres/gpu:v100-32g=2

Contributor

> Taking this for example, is every Slurm site guaranteed to list out all the GPU models, like this queue having 2 v100-32g?

I do not believe that's guaranteed. This is the "type" in the GRES: https://slurm.schedmd.com/gres.conf.html#OPT_Type. It is documented as optional. Even if a site does specify the type, I'm not 100% certain it would show up in TRES unless the site also includes it in the accounting configuration: https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageTRES

I'm not 100% certain if accounting TRES configs affect job TRES availability. Either way the type of GPU is optional so that's not guaranteed.
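
Since the type is optional, any code that reads GPU models from TRES keys should probably treat "no type listed" as a normal case. A minimal sketch, assuming the keys look like the examples in this thread; `gpu_types` is a hypothetical helper, not something this PR adds.

```ruby
# Hypothetical helper: returns the GPU type names found in TRES keys,
# or an empty array when the site does not expose a type.
def gpu_types(tres)
  tres.keys.filter_map do |name|
    match = name.to_s.match(%r{\Agres/gpu:(?<type>.+)\z})
    match && match[:type]
  end
end

# A site that lists the type alongside the generic GPU count.
puts gpu_types({ 'gres/gpu': 2, 'gres/gpu:v100-32g': 2 }).inspect

# A site that only exposes the generic GPU count.
puts gpu_types({ 'gres/gpu': 2 }).inspect
```

Callers can then fall back to the generic `gres/gpu` entry when the returned array is empty.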

Contributor Author

> I do not believe that's guaranteed.

OK, cool thanks for the info.

@johrstrom johrstrom merged commit c397b50 into master Mar 29, 2024
3 checks passed
@johrstrom johrstrom deleted the gpu-queues branch March 29, 2024 15:51
3 participants