Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add gpu awareness to queue_info #825

Merged
merged 2 commits into from
Mar 29, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions lib/ood_core/job/adapters/slurm.rb
Original file line number Diff line number Diff line change
Expand Up @@ -327,13 +327,13 @@ def queues

[].tap do |ret_arr|
info_raw.each_line do |line|
ret_arr << str_to_acct_info(line)
ret_arr << str_to_queue_info(line)
end
end
end

private
def str_to_acct_info(line)
def str_to_queue_info(line)
hsh = line.split(' ').map do |token|
m = token.match(/^(?<key>\w+)=(?<value>.+)$/)
[m[:key], m[:value]]
Expand All @@ -349,6 +349,7 @@ def str_to_acct_info(line)


hsh[:deny_accounts] = hsh[:DenyAccounts].nil? ? [] : hsh[:DenyAccounts].to_s.split(',')
hsh[:tres] = hsh[:TRES].nil? ? {} : hsh[:TRES].to_s.split(',').map { |str| str.split('=') }.to_h

OodCore::Job::QueueInfo.new(**hsh)
end
Expand Down
8 changes: 8 additions & 0 deletions lib/ood_core/job/queue_info.rb
Original file line number Diff line number Diff line change
Expand Up @@ -20,9 +20,13 @@ class OodCore::Job::QueueInfo
# The accounts that are not allowed to use this queue.
attr_reader :deny_accounts

# An Hash of Trackable Resources and their values.
attr_reader :tres

def initialize(**opts)
@name = opts.fetch(:name, 'unknown')
@qos = opts.fetch(:qos, [])
@tres = opts.fetch(:tres, {})

allow_accounts = opts.fetch(:allow_accounts, nil)
@allow_accounts = if allow_accounts.nil?
Expand All @@ -42,4 +46,8 @@ def to_h
[name, send(name)]
end.to_h
end

def gpu?
tres.keys.any? { |name| name.to_s.match?(/gpu/i) }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear if this is going apply to everyone. Of course, it seems apparent to call a GPU tres .*gpu.* but I'm sure we'll run into someone who doesn't. I don't know how to handle that case. Maybe we'll need to allow for an environment variable configuration here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SLURM this would apply as there is built-in TRES/GRES plugins that have gpu prefix. I have no clue on other schedulers.

However the regex might need to be adjusted to avoid matching a non-GPU TRES that has gpu in the name. The format for Slurm is gres/gpu:<name>=<number> but you can also have just gres/gpu=<number>.

$ squeue -O tres-alloc:100 -t R | grep gpu

Some examples:

cpu=4,mem=62G,node=1,billing=4,gres/gpu=2,gres/gpu:v100-quad=2
cpu=48,mem=363G,node=1,billing=48,gres/gpu=2,gres/gpu:v100-32g=2,gres/pfsdir=0,gres/pfsdir:scratch=0

So that regex should probably be %r{gres/gpu(:|=)}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is keys, so maybe %r{^gres/gpu($|:)}. This ensures a site had a GRES named like gres/gpu-thing it wouldn't think that is a GPU job as that GRES might be something different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Yea we're splitting here for keys and values so for example these 2

gres/gpu=2,gres/gpu:v100-32g=2

get split and extracted into the hash

{
  'gres/gpu': 2,
  'gres/gpu:v100-32g': 2
}

However the regex might need to be adjusted to avoid matching a non-GPU TRES that has gpu in the name.

Yea I think this is what I was worried about, so I can update the regex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SLURM this would apply as there is built-in TRES/GRES plugins that have gpu prefix. I have no clue on other schedulers.

Yea, same. A lot of this stuff will be Slurm only until someone can provide a patch for other schedulers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do have another question about the model - is the Slurm plugin guaranteed to have the GPU model in the name as well?

Taking this for example, is every Slurm site guaranteed to list out all the GPY models like this queue having 2 v100-32g?

gres/gpu=2,gres/gpu:v100-32g=2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking this for example, is every Slurm site guaranteed to list out all the GPY models like this queue having 2 v100-32g?

I do not believe that's guaranteed. This is the "type" in the GRES: https://slurm.schedmd.com/gres.conf.html#OPT_Type. It is documented as optional. Even if a site does specify the type, I'm not 100% certain it would show up in TRES unless the site also includes into the accounting: https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageTRES

I'm not 100% certain if accounting TRES configs affect job TRES availability. Either way the type of GPU is optional so that's not guaranteed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not believe that's guaranteed.

OK, cool thanks for the info.

end
end
5 changes: 5 additions & 0 deletions spec/job/adapters/slurm_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -1333,11 +1333,16 @@ def job_info(opts = {})
expect(systems_queue.allow_accounts).to eq(['root', 'pzs0708', 'pzs0710', 'pzs0722'])
expect(systems_queue.deny_accounts).to eq([])
expect(systems_queue.qos).to eq([])
expect(systems_queue.gpu?).to eq(true)

quick_queue = queues.select { |q| q.name == 'quick' }.first
expect(quick_queue.allow_accounts).to eq(nil)
expect(quick_queue.deny_accounts).to eq(quick_deny_accounts)
expect(quick_queue.qos).to eq(['quick'])
expect(quick_queue.gpu?).to eq(false)

gpu_queue = queues.select { |q| q.name == 'gpuserial' }.first
expect(gpu_queue.gpu?).to eq(true)
end
end

Expand Down
Loading