listing available gpus #3

Open
dougbevan opened this issue Nov 19, 2018 · 13 comments

Comments

@dougbevan

This is very useful software. How can we list the available versus used GRES for GPUs?

For instance, if I do:

pestat -G

This is partially good, as I can see the GRES being used. But it doesn't show the GRES available.

For CPUs, you get to see used/total (in my case 0/48). How can I get similar output for GPUs?

@OleHolmNielsen
Owner

I believe the sinfo command will give you the desired information about GRES in nodes.
For example, use this command:

sinfo -o "%P %G %D %N"

@dougbevan
Author

That does give the total GPUs. It would be amazing to have output like this in pestat though, which gives a number of useful metrics all in one output and offers a great "quick glance" for our users.

With pestat -G we get great output for CPUs, like:

Use/Tot
0 48

It would be useful to also see something like:

GRES GPUs
Use / Tot
2 8

@OleHolmNielsen
Owner

I understand now, so I've added a new column GRES/node which is printed if you select the -G flag.
Can you try out the new script and tell me if this does what you want?

@dougbevan
Author

This is excellent. I tried it on one of our single node systems, and I see the available gpu and the GRES/job. Thanks for the addition -- this will be quite useful.

@OleHolmNielsen
Owner

I'm glad this works for you! Please report any issues back to me.

@cheekykite

Hello.

Thanks for providing a good tool.

"GRES/job" is not showing up in a clustered environment.
Can I get an opinion?

master:pestat]#
master:pestat]# ./pestat  -G
GRES (Generic Resource) is printed after each jobid
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  GRES/   Joblist
                            State Use/Tot              (MB)     (MB)  node    JobId User GRES/job ...
      n1        titanxp*     idle   0   6    0.07     60000    62869  gpu:TitanXP:2
      n2        titanxp*     idle   0   6    0.01     60000    62952  gpu:TitanXP:2
      n3        titanxp*     idle   0   6    0.01     60000    62860  gpu:TitanXP:2
      n4        titanxp*     idle   0   6    0.01     60000    62891  gpu:TitanXP:2
      n5        titanxp*     idle   0   6    0.01     60000    62971  gpu:TitanXP:2
      n6        titanxp*     idle   0   6    0.09     60000    62945  gpu:TitanXP:2
      n7        titanxp*     idle   0   6    0.01     60000    63096  gpu:TitanXP:2
      n8        titanxp*     idle   0   6    0.02     60000    63084  gpu:TitanXP:2
      n9        titanxp*      mix   4   6    2.39*    60000    49649  gpu:TitanXP:2 2367 sonic  2360 sonic
     n10        titanxp*     idle   0   6    0.01     60000    63082  gpu:TitanXP:2
master:pestat]#
master:pestat]#
master:pestat]# sinfo --version
slurm 18.08.8
master:pestat]#
master:pestat]# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)
master:pestat]#
master:pestat]#

@OleHolmNielsen
Owner

You're running an old and obsolete version of Slurm. Later versions have significantly improved GPU support, so maybe that's why you don't get the expected information.

The pestat command obtains information from Slurm with:
sinfo -h -N $partition $hostlist $statelist -o "%N %P %C %O %m %e %t %Z %G"
where the %G option prints:
%G Generic resources (gres) associated with the nodes.
Please check "man sinfo" in your Slurm version to see if %G exists.
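To make the field layout concrete, here is a minimal sketch of how one line in the `"%N %P %C %O %m %e %t %Z %G"` layout could be picked apart. It is not pestat's actual code: the helper name `summarize_node` is hypothetical and the sample line is made up. It relies on sinfo's `%C` field being printed as allocated/idle/other/total.

```shell
#!/bin/sh
# Sample line in the layout "%N %P %C %O %m %e %t %Z %G".
# The values are made up for illustration; they are not real sinfo output.
sample='n9 titanxp* 4/2/0/6 2.39 60000 49649 mix 1 gpu:TitanXP:2'

# Hypothetical helper: print hostname, CPUs used/total, and the GRES string.
# sinfo's %C field is "allocated/idle/other/total".
summarize_node() {
    awk '{
        split($3, cpu, "/")          # cpu[1]=allocated, cpu[4]=total
        printf "%s %d/%d %s\n", $1, cpu[1], cpu[4], $9
    }'
}

printf '%s\n' "$sample" | summarize_node    # prints: n9 4/6 gpu:TitanXP:2
# On a live cluster the input would come from sinfo instead, for example:
#   sinfo -h -N -o "%N %P %C %O %m %e %t %Z %G" | summarize_node
```

If the last column is empty or shows `(null)`, the node has no GRES configured or this Slurm version does not report it.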

@OleHolmNielsen
Owner

Can you please test the latest version of pestat? The GRES/job is now being printed correctly.

@clue2

clue2 commented Aug 17, 2022

It might be helpful to change the formatting from -o to -O to make use of the extra formatting options (such as GresUsed).

sinfo -h -N $partition $hostlist $statelist -o "%N %P %C %O %m %e %t %Z %G"

becomes:

sinfo -h -N $partition $hostlist $statelist -O "Nodes,Partition,CPUsState,CPUsLoad,Memory,FreeMem,StateCompact,Threads,Gres"
[Image: Screen Shot 2022-08-17 at 2 18 05 pm]

You could then add in GresUsed (ideally cleaning it up a bit) to get a more helpful overview of how many GPUs are in use/available on a node.

[Image: Screen Shot 2022-08-17 at 2 18 16 pm]
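As a rough illustration of the "cleaning it up a bit" step, the sketch below strips parenthesised details from a GresUsed value. The helper name `clean_gres_used` is hypothetical, and the sample values are assumptions about what `sinfo -O GresUsed` may print (entries such as `gpu:TYPE:COUNT(IDX:...)`); check the output on your own Slurm version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: drop parenthesised details such as "(IDX:0-1)" from a
# GresUsed value so that only "name[:type]:count" entries remain.
clean_gres_used() {
    printf '%s\n' "$1" | sed 's/([^)]*)//g'
}

# Example values are assumptions about what sinfo -O GresUsed may print.
clean_gres_used 'gpu:TitanXP:1(IDX:0)'        # prints: gpu:TitanXP:1
clean_gres_used 'gpu:a100:2(IDX:0,3),mps:0'   # prints: gpu:a100:2,mps:0
```

Removing the parenthesised part first also gets rid of commas inside the index list, so the remainder can be split safely on commas.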

@yzs981130

change the formatting from -o to -O

$prefix/sinfo -h -N $all_partitions $partition $hostlist $statelist -O "NodeList:30,Partition:30,CPUsState:30,CPUsLoad:30,Memory:30,FreeMem:30,StateCompact:30,Threads:30,Gres:30" | $my_awk '

I believe pestat has already used -O to retrieve information.

I have the same need as you: adding a node-level GresUsed to the output of pestat. I therefore added it in my personal fork: yzs981130@7e711af. Hope it helps!

cc @OleHolmNielsen What do you think about the node-level GresUsed? Since this is my first time using awk, I could send a draft PR if you think it is also needed.

@clue2

clue2 commented Aug 22, 2022

You're right, it does use -O now - I hadn't actually checked the code & was just going by the comments above. Thanks!

In that case, just changing Gres to GresUsed does a good enough job:
[Image: Screen Shot 2022-08-22 at 10 51 13 am]

In your fork the formatting has become a bit off for me:
[Image: Screen Shot 2022-08-22 at 10 49 41 am]

@OleHolmNielsen
Owner

Thank you for your suggestion.
The GRES output shows how many GPUs are physically in the node.

With "pestat -G" the GRES used by each job on the node is printed. One could count manually how many GPUs are used.

I agree that "sinfo -O GRESUSED" gives a useful summary of how many GPUs are in use.

However, I think that printing both GRES and GRESUSED data makes the output very long and difficult to read.

Maybe one could think of simplifying by having a "Num_GPU" column with simply the "Use/Tot" numbers. Some complicated parsing of GRES and GRESUSED would be needed.

There could be non-GPU types of GRES, see https://slurm.schedmd.com/gres.conf.html
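The parsing for such a "Num_GPU" column could look roughly like the sketch below. It is not part of pestat. The helper names `gpu_count` and `gpu_use_tot` are hypothetical, and the code assumes GRES entries of the form `gpu:COUNT` or `gpu:TYPE:COUNT`, optionally followed by a parenthesised detail. Entries with other names are ignored, and counts with unit suffixes are not handled.

```shell
#!/bin/sh
# Hypothetical helper: sum the counts of all "gpu" entries in a GRES string.
# Assumes entries look like "gpu:2" or "gpu:TYPE:2", optionally followed by
# a parenthesised detail such as "(IDX:0-1)".
gpu_count() {
    printf '%s\n' "$1" | awk '{
        gsub(/\([^)]*\)/, "")            # drop "(...)" so inner commas vanish
        n = split($0, entry, ",")
        total = 0
        for (i = 1; i <= n; i++) {
            m = split(entry[i], part, ":")
            # Count only entries named "gpu"; the count is the last field.
            if (part[1] == "gpu" && m >= 2) total += part[m]
        }
        print total
    }'
}

# Hypothetical helper: print "used/total" GPUs for one node.
# $1 is the node's Gres string, $2 is its GresUsed string.
gpu_use_tot() {
    printf '%s/%s\n' "$(gpu_count "$2")" "$(gpu_count "$1")"
}

gpu_use_tot 'gpu:TitanXP:2' 'gpu:TitanXP:1(IDX:0)'   # prints: 1/2
```

A node whose GRES field is `(null)` comes out as 0/0, which a real implementation might prefer to print as blank.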

Do you have suggestions for making the output of pestat more useful and simple to read?

@OleHolmNielsen
Owner

Note added: Sites have to define their own GRES types in slurm.conf using the GresTypes parameter.
It can become complex for "pestat" to decode all possible GRES types and extract numbers for "Use/Tot".
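One way to sidestep knowing the site's GRES types in advance is to tally whatever names appear in the strings themselves. The sketch below does that under the same format assumptions (`name:COUNT` or `name:TYPE:COUNT`, optional parenthesised detail). The helper name `gres_use_tot_by_name` and the sample values are hypothetical, and this is only an illustration, not a proposal for pestat's code.

```shell
#!/bin/sh
# Hypothetical helper: print "name used/total" for every GRES name found,
# with totals taken from the Gres string ($1) and used counts from the
# GresUsed string ($2). Names are printed in order of first appearance.
gres_use_tot_by_name() {
    awk -v gres="$1" -v used="$2" '
    # Add the counts found in str to cnt[kind, name].
    function tally(str, kind,    n, i, m, entry, part) {
        gsub(/\([^)]*\)/, "", str)       # drop "(IDX:...)" style details
        n = split(str, entry, ",")
        for (i = 1; i <= n; i++) {
            m = split(entry[i], part, ":")
            if (m < 2) continue          # skip entries without a count
            if (!(part[1] in seen)) { seen[part[1]] = 1; order[++names] = part[1] }
            cnt[kind, part[1]] += part[m]
        }
    }
    BEGIN {
        tally(gres, "tot")
        tally(used, "use")
        for (i = 1; i <= names; i++)
            printf "%s %d/%d\n", order[i], cnt["use", order[i]], cnt["tot", order[i]]
    }'
}

# Made-up sample values for a node with two GRES names.
gres_use_tot_by_name 'gpu:a100:4,mps:400' 'gpu:a100:2(IDX:0,3),mps:0'
# prints:
#   gpu 2/4
#   mps 0/400
```

This merges all types under one name (for example two GPU models on the same node), which keeps the output short at the cost of per-type detail.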
