zesDevicePciGetProperties() reports invalid PCI bandwidth (zeDevicePciGetPropertiesExt() is correct) #778

Open
bgoglin opened this issue Nov 28, 2024 · 0 comments
Labels
L0 Sysman Issue related to L0 Sysman

Comments


bgoglin commented Nov 28, 2024

Hello
While finishing the switch from ZES_ENABLE_SYSMAN=1 to zesInit() in hwloc, I have to remove some duplicate code that was used in the past when Sysman could not be enabled. One such piece is the query of the device's PCI properties.
I was notified that PVC on Aurora gets different PCI maxBandwidth values from zesDevicePciGetProperties() and zeDevicePciGetPropertiesExt() (open-mpi/hwloc#595 (comment)). ZE reports 63GB/s as expected, while ZES reports 0.25GB/s.
One explanation could be that one API reports the maximum possible value while the other reports the current (possibly idle) value. However, the ZES doc describes the field as "The maximum bandwidth in bytes/sec (sum of all lanes)", so 0.25GB/s doesn't make sense either way.
I don't have access to Aurora to debug further. I tested on other platforms, but they seem to have older releases of the runtime (including your endeavour cluster), and they just report -1 from ZES (ZE is correct there too).
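
For reference, both figures can be reproduced from the standard PCIe per-lane transfer rates and line encodings. The small Python sketch below is only an illustration: the helper name `pcie_max_bandwidth_bytes_per_sec` and the table are hypothetical and not part of Level Zero or hwloc. It shows that 63GB/s matches the theoretical figure for a Gen5 x16 link, while 0.25GB/s is exactly what a Gen1 x1 link would give. Whether the ZES path actually derives its value from such a link state is an assumption, not something confirmed in this report.

```python
# Per-lane raw rate (GT/s) and line-encoding efficiency for each PCIe generation.
# Gen1/Gen2 use 8b/10b encoding, Gen3 to Gen5 use 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def pcie_max_bandwidth_bytes_per_sec(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in bytes/sec, summed over all lanes.

    Hypothetical helper for illustration; it ignores protocol overhead
    beyond the line encoding.
    """
    rate_gt, efficiency = PCIE_GENERATIONS[gen]
    # One transfer carries one bit per lane; divide by 8 to get bytes.
    return rate_gt * 1e9 * efficiency / 8 * lanes


if __name__ == "__main__":
    for gen, lanes in [(5, 16), (1, 1)]:
        gb_per_sec = pcie_max_bandwidth_bytes_per_sec(gen, lanes) / 1e9
        print(f"Gen{gen} x{lanes}: {gb_per_sec:.2f} GB/s")
```

If the assumption holds, comparing the reported maxBandwidth against this table for the generation and width fields returned alongside it would show whether ZES is computing the value from a current (downtrained or idle) link state rather than from the maximum one.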

JablonskiMateusz added the L0 Sysman label Nov 29, 2024