levelzero: only use Sysman queries instead of similar Core API queries #595

Open: bgoglin wants to merge 4 commits into master

bgoglin (Contributor) commented Jun 12, 2023

This goes on top of #594, which might be merged once a compute-runtime release brings a working zesInit(). This PR requires zesInit() to be more widely available, especially for the last commit, which completely removes the old setting of ZES_ENABLE_SYSMAN=1 in the environment.
Now that we require zesInit(), just use ZES queries instead of switching between ZES and the Core API depending on whether ZES is available.

@bgoglin force-pushed the l0-always-sysman branch 2 times, most recently from d978732 to 980e829, on June 20, 2023 09:32
bgoglin (Contributor, Author) commented Nov 8, 2024

This should be rebased/simplified on top of #695

bgoglin (Contributor, Author) commented Nov 13, 2024

@saik-intel Now that we use zesInit(), I guess there's no reason to query both zeDevicePciGetPropertiesExt() and zesDevicePciGetProperties() in case one fails but not the other. The latter should always be supported now, right?
Same question for zeDeviceGetMemoryProperties() vs zesDeviceEnumMemoryModules()+zesMemoryGetProperties(): is there any reason to prefer one over the other? We just want to know how much memory each device or subdevice has, and whether that memory is HBM/DRAM/etc.

Now that zesInit() is mandatory, don't bother falling back
to the core API, Sysman shouldn't fail.

Signed-off-by: Brice Goglin <[email protected]>

Now that zesInit() is mandatory, don't bother trying the core API extension
first in case sysman wouldn't be available.
Hence remove the zeDevicePciGetPropertiesExt() optional detection.

Signed-off-by: Brice Goglin <[email protected]>

We don't need it anymore.

Signed-off-by: Brice Goglin <[email protected]>
@bgoglin changed the title from "[WIP DNM] L0 always sysman" to "levelzero: only use Sysman queries instead of similar Core API queries" on Nov 27, 2024
bgoglin (Contributor, Author) commented Nov 27, 2024

@TApplencourt Could you please run lstopo aurora.xml from this branch using the tarball from https://ci.inria.fr/hwloc/job/basic/view/change-requests/job/PR-595/ ? I'd like to double-check that we don't lose any info when removing non-Sysman queries.

@TApplencourt

Of course! Here it is:
aurora.xml.txt

At first glance, everything looks good to me.

$ applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ./ici/bin/lstopo  | grep "ze"
              CoProc(LevelZero) "ze0"
                CoProc(LevelZero) "ze0.0"
                CoProc(LevelZero) "ze0.1"
              CoProc(LevelZero) "ze1"
                CoProc(LevelZero) "ze1.0"
                CoProc(LevelZero) "ze1.1"
              CoProc(LevelZero) "ze2"
                CoProc(LevelZero) "ze2.0"
                CoProc(LevelZero) "ze2.1"
              CoProc(LevelZero) "ze3"
                CoProc(LevelZero) "ze3.0"
                CoProc(LevelZero) "ze3.1"
              CoProc(LevelZero) "ze4"
                CoProc(LevelZero) "ze4.0"
                CoProc(LevelZero) "ze4.1"
              CoProc(LevelZero) "ze5"
                CoProc(LevelZero) "ze5.0"
                CoProc(LevelZero) "ze5.1"
$applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ./ici/bin/lstopo  | grep "cl"
              CoProc(OpenCL) "opencl0d0"
              CoProc(OpenCL) "opencl0d1"
              CoProc(OpenCL) "opencl0d2"
              CoProc(OpenCL) "opencl0d3"
              CoProc(OpenCL) "opencl0d4"
              CoProc(OpenCL) "opencl0d5"

But make check reported an error:

applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> cat tests/hwloc/test-suite.log
========================================================================
   hwloc PR-595-20241113.1537.gitc2b2cfd7: tests/hwloc/test-suite.log
========================================================================

# TOTAL: 46
# PASS:  45
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: levelzero
===============

levelzero: ../../../tests/hwloc/levelzero.c:188: int main(void): Assertion `atoi(osdev->name+2) == (int) k' failed.
./wrapper.sh: line 32: 175099 Aborted                 "$@"
FAIL levelzero (exit status: 134)

Switching between COMPOSITE and FLAT doesn't change anything:

applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ZE_FLAT_DEVICE_HIERARCHY=FLAT ./tests/hwloc/levelzero
testing ZE devices
found 1 L0 drivers
found 12 L0 devices in driver #0
found OSDev ze0
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #0 device #0
found OSDev ze0
levelzero: ../../../tests/hwloc/levelzero.c:100: int main(void): Assertion `atoi(osdev->name+2) == (int) k' failed.
Aborted
applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE ./tests/hwloc/levelzero
testing ZE devices
found 1 L0 drivers
found 6 L0 devices in driver #0
found OSDev ze0
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #0 device #0
found OSDev ze1
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #1 device #0
found OSDev ze2
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #2 device #0
found OSDev ze3
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #3 device #0
found OSDev ze4
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #4 device #0
found OSDev ze5
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #5 device #0
testing ZES devices
found 1 L0 ZES drivers
found 6 L0 ZES devices in driver #0
found OSDev ze1
levelzero: ../../../tests/hwloc/levelzero.c:188: int main(void): Assertion `atoi(osdev->name+2) == (int) k' failed.
Aborted

Using:

applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ze_info
Number of drivers                                 1
  Driver API Version                              1.5
  Driver Version                                  17004696
$intel_compute_runtime/release/996.26

bgoglin (Contributor, Author) commented Nov 28, 2024

Thanks. There's at least one bug in the test file; I'll try to debug your report further.

Aside from that, it looks like ZES fails to report a valid PCI link speed (it shows 0.25GB/s instead of 63 in your case, and nothing on my machines). I'll revert that part and use ZE for this for now.
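
For context on the two figures above, here is a minimal sketch of the usual PCIe bandwidth arithmetic. It is not hwloc's code, and pcie_link_gbs() is a hypothetical helper. Reading 0.25GB/s as a Gen1 x1 link and 63 as a Gen5 x16 link is an assumption based only on the arithmetic; the thread does not say what ZES actually returned.

```c
/* Approximate usable bandwidth of a PCIe link in GB/s.
 * gen: PCIe generation (1..5), lanes: link width.
 * Hypothetical helper for illustration, not part of hwloc or Level Zero. */
double pcie_link_gbs(int gen, int lanes)
{
    /* Raw per-lane signalling rate in GT/s for Gen1..Gen5. */
    static const double gts[] = { 2.5, 5.0, 8.0, 16.0, 32.0 };
    double efficiency;

    if (gen < 1 || gen > 5 || lanes < 1)
        return 0.0;

    /* Gen1/Gen2 use 8b/10b encoding, Gen3 and later use 128b/130b. */
    efficiency = (gen <= 2) ? 8.0 / 10.0 : 128.0 / 130.0;

    /* Each transfer carries one bit per lane; divide by 8 to get bytes. */
    return gts[gen - 1] * efficiency * lanes / 8.0;
}
```

With these assumptions, pcie_link_gbs(1, 1) is about 0.25 and pcie_link_gbs(5, 16) is about 63.0, which is consistent with ZES reporting minimal link parameters rather than the real ones.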

bgoglin (Contributor, Author) commented Nov 28, 2024

Could you comment out the assert on line 188 of tests/hwloc/levelzero.c, then run make -C tests/hwloc levelzero && tests/hwloc/levelzero? I'd like to confirm that ZES and ZE report devices in different orders. That would be funny, but easy to fix.
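
If the two APIs do enumerate devices in different orders, one possible way to make such a check order-independent is to match devices by PCI address instead of by enumeration index. The sketch below is only an illustration under that assumption, not the fix adopted in hwloc. struct pci_addr and find_by_pci_addr() are hypothetical names, and in real code the fields would come from the PCI properties each API reports (for instance zeDevicePciGetPropertiesExt() and zesDevicePciGetProperties(), both mentioned earlier in this thread).

```c
#include <stddef.h>

/* Hypothetical PCI address record, filled from whichever API
 * enumerated the device. */
struct pci_addr {
    unsigned domain, bus, device, function;
};

/* Two addresses match only if all four components are equal. */
static int pci_addr_equal(const struct pci_addr *a, const struct pci_addr *b)
{
    return a->domain == b->domain && a->bus == b->bus
        && a->device == b->device && a->function == b->function;
}

/* Return the index in list[] of the entry matching key, or -1 if none.
 * The index is then independent of the order in which the other API
 * enumerated the same devices. */
int find_by_pci_addr(const struct pci_addr *list, size_t n,
                     const struct pci_addr *key)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (pci_addr_equal(&list[i], key))
            return (int) i;
    return -1;
}
```

A test could then look up each ZES device in the ZE list (or the reverse) and compare against the index found, rather than assuming that position k in one list is position k in the other.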
