rocm support enabled in nocuda builds and installed in non-AMD machines? #64

Closed
1 task done
traversaro opened this issue Aug 15, 2023 · 6 comments

@traversaro
Contributor

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

Since #62 was merged, every program that uses hwloc prints an error message about a failed rocm initialization when it is installed and run on a machine without any cuda or rocm graphics card.

See, for example, hwloc-ls:

(libhwloc) traversaro@IITICUBLAP257:~/mambaforge/envs/libhwloc/include$ hwloc-ls
Exception caught: rsmi_init.
hwloc/rsmi: Failed to initialize with rsmi_init(): RSMI_STATUS_INIT_ERROR: An error occurred during initialization, during monitor discovery or when when initializing internal data structures
Machine (15GB total)
  Package L#0
    NUMANode L#0 (P#0 15GB)
    L3 L#0 (12MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
  HostBridge
    PCI 022f:00:00.0 (3D)
  HostBridge
    PCI 1fcc:00:00.0 (3D)
  HostBridge
    PCI 2564:00:00.0 (SCSI)
  HostBridge
    PCI 50c2:00:00.0 (SCSI)
  HostBridge
    PCI 968a:00:00.0 (SCSI)
  HostBridge
    PCI e0bd:00:00.0 (SCSI)
  Block(Disk) "sdb"
  Block(Disk) "sdc"
  Block(Disk) "sda"
  Net "eth0"

or hwloc-info:

 (libhwloc) traversaro@IITICUBLAP257:~$ hwloc-info
Exception caught: rsmi_init.
depth 0:           1 Machine (type #0)
 depth 1:          1 Package (type #1)
  depth 2:         1 L3Cache (type #6)
   depth 3:        6 L2Cache (type #5)
    depth 4:       6 L1dCache (type #4)
     depth 5:      6 L1iCache (type #9)
      depth 6:     6 Core (type #2)
       depth 7:    12 PU (type #3)
Special depth -3:  1 NUMANode (type #13)
Special depth -4:  6 Bridge (type #14)
Special depth -5:  6 PCIDev (type #15)
Special depth -6:  4 OSDev (type #16)

The return code of the program is still 0 (i.e. success), but I was wondering whether this is intended behaviour, as it may be confusing for users.
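
To check whether a given environment is affected, a minimal Python sketch could run a hwloc tool and report its exit code together with any rsmi-related lines. The marker strings are copied from the output above; the helper names `rsmi_noise` and `check_tool` are hypothetical, and since it is not established here whether the messages go to stdout or stderr, the sketch scans both.

```python
import shutil
import subprocess

# Marker strings copied from the console output shown in this issue.
RSMI_MARKERS = ("Exception caught: rsmi_init", "hwloc/rsmi: Failed to initialize")


def rsmi_noise(output: str) -> list[str]:
    """Return the lines of `output` that look like rsmi_init failure messages."""
    return [
        line
        for line in output.splitlines()
        if any(marker in line for marker in RSMI_MARKERS)
    ]


def check_tool(tool: str = "hwloc-ls") -> None:
    """Run `tool` and print its exit code alongside any rsmi messages."""
    if shutil.which(tool) is None:
        print(f"{tool} not found on PATH, skipping")
        return
    proc = subprocess.run([tool], capture_output=True, text=True)
    # Assumption: the stream is unknown, so stdout and stderr are both scanned.
    noise = rsmi_noise(proc.stdout + "\n" + proc.stderr)
    print(f"exit code: {proc.returncode}, rsmi messages: {len(noise)}")
    for line in noise:
        print("  " + line)
```

Calling `check_tool()` or `check_tool("hwloc-info")` inside the conda environment would then show a zero exit code next to a non-zero message count on an affected machine.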

Installed packages

(libhwloc) traversaro@IITICUBLAP257:~/mambaforge/envs/libhwloc/include$ conda list
# packages in environment at /home/traversaro/mambaforge/envs/libhwloc:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
icu                       72.1                 hcb278e6_0    conda-forge
libgcc-ng                 13.1.0               he5830b7_0    conda-forge
libgomp                   13.1.0               he5830b7_0    conda-forge
libhwloc                  2.9.2           nocuda_h7313eea_1008    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
libstdcxx-ng              13.1.0               hfd8a6a1_0    conda-forge
libxml2                   2.11.5               h0d562d8_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
rocm-smi                  5.6.0                h59595ed_1    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Environment info

(libhwloc) traversaro@IITICUBLAP257:~/mambaforge/envs/libhwloc/include$ conda info

     active environment : libhwloc
    active env location : /home/traversaro/mambaforge/envs/libhwloc
            shell level : 1
       user config file : /home/traversaro/.condarc
 populated config files : /home/traversaro/mambaforge/.condarc
                          /home/traversaro/.condarc
          conda version : 23.3.1
    conda-build version : 3.25.0
         python version : 3.10.10.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.2=0
                          __glibc=2.35=0
                          __linux=5.15.90.2=0
                          __unix=0=0
       base environment : /home/traversaro/mambaforge  (writable)
      conda av data dir : /home/traversaro/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/traversaro/mambaforge/pkgs
                          /home/traversaro/.conda/pkgs
       envs directories : /home/traversaro/mambaforge/envs
                          /home/traversaro/.conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.31.0 CPython/3.10.10 Linux/5.15.90.2-microsoft-standard-WSL2 ubuntu/22.04.2 glibc/2.35
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False
@traversaro traversaro added the bug label Aug 15, 2023
@jan-janssen
Member

@isuruf Should we add a separate rocm build and have the default version build without rocm?

@traversaro
Contributor Author

A side effect of this is that in some cases downstream projects link the rocm_smi64 library, as the .pc file for nocuda builds is:

prefix=/home/traversaro/mambaforge/envs/libhwloc
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: hwloc
Description: Hardware locality detection and management library
Version: 2.9.2
Requires.private: libxml-2.0
Cflags: -I${includedir}
Libs: -L${libdir} -lhwloc
Libs.private: -lm  -lrocm_smi64 -L/home/traversaro/mambaforge/envs/libhwloc/lib -lxml2    -lpthread

That said, this .pc file is actually correct for rocm-enabled builds; the actual problem is conda-forge/conda-forge.github.io#1880, and the actual solution is to start using pkgconf in place of pkg-config.
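
For context, in a .pc file the `Libs` field carries the flags for an ordinary link, while `Libs.private` lists what is additionally needed for static linking, so `-lrocm_smi64` should only appear in the static case. The Python sketch below illustrates that split on a copy of the file above. It is a deliberately simplified parser, not the algorithm used by pkg-config or pkgconf (for instance, it ignores `Requires.private`), and `parse_pc` and `link_libs` are hypothetical names.

```python
import re

# Copy of the .pc file shown above (Description line omitted for brevity).
PC_TEXT = """\
prefix=/home/traversaro/mambaforge/envs/libhwloc
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: hwloc
Version: 2.9.2
Requires.private: libxml-2.0
Cflags: -I${includedir}
Libs: -L${libdir} -lhwloc
Libs.private: -lm  -lrocm_smi64 -L/home/traversaro/mambaforge/envs/libhwloc/lib -lxml2    -lpthread
"""


def parse_pc(text: str) -> tuple[dict[str, str], dict[str, str]]:
    """Split .pc text into (variables, fields), expanding ${var} references."""
    variables: dict[str, str] = {}
    fields: dict[str, str] = {}

    def expand(value: str) -> str:
        # Unknown variables expand to an empty string in this sketch.
        return re.sub(r"\$\{(\w+)\}", lambda m: variables.get(m.group(1), ""), value)

    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        var = re.match(r"^(\w+)=(.*)$", line)
        field = re.match(r"^([\w.]+):\s*(.*)$", line)
        if var:
            variables[var.group(1)] = expand(var.group(2))
        elif field:
            fields[field.group(1)] = expand(field.group(2))
    return variables, fields


def link_libs(fields: dict[str, str], static: bool = False) -> list[str]:
    """Return the -l flags from Libs, plus those from Libs.private when static."""
    flags = fields.get("Libs", "").split()
    if static:
        flags += fields.get("Libs.private", "").split()
    return [flag for flag in flags if flag.startswith("-l")]


_, pc_fields = parse_pc(PC_TEXT)
print("shared:", link_libs(pc_fields))
print("static:", link_libs(pc_fields, static=True))
```

Under these assumptions, only the static list contains `-lrocm_smi64`, which is consistent with the file being correct for rocm-enabled builds.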

@isuruf
Member

isuruf commented Aug 15, 2023

Yeah, a separate build with rocm makes sense

@traversaro
Contributor Author

fyi @fl-ferr

@traversaro
Contributor Author

In some internal workflows with @fl-ferr, we started seeing errors like:

root@0b32781a2be8:/# python -m rl_zoo3.train --algo sac --env Pendulum-v1 --track
2023-08-29 13:47:39.614246: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-29 13:47:39.649210: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-29 13:47:39.649565: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-29 13:47:40.329235: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
========== Pendulum-v1 ==========
Seed: 2454984228
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
cat: /sys/module/amdgpu/initstate: No such file or directory
ERROR:root:Driver not initialized (amdgpu not found in modules)

That went away as soon as rocm-smi was uninstalled. I am not sure what actually triggered this error, but since the consensus was that a separate variant for rocm made sense anyway, I implemented it in #66.
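
The last two log lines suggest that something probed `/sys/module/amdgpu` and did not find it. A minimal Python sketch of such a check is given below, as an illustration only: which component performs the probe in the log above is not established here, and `amdgpu_module_loaded` is a hypothetical helper, not part of rocm-smi or hwloc.

```python
from pathlib import Path


def amdgpu_module_loaded(sys_module_dir: str = "/sys/module") -> bool:
    """Return True if an amdgpu directory exists under the sysfs module directory."""
    # On Linux, loaded kernel modules appear as directories under /sys/module.
    return (Path(sys_module_dir) / "amdgpu").is_dir()


print("amdgpu module present:", amdgpu_module_loaded())
```

On a machine without an AMD GPU driver, such as the WSL2 setup reported in this issue, this would be expected to report that the module is absent.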

@traversaro
Contributor Author

Fixed by #66.
