[BUG] Conda install does not consistently work #886

Open
ejmeitz opened this issue Nov 16, 2023 · 10 comments

@ejmeitz

ejmeitz commented Nov 16, 2023

Software versions

Python : 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
Platform : Linux-3.10.0-1160.99.1.el7.x86_64-x86_64-with-glibc2.17
Legion : (failed to detect)
Traceback (most recent call last):
  File "/home/emeitz/software/anaconda3/bin/legate-issue", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/emeitz/software/anaconda3/lib/python3.11/site-packages/legate/issue.py", line 79, in main
    print(f"Legate : {try_version('legate', '__version__')}")
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emeitz/software/anaconda3/lib/python3.11/site-packages/legate/issue.py", line 32, in try_version
    return getattr(module, attr) if module else None
           ^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'legate' has no attribute '__version__'

Jupyter notebook / Jupyter Lab version

N/A

Expected behavior

I am trying to use cunumeric, which requires legate.core to be installed first. I thought I had everything up and running, but after restarting my terminal the legate command was no longer on my PATH (the conda env was active). I went digging through the anaconda packages and found this package: legate-core-23.03.00-cuda11_py311_g5de57a8_3, which has a bin directory with the legate and legate-issue binaries inside. I'm not exactly sure what is wrong, so I will just list some things I found strange:

  • The name of this pkg is a little weird because my machine has CUDA toolkit 12.0 (with accompanying driver) installed, and I was expecting the anaconda install to automatically pick up the symlink at /usr/local/cuda, but it does not appear to have done that.
  • The anaconda install would not come with GPU support (as indicated by legate --info) until I installed some other libraries. I do not know exactly which one fixed this, but I'm guessing it was NCCL.
  • Even if I manually add the path of this legate binary to my PATH, I cannot run my cunumeric program, which I could do before.
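
As a side note on the package name above: conda package directories generally follow a name-version-build layout, so the name itself can be read as a quick check of which release and CUDA build was pulled in. The following is a minimal sketch under that assumption; the helper name split_conda_pkg_name is hypothetical, and the meaning of the individual build-string tokens is a packaging convention rather than something conda guarantees.

```python
def split_conda_pkg_name(dirname: str) -> tuple:
    """Split a 'name-version-build' package directory name into three parts.

    Assumes the usual conda layout, where the version and the build string
    are the last two dash-separated fields.
    """
    name, version, build = dirname.rsplit("-", 2)
    return name, version, build


name, version, build = split_conda_pkg_name(
    "legate-core-23.03.00-cuda11_py311_g5de57a8_3"
)
print("name:   ", name)
print("version:", version)
# The build string is underscore-separated by convention (an assumption here).
print("build tokens:", build.split("_"))
```

Read this way, the directory corresponds to a 23.03.00 release built against CUDA 11, not a CUDA 12 build.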

Observed behavior

  • Before I manually added the anaconda pkg to my PATH, legate would not register as a valid command.
  • After adding it to my PATH, I get a weird message:

(base) [emeitz@gpu-node-1 pkgs]$ legate
/home/emeitz/software/anaconda3/pkgs/legate-core-23.03.00-cuda11_py311_g5de57a8_3/bin/legate: line 2: /opt/conda/conda-bld/legate-core_1678881258206/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/bin/python: No such file or directory
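
For context on the message above: conda-build compiles packages into a long placeholder prefix (the _h_env_placehold... path), and conda normally rewrites that prefix to the real environment path when it installs the package into an environment. Files run straight out of the pkgs cache have not gone through that rewrite. The sketch below shows one way to spot such an unrelocated launcher; the function name and the sample launcher text are hypothetical and only illustrate the idea.

```python
PLACEHOLDER = "_h_env_placehold"


def find_unrelocated_lines(script_text: str, max_lines: int = 5) -> list:
    """Return the leading lines of a launcher script that still contain
    the conda-build placeholder prefix."""
    head = script_text.splitlines()[:max_lines]
    return [line for line in head if PLACEHOLDER in line]


# Hypothetical, shortened launcher contents (not the real legate script).
sample_launcher = "\n".join([
    "#!/bin/sh",
    "exec /opt/conda/conda-bld/pkg_0/_h_env_placehold_placehold/bin/python script.py",
])

for line in find_unrelocated_lines(sample_launcher):
    print("unrelocated interpreter path:", line)
```

In practice the text would come from reading the first few lines of the file that the shell resolves for legate.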

Example code or instructions

I'm not sure this is reproducible; I'm just having issues getting things working as they should.

Stack traceback or browser console output

No response

@manopapad
Contributor

Thank you for reporting this.

Something is very broken with this installation (legate not on $PATH, missing __version__ from legate module, the bin/python pointing to a conda-build temporary directory). I am also surprised to see that legate-issue is available. This was supposed to be included with the upcoming 23.11 packages, and apparently we jumped the gun by adding it to the bug report template before those were out.

Do you remember how you tried to install originally?

Could you share the current contents of your environment (conda list)?

Could you share the output of conda info? We're looking at the virtual packages in particular. That's where we can see what version of CUDA conda is detecting on your machine.

Could you try creating a new environment, requesting the latest available packages?

mamba create -n test -c nvidia -c conda-forge -c legate legate.core=23.09

Conda would ideally be picking the latest version of the legate.core package, but it might be having trouble fulfilling that version's dependencies, and ends up using an earlier version (whose dependencies are incomplete).
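
The virtual-package check mentioned above can also be scripted. This is a minimal sketch, assuming that the JSON output of conda info --json lists virtual packages under a "virtual_pkgs" key as [name, version, build] triples; the helper name is hypothetical, and the sample data is illustrative rather than the reporter's actual output.

```python
import json
from typing import Optional


def cuda_virtual_package(conda_info: dict) -> Optional[str]:
    """Return the version of the __cuda virtual package, or None if conda
    did not detect a CUDA driver.

    Assumes each entry under "virtual_pkgs" is [name, version, build].
    """
    for entry in conda_info.get("virtual_pkgs", []):
        if entry[0] == "__cuda":
            return entry[1]
    return None


# Illustrative sample. In real use the dict would come from something like:
#   json.loads(subprocess.run(["conda", "info", "--json"],
#                             capture_output=True, text=True).stdout)
sample = json.loads(
    '{"virtual_pkgs": [["__glibc", "2.17", "0"], ["__cuda", "12.0", "0"]]}'
)
print("CUDA seen by conda:", cuda_virtual_package(sample))
```

A result of None would suggest the solver sees no CUDA driver at all, which can steer it toward CPU-only or older builds.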

@ejmeitz
Author

ejmeitz commented Nov 16, 2023

I originally tried using the install.py script and kept running into missing dependencies (rust, nccl, skbuild etc.). Eventually it did build, but it had issues similar to the ones I now see with the anaconda version. I completely nuked this install and the anaconda env I installed it into. Also, legate-issue still had that bug with __version__ missing when everything appeared to be working.

Here's the anaconda info. This was a completely new install of anaconda; I did not have it before trying to install legate and cunumeric.

conda_info.txt
conda_list.txt

Anaconda could not find that specific version of legate-core. I am not using mamba, but I don't think that matters? It looks like the version I have installed was 23.11, though.

PackagesNotFoundError: The following packages are not available from current channels:

  • legate.core=23.09*

Current channels:

Also, some more info (not sure if it's relevant): inside anaconda3/bin, if I type leg and tab-autocomplete, the only thing in there is legate-issue. It's almost as if the legate binary just disappeared. I'm not sure what happened there...

@manopapad
Contributor

I think what happened is that your original from-source installation (using install.py) wasn't properly cleaned up, and by later doing a pre-built installation on top of it you are now in a very weird state. Moreover, it appears you installed the packages on the "base" environment, which is not very easy to clean up.

I would suggest that you remove your entire anaconda installation, do a fresh anaconda install, then create a new child environment containing cunumeric.

conda/mamba create -n myenv -c nvidia -c conda-forge -c legate cunumeric

Hopefully conda picks the latest version automatically (23.09), but if it doesn't you can try specifying it explicitly (cunumeric=23.09).

@ejmeitz
Author

ejmeitz commented Nov 16, 2023

Sounds good, I will try that.

Do you have any idea why my original clean installation with anaconda (which was in a separate env) did not pick up GPU support? When I ran legate --info it showed no CUDA support, but the docs say it should build with GPU support by default. This is the main reason I ended up in this messed-up state, as I need the GPU and multi node support. Do NCCL, UCX or other libraries need to be installed for conda to pick up full support?

@manopapad
Contributor

and multi node support.

In that case the pre-built conda packages will not help (they only support single-node execution). You will need to do a from-source install unfortunately.

I suggest following the "basic build" instructions from https://github.com/nv-legate/legate.core/blob/branch-24.01/BUILD.md#basic-build. The base environment that generate-conda-envs.py creates should contain all requirements for a successful build (that's probably what was missing in your original attempt).

Based on your machine, I suggest creating a base environment as follows:

./scripts/generate-conda-envs.py --python 3.10 --ctk 12.0 --os linux --no-compilers --no-openmpi --ucx

This assumes your machine already has a C++ compiler and some MPI implementation.

@ejmeitz
Author

ejmeitz commented Nov 16, 2023

So this build script just gives an environment with all the dependencies and I am supposed to install the legate anaconda library on top of this environment by running the install.py? Or can I use the anaconda install command to put legate into this env?

Also, the environment fully re-installs CUDA and UCX through anaconda even though I have them both installed. Is that fine? I don't see why it would be a huge problem, but I am also unfamiliar with the anaconda versions of these packages.

@manopapad
Contributor

So this build script just gives an environment with all the dependencies and I am supposed to install the legate anaconda library on top of this environment by running the install.py? Or can I use the anaconda install command to put legate into this env?

The generate-conda-envs.py script creates an environment into which you can build from source (i.e. going through install.py).

If the pre-built cuNumeric conda package were sufficient for you (in your case it's not, because you want to do multi-node runs), then it should be sufficient to create a new environment containing just the pre-built package (using the conda create ... cunumeric command from the README), and conda should automatically pull in all dependencies.

Also, the environment fully re-installs CUDA

That should be fine, as long as you use the same version of CUDA in the conda environment as you have on your system. We use conda to pull some CUDA libraries that don't come standard in the CUDA SDK (e.g. cuTensor), and that unfortunately requires that we pull in a lot of other (potentially superfluous) dependencies.

and UCX

If you already have a version of UCX on your system that you want to reuse, then use an environment created with --no-ucx, and the legate installation should pick the system-wide UCX during build.
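
The CUDA version match described above (same version on the system as in the conda environment) can be checked before generating the environment. The sketch below parses the release number from nvcc --version style output and compares it with the value intended for --ctk. The helper names are hypothetical, the sample line is illustrative, and the regular expression assumes the usual "release X.Y" wording in nvcc's output.

```python
import re
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'X.Y' release number from nvcc --version style output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def matches_ctk(nvcc_output: str, ctk: str) -> bool:
    """True if the system toolkit release equals the requested --ctk value."""
    return parse_nvcc_release(nvcc_output) == ctk


# Illustrative sample; real text would come from running `nvcc --version`.
sample_output = "Cuda compilation tools, release 12.0, V12.0.76"
print("system CUDA release:", parse_nvcc_release(sample_output))
print("matches --ctk 12.0: ", matches_ctk(sample_output, "12.0"))
```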

@ejmeitz
Author

ejmeitz commented Nov 16, 2023

Thanks for all your help. I got legate.core built and everything looks good. I then cloned cunumeric and simply ran install.py with the legate anaconda environment activated. I get a slew of errors that look like some dependency is missing; it seems to have failed to link tblis. Any idea why this might happen?

  /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: lib/.libs/libtblis.so: undefined reference to `__atomic_compare_exchange_16'
  /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: lib/.libs/libtblis.so: undefined reference to `__atomic_store_16'
  /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: lib/.libs/libtblis.so: undefined reference to `__atomic_store'
  /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: lib/.libs/libtblis.so: undefined reference to `__atomic_load_16'
  /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: lib/.libs/libtblis.so: undefined reference to `__atomic_compare_exchange'
  /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: lib/.libs/libtblis.so: undefined reference to `__atomic_load'
  collect2: error: ld returned 1 exit status
  make[1]: *** [Makefile:1884: bin/test] Error 1
  make[1]: Leaving directory '/home/emeitz/software/cunumeric/_skbuild/linux-x86_64-3.11/cmake-build/_deps/tblis-src'

make: *** [Makefile:2581: install-recursive] Error 1
[1/142] Generate install_info.py
FAILED: _deps/tblis-build/lib/libtci.so _deps/tblis-build/lib/libtblis.so /home/emeitz/software/cunumeric/_skbuild/linux-x86_64-3.11/cmake-build/_deps/tblis-build/lib/libtci.so /home/emeitz/software/cunumeric/_skbuild/linux-x86_64-3.11/cmake-build/_deps/tblis-build/lib/libtblis.so

@ejmeitz
Author

ejmeitz commented Nov 16, 2023

Never mind, turns out when you upgrade gcc on CentOS the devtoolset does not come with libatomic.
I had to run sudo yum install devtoolset-11-libatomic-devel. If I have other questions about usage of cunumeric is the issues section of that github the best place or is there some kind of forum?
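
For anyone hitting the same undefined __atomic_* references: those symbols are provided by GCC's libatomic, so a missing libatomic is a plausible first thing to rule out. The following is only a rough sketch using the standard library's ctypes.util.find_library; it reports whether a libatomic shared library is discoverable on the system, which does not guarantee that the devtoolset linker can find the development files at build time. The helper name libatomic_status is hypothetical.

```python
from ctypes.util import find_library
from typing import Optional


def libatomic_status(found: Optional[str]) -> str:
    """Turn the result of find_library("atomic") into a readable message."""
    if found is None:
        return "libatomic not found; 16-byte __atomic_* calls may fail to link"
    return f"libatomic found as {found}"


# find_library returns a library name such as "libatomic.so.1", or None.
print(libatomic_status(find_library("atomic")))
```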

@manopapad
Contributor

Never mind, turns out when you upgrade gcc on CentOS the devtoolset does not come with libatomic.

I'm surprised tblis's configure step didn't complain earlier, but 🤷

If I have other questions about usage of cunumeric is the issues section of that github the best place or is there some kind of forum?

GitHub issues are currently the best place for build/run issues. If you'd like to discuss your overall use case in more detail, feel free to email [email protected].
