How to run on A100? #31

Open
ZSL98 opened this issue Mar 12, 2024 · 9 comments
@ZSL98

ZSL98 commented Mar 12, 2024

Is the Dockerfile in the latest_cuda_changes branch runnable on an A100? The container built from the Dockerfile in the main branch fails to run your AE fig7 script ('python run_orion.py') on my A100 machine and reports this error:

CUDA Runtime Error at: intercept_temp.cpp:453
Error 209, no kernel image is available for execution on the device
@ZSL98
Author

ZSL98 commented Mar 12, 2024

I have tried that Dockerfile, but the torch patch does not seem to be compatible with PyTorch version 7bcf7da3a268b435777fe87c7794c382f444e86d.

@ZSL98
Author

ZSL98 commented Mar 12, 2024

Can you provide the patch for a newer pytorch version? That would be helpful. Thanks!

@fotstrt
Contributor

fotstrt commented Mar 12, 2024

Hello, thank you for your comment! No, the Dockerfile is not ready yet. We are working on open-sourcing a version of Orion compatible with A100 GPUs. The AE fig7 experiment was run on a V100 GPU. I expect the version for A100 GPUs (supporting CUDA versions > 10.2) to be out in the next few weeks.
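For readers hitting the same message: error 209 is the CUDA runtime's cudaErrorNoKernelImageForDevice, which is raised when the loaded binary contains no kernel image usable on the current GPU. A V100 has compute capability 7.0, an A100 has 8.0, and an RTX 3070 has 8.6, and targeting 8.0 requires CUDA 11.0 or later. The sketch below illustrates the general CUDA compatibility rule only; the helper name `can_run` and the example architecture lists are hypothetical, and it is an assumption (not confirmed in this thread) that the main-branch build contains V100-only binaries.

```python
import re

def can_run(device_cc, compiled_archs):
    """Return True if any compiled target can execute on a device whose
    compute capability is `device_cc`, given as a (major, minor) tuple.

    `compiled_archs` uses nvcc-style names: "sm_XY" for native binaries
    (cubins) and "compute_XY" for embedded PTX.
    """
    for arch in compiled_archs:
        m = re.fullmatch(r"(sm|compute)_(\d+)(\d)", arch)
        if m is None:
            raise ValueError(f"unrecognized arch: {arch!r}")
        kind, major, minor = m.group(1), int(m.group(2)), int(m.group(3))
        if kind == "sm":
            # A cubin runs only on the same major version with an
            # equal or higher minor version.
            if device_cc[0] == major and device_cc[1] >= minor:
                return True
        else:
            # PTX can be JIT-compiled for any equal or newer capability.
            if device_cc >= (major, minor):
                return True
    return False

V100, A100, RTX_3070 = (7, 0), (8, 0), (8, 6)

if __name__ == "__main__":
    # Hypothetical build that ships only a V100 binary and no PTX.
    v100_only = ["sm_70"]
    for name, cc in [("V100", V100), ("A100", A100), ("RTX 3070", RTX_3070)]:
        print(name, "can run:", can_run(cc, v100_only))
```

Under that assumption, only the V100 passes the check, which would be consistent with the reports in this thread. On a machine with PyTorch installed, torch.cuda.get_device_capability() and torch.cuda.get_arch_list() report the device's capability and the architectures the installed PyTorch build targets, though they say nothing about Orion's own compiled code.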

@kzos

kzos commented Apr 23, 2024

I got the same error too, possibly on every GPU other than the V100.

I tried it on a 3070 and an A100; both give the same error (no kernel image available):

CUDA Runtime Error at: intercept_temp.cpp:453
Error 209, no kernel image is available for execution on the device
python3.8: intercept_temp.h:805: void check(T, const char*, const char*, int) [with T = cudaError]: Assertion `err == cudaSuccess' failed.
Aborted (core dumped)


@fotstrt, could you please advise whether the non-Docker path works for the 3070 and A100?

TIA.

@fotstrt
Contributor

fotstrt commented Apr 24, 2024

This will be addressed in the following 2 weeks. Thank you!

@jiashu-z

jiashu-z commented Jun 4, 2024

> This will be addressed in the following 2 weeks. Thank you!

Hi, I wonder whether this has been addressed yet. I would like to try Orion on CUDA 12.1. Could you please point me to the correct branch? Is it fot/latest_cuda_changes? Thanks!

@fotstrt
Contributor

fotstrt commented Jun 4, 2024

Hello, there has been a delay, sorry about that.

The branch fot/latest_cuda_changes contains a Dockerfile: https://github.com/eth-easl/orion/blob/fot/latest_cuda_changes/setup/Dockerfile_Cuda12. I have tested some basic Orion functionality with it, but I have not fully tested the system yet (which is why it is not merged).

I plan to do more tests and merge soon.

@jzxycsjzy

> Hello, there has been a delay, sorry about that.
>
> The branch fot/latest_cuda_changes contains a Dockerfile: https://github.com/eth-easl/orion/blob/fot/latest_cuda_changes/setup/Dockerfile_Cuda12 where i have tested some basic Orion functionality, but not fully tested the system yet (that's why it is not merged).
>
> I plan to do more tests and merge soon.

Thanks for your reply. I tried this Dockerfile to build a CUDA 12.1 image, but the build reports many errors. I created the container following the guidance in INSTALL.md, but it does not work.

@jzxycsjzy

Moreover, these errors occur because the cuDNN library is not linked or installed correctly.
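One quick way to narrow down a cuDNN problem like this is to check, from inside the container, whether the dynamic loader can locate and open the library at all. The snippet below is a minimal sketch using only the Python standard library; the helper name `find_cudnn` is hypothetical, and a successful load here does not prove that Orion or PyTorch was linked against that same cuDNN build.

```python
import ctypes
import ctypes.util

def find_cudnn():
    """Try to locate and load the cuDNN shared library.

    Returns a (name, status) pair, where `name` is the library name found
    by the platform's search rules, or None if nothing was found.
    """
    name = ctypes.util.find_library("cudnn")
    if name is None:
        return None, "not found"
    try:
        # Loading also resolves the library's own dependencies, so a
        # missing CUDA runtime library surfaces here as an OSError.
        ctypes.CDLL(name)
    except OSError as exc:
        return name, f"found but failed to load: {exc}"
    return name, "loaded"

if __name__ == "__main__":
    print(find_cudnn())
```

If this reports "not found", the library is likely either not installed in the image or installed in a directory the loader does not search (for example, one missing from LD_LIBRARY_PATH on Linux).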
