Not enough SMs on RTX 2080 #117

Open · qub3s opened this issue Apr 25, 2024 · 4 comments

qub3s commented Apr 25, 2024

Hey,

I know you optimized this project for the A100, and I read that people got the 4090 and the 3090 running. I only have access to 2080s (university hardware).

When I try to run your code (amg_example.py), I'm getting the following errors:

torch._inductor.utils: [WARNING] not enough SMs to use max_autotune_gemm mode

followed by a long stretch of generated code, and then:
BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-76618e, line 149; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-76618e, line 149; error : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
(.....)
ptxas /tmp/compile-ptx-src-76618e, line 200; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-76618e, line 200; error : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas fatal : Ptx assembly aborted due to error

Is this just a limitation of my hardware, or am I doing something wrong?

PS: The original model runs fine, and your project also runs if I use "sam_model_registry" (I guess that is just the Meta implementation).

Thank you.

cpuhrsch (Contributor) commented

Hey @qub3s, in this case your GPU doesn't support bfloat16. You'd need to change the model to use float16 (potentially slightly worse accuracy) or float32 (much slower). sm_80 here refers to the architecture version; the 20 series uses the Turing architecture, which I think is sm_75.
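For illustration, a minimal sketch of what that switch might look like (assuming the fast registry mirrors the upstream segment_anything loading API; the checkpoint path is a placeholder):

```python
import torch
from segment_anything_fast import sam_model_fast_registry

# Sketch only: cast the model to float16 instead of bfloat16, since
# pre-sm_80 GPUs (Turing is sm_75) have no native bf16 support.
# The checkpoint filename below is a placeholder.
sam = sam_model_fast_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam = sam.to(device="cuda", dtype=torch.float16)  # or torch.float32 (much slower)
```

Inputs fed to the model would need the matching dtype, and the registry may handle dtype internally, so treat this as the general idea rather than an exact patch.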

qub3s (Author) commented Apr 26, 2024

Thank you for the quick response; the model runs fine with float32 or float16.

The only strange thing is that the execution time seems to increase with the batch size when using float32.

Units are ms per image at an image size of ~500x500 pixels, 10 images per run:

| batch size | float32 (ms/image) | float16 (ms/image) |
|-----------:|-------------------:|-------------------:|
| 1          | 1908.6171875       | 1040.30185546875   |
| 2          | 1849.7623046875    | 976.1693359375     |
| 3          | 4087.425390625     | 996.29150390625    |
| 4          | out of memory      | 961.133984375      |
| 5          | n/a                | 940.1287109375     |
| 6          | n/a                | 996.61171875       |
| 7          | n/a                | out of memory      |
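(For reference, a hypothetical sketch of how such per-image timings might be collected; this is not the measurement code actually used above:)

```python
import time
import torch

def ms_per_image(run_batch, images, batch_size):
    """Hypothetical helper: average wall-clock ms per image.

    `run_batch` is assumed to process a list of images in one forward pass.
    """
    torch.cuda.synchronize()  # make sure pending GPU work doesn't skew the start time
    start = time.perf_counter()
    for i in range(0, len(images), batch_size):
        run_batch(images[i:i + batch_size])
    torch.cuda.synchronize()  # wait for all GPU work before stopping the clock
    return (time.perf_counter() - start) * 1000 / len(images)
```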

Anyway, thanks again.

cpuhrsch (Contributor) commented

@qub3s - Ah, yes, that's expected: a bigger batch means you need to allocate more memory throughout model execution. Also see sam_model_fast_registry (but you'd still need to switch to float16). That said, I think torch.compile might not help that much on an RTX 2080. I'd be curious, though.
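One way to see that batch-size memory effect directly (a hypothetical check, not project code):

```python
import torch

# Hypothetical sketch: record peak GPU memory for a given batch size to
# see how allocation grows with the batch (and eventually runs out of
# memory on an 8 GB RTX 2080).
torch.cuda.reset_peak_memory_stats()
# ... run the mask generator on one batch of images here ...
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```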

JamesHOEEEE commented

Hi all,

I was trying to run amg_example.py on a 2080 Ti too. I know the Triton kernel only supports the A100, so according to the README you need to set the environment variable SEGMENT_ANYTHING_FAST_USE_FLASH_4=0.

Here is my code:

import os  # lowercase: `import OS` fails, since Python module names are case-sensitive
os.environ['SEGMENT_ANYTHING_FAST_USE_FLASH_4'] = '0'  # no leading space in the key

but it still fails with the missing Triton module error.
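(One possible culprit, stated as an assumption: if the library reads the flag at import time rather than at call time, it has to be set before segment_anything_fast is imported:)

```python
import os

# Assumption, not verified: set the flag before the library import,
# in case segment_anything_fast reads it at import time.
os.environ['SEGMENT_ANYTHING_FAST_USE_FLASH_4'] = '0'

from segment_anything_fast import sam_model_fast_registry  # imported after the flag is set
```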

Did I do something wrong? Or do you have any suggestions?

Thank you.
