Issues on running H2O benchmark without FlashAttention2 #3

Open
fantasysee opened this issue Oct 22, 2024 · 3 comments

Comments

@fantasysee

Thank you for sharing this benchmark. I'm attempting to reproduce the H2O benchmark results on non-Ampere GPUs without FlashAttention2 support.

I noticed that FlashAttention2 is mandatory in the current implementation. To run without it, I made the following modifications in inference.py (a minimal sketch of the changes follows the list):

  • Imported the standard attention module instead of FlashAttention2 (Lines 8-9)
  • Replaced FlashAttention2 with the standard attention implementation (Lines 15-17)
  • Removed the FlashAttention2 assertion check (Line 24)
  • Changed the attention type from "flash_attention_2" to "eager" (Line 28)
  • Added .lower() to all pipeline_config['model_name'] checks (Lines 78-82) [This one is important for code compatibility.]
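
For reference, here is a minimal sketch of what those changes amount to, assuming the model is loaded through Hugging Face transformers. The function name load_model and the placeholder pipeline_config dict are illustrative, not the repo's actual code, and the exact line numbers may differ in your copy of inference.py:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_name: str, use_flash_attention: bool = False):
    # "eager" selects the standard PyTorch attention implementation, so no
    # FlashAttention2 (and therefore no Ampere-or-newer GPU) is required.
    attn_implementation = "flash_attention_2" if use_flash_attention else "eager"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        attn_implementation=attn_implementation,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer


# Placeholder config for illustration only.
pipeline_config = {"model_name": "meta-llama/Meta-Llama-3-8B-Instruct"}

# Case-insensitive model-name checks (the .lower() change from the last bullet).
model_name = pipeline_config["model_name"]
if "llama" in model_name.lower() or "mistral" in model_name.lower():
    model, tokenizer = load_model(model_name, use_flash_attention=False)
```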

However, when running the benchmark on the narrativeqa dataset, I'm getting a qa_f1_score of 0.0, which seems incorrect.

Could you please provide guidance on: 1) the proper way to run the H2O benchmark without FlashAttention2; 2) whether any additional modifications are needed to ensure correct evaluation.

Many thanks!!!

@fantasysee
Author

FYI, attached is my modification: inference.txt

Regards!

@fantasysee
Author

Oops... This one can be successfully executed w/o flash attention on llama3-8b-instruct!

inference_wo_flash_attention.txt

However, when I try to run mistral-7b-instruct-v0.2, it raises a significant issue:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3786.94 GiB. GPU 0 has a total capacity of 79.25 GiB of which 54.09 GiB is free. Including non-PyTorch memory, this process has 25.15 GiB memory in use. Of the allocated memory 21.11 GiB is allocated by PyTorch, and 3.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I didn't encounter this issue when enabling flash attention on mistral-7b-instruct-v0.2. Could you please take a look at what might be causing it?
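
For what it's worth, a rough back-of-the-envelope sketch (assumed shapes, not taken from the repo) of why the eager path can request thousands of GiB: without FlashAttention2, standard attention materializes a full [batch, heads, seq_len, seq_len] score matrix, so memory grows quadratically with context length, whereas FlashAttention2 never allocates that matrix.

```python
# Illustrative only: memory for the full attention-score matrix that eager
# attention materializes, using assumed values (batch=1, 32 heads, fp16).
def eager_scores_gib(seq_len: int, num_heads: int = 32, batch: int = 1,
                     bytes_per_elem: int = 2) -> float:
    return batch * num_heads * seq_len * seq_len * bytes_per_elem / 2**30


for seq_len in (8_192, 32_768, 131_072):
    print(f"seq_len={seq_len:>7,}: ~{eager_scores_gib(seq_len):,.0f} GiB")
# seq_len=  8,192: ~4 GiB
# seq_len= 32,768: ~64 GiB
# seq_len=131,072: ~1,024 GiB
```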

@fantasysee
Author

It seems that the current mistral-7b-instruct-v0.2 only supports the flash attention version because the model definition is hard-coded to use flash attention. Why is that?
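
In case it helps narrow this down, here is a quick way to check which attention path transformers actually instantiates for the model before the repo's own H2O patching runs. Note that _attn_implementation is a private attribute of recent transformers versions and may differ in older releases:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    attn_implementation="eager",   # explicitly request the non-flash path
)

# Which implementation the config ended up with ("eager", "sdpa", or "flash_attention_2").
print(model.config._attn_implementation)
# Which attention class was actually built for the first decoder layer.
print(type(model.model.layers[0].self_attn).__name__)
```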
