Thank you for sharing this benchmark. I'm attempting to reproduce the H2O benchmark results on non-Ampere GPUs without FlashAttention2 support.
I noticed that FlashAttention2 is mandatory in the current implementation. To run without it, I made the following modifications in inference.py:
Imported the standard attention module instead of FlashAttention2 (Lines 8-9)
Replaced FlashAttention2 with the standard attention implementation (Lines 15-17)
Removed FlashAttention2 assertion check (Line 24)
Changed attention type from "flash_attention_2" to "eager" (Line 28)
Added .lower() to all pipeline_config['model_name'] checks (Lines 78-82) [This one is important for code compatibility.] A rough sketch of these changes is shown below.
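For reference, here is a minimal sketch of what changes 4 and 5 might look like when the model is loaded through HuggingFace Transformers. This assumes a standard `from_pretrained` loading path and is not the repository's actual inference.py; `pipeline_config` is only stubbed here for illustration.

```python
# Minimal, hypothetical sketch of modifications 4 and 5 above, assuming a
# standard HuggingFace Transformers loading path; the real inference.py may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative config; the actual pipeline_config comes from the benchmark scripts.
pipeline_config = {"model_name": "mistralai/Mistral-7B-Instruct-v0.2"}

model = AutoModelForCausalLM.from_pretrained(
    pipeline_config["model_name"],
    torch_dtype=torch.float16,
    attn_implementation="eager",  # was "flash_attention_2"
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(pipeline_config["model_name"])

# Case-insensitive model-name check (the .lower() change from item 5),
# so names like "Mistral-7B-Instruct-v0.2" still match lowercase branches.
if "mistral" in pipeline_config["model_name"].lower():
    pass  # model-specific settings would go here
```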
However, when running the benchmark on the narrativeqa dataset, I'm getting a qa_f1_score of 0.0, which seems incorrect.
Could you please provide guidance on: 1) the proper way to run the H2O benchmark without FlashAttention2; 2) whether any additional modifications are needed to ensure correct evaluation.
Many thanks!!!
However, when I try to run mistral-7b-instruct-v0.2, it raises a significant issue:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3786.94 GiB. GPU 0 has a total capacity of 79.25 GiB of which 54.09 GiB is free. Including non-PyTorch memory, this process has 25.15 GiB memory in use. Of the allocated memory 21.11 GiB is allocated by PyTorch, and 3.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I didn't encounter this issue when enabling flash attention on mistral-7b-instruct-v0.2. Could you please have a look at what might be causing it?
It seems that the current mistral-7b-instruct-v0.2 setup only supports flash attention because the model definition forces the flash attention implementation. Why is that?
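For context, a forced FlashAttention2 setup in a loading script typically looks something like the hypothetical sketch below; the conditional shows the kind of relaxation being asked about. The structure and the flash_attn availability check are assumptions, not the repository's actual model definition.

```python
# Hypothetical illustration only; the repository's actual model definition may differ.
import importlib.util
import torch
from transformers import AutoModelForCausalLM

# A forced setup hard-codes "flash_attention_2" (often with an assert that
# flash_attn is installed); relaxing it means falling back to "eager" (or "sdpa")
# when flash_attn is unavailable, e.g. on non-Ampere GPUs.
use_flash = importlib.util.find_spec("flash_attn") is not None

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2" if use_flash else "eager",
    device_map="auto",
)
print(model.config._attn_implementation)  # confirm which attention path is active
```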