Issues on running H2O benchmark without FlashAttention2 #3

Open
fantasysee opened this issue Oct 22, 2024 · 3 comments

Comments

@fantasysee

Thank you for sharing this benchmark. I'm attempting to reproduce the H2O benchmark results on non-Ampere GPUs without FlashAttention2 support.

I noticed that FlashAttention2 is mandatory in the current implementation. To run without it, I made the following modifications in inference.py (a minimal sketch of the changes follows the list):

  • Imported the standard attention module instead of FlashAttention2 (Lines 8-9)
  • Replaced FlashAttention2 with the standard attention implementation (Lines 15-17)
  • Removed the FlashAttention2 assertion check (Line 24)
  • Changed the attention type from "flash_attention_2" to "eager" (Line 28)
  • Added .lower() to all pipeline_config['model_name'] checks (Lines 78-82) [This one is important for code compatibility.]
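
For reference, here is a minimal sketch of what those changes amount to, assuming the model is loaded through Hugging Face transformers. The function name load_model and the placeholder pipeline_config dict are illustrative, not the repo's actual code, and the exact line numbers may differ in your copy of inference.py:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_name: str, use_flash_attention: bool = False):
    # "eager" selects the standard PyTorch attention implementation, so no
    # FlashAttention2 (and therefore no Ampere-or-newer GPU) is required.
    attn_implementation = "flash_attention_2" if use_flash_attention else "eager"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        attn_implementation=attn_implementation,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer


# Placeholder config for illustration only.
pipeline_config = {"model_name": "meta-llama/Meta-Llama-3-8B-Instruct"}

# Case-insensitive model-name checks (the .lower() change from the last bullet).
model_name = pipeline_config["model_name"]
if "llama" in model_name.lower() or "mistral" in model_name.lower():
    model, tokenizer = load_model(model_name, use_flash_attention=False)
```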

However, when running the benchmark on the narrativeqa dataset, I'm getting a qa_f1_score of 0.0, which seems incorrect.

Could you please provide guidance on: 1) the proper way to run the H2O benchmark without FlashAttention2; 2) whether any additional modifications are needed to ensure correct evaluation.

Many thanks!!!

@fantasysee
Author

FYI, attached is my modification: inference.txt

Regards!

@fantasysee
Author

Oops... This one can be successfully executed w/o flash attention on llama3-8b-instruct!

inference_wo_flash_attention.txt

However, when I try to run mistral-7b-instruct-v0.2, it raises a significant issue:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3786.94 GiB. GPU 0 has a total capacity of 79.25 GiB of which 54.09 GiB is free. Including non-PyTorch memory, this process has 25.15 GiB memory in use. Of the allocated memory 21.11 GiB is allocated by PyTorch, and 3.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I didn't encounter this issue when enabling flash attention on mistral-7b-instruct-v0.2. Could you please take a look at what might be causing it?
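
For what it's worth, a rough back-of-the-envelope sketch (assumed shapes, not taken from the repo) of why the eager path can request thousands of GiB: without FlashAttention2, standard attention materializes a full [batch, heads, seq_len, seq_len] score matrix, so memory grows quadratically with context length, whereas FlashAttention2 never allocates that matrix.

```python
# Illustrative only: memory for the full attention-score matrix that eager
# attention materializes, using assumed values (batch=1, 32 heads, fp16).
def eager_scores_gib(seq_len: int, num_heads: int = 32, batch: int = 1,
                     bytes_per_elem: int = 2) -> float:
    return batch * num_heads * seq_len * seq_len * bytes_per_elem / 2**30


for seq_len in (8_192, 32_768, 131_072):
    print(f"seq_len={seq_len:>7,}: ~{eager_scores_gib(seq_len):,.0f} GiB")
# seq_len=  8,192: ~4 GiB
# seq_len= 32,768: ~64 GiB
# seq_len=131,072: ~1,024 GiB
```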

@fantasysee
Author

It seems that the current mistral-7b-instruct-v0.2 only supports the flash attention version because the model definition is hard-coded to use flash attention. Why is that?
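
In case it helps narrow this down, here is a quick way to check which attention path transformers actually instantiates for the model before the repo's own H2O patching runs. Note that _attn_implementation is a private attribute of recent transformers versions and may differ in older releases:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    attn_implementation="eager",   # explicitly request the non-flash path
)

# Which implementation the config ended up with ("eager", "sdpa", or "flash_attention_2").
print(model.config._attn_implementation)
# Which attention class was actually built for the first decoder layer.
print(type(model.model.layers[0].self_attn).__name__)
```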
