Prone to OOM when performing multi-frame inference #88
Comments
Thanks for your reply!
After running print(model): does the output suggest that FlashAttention is activated or not?
@voidchant Based on the output, it seems that the ViT model isn’t using flash attention. Could you try the following code instead?

    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id_or_path = "rhymes-ai/Aria"
    # Pin an older revision of the model repo that still supports flash attention.
    revision = "4844f0b5ff678e768236889df5accbe4967ec845"

    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        revision=revision,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        attn_implementation="flash_attention_2",
    )
    processor = AutoProcessor.from_pretrained(
        model_id_or_path,
        revision=revision,
        trust_remote_code=True,
    )

This code uses a slightly older version of the Aria model. The official Transformers repo has recently added support for the Aria model, but it doesn’t yet support flash attention. As a workaround, we can roll back to this older version for now.
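For checking whether flash attention is actually active after loading, reading the full print(model) dump by eye is error-prone. A minimal sketch of a helper is below. It rests on an assumption that is not confirmed in this thread: that the loaded implementation uses separate attention module classes with "flash" in the class name, as many Transformers model files did at the time. Implementations that select the attention kernel through the config instead of a dedicated class will not be detected this way. The names attention_classes and uses_flash_attention are hypothetical helpers, not part of any library.

```python
from collections import Counter


def attention_classes(model):
    """Count attention-like submodules by class name.

    `model` is any object exposing `named_modules()`, e.g. a torch.nn.Module.
    Matching on "attn"/"attention" in the class name is a heuristic (assumption).
    """
    counts = Counter()
    for _name, module in model.named_modules():
        cls = type(module).__name__
        lowered = cls.lower()
        if "attn" in lowered or "attention" in lowered:
            counts[cls] += 1
    return counts


def uses_flash_attention(model):
    """Return True if any attention-like class name contains "flash" (heuristic)."""
    return any("flash" in cls.lower() for cls in attention_classes(model))
```

Calling attention_classes(model) on the loaded model gives a compact summary of which attention classes are present in the language model and in the vision tower, which is easier to compare than two full model printouts.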
I'm running the code on Hugging Face with an 8xA100 (40 GB) GPU configuration, but I can only process up to ~4 frames before it runs out of memory (OOM).