Error for video SFT #14
Hi, you need to change the token count from 576 to 144, as written in the README here, since we use at most 144 tokens to represent each video frame.

For the video model, please set it to 144 for training and inference; for the image model, please set it to 576.
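A minimal sketch of the distinction described above. The function and parameter names here are hypothetical for illustration, not LongVU's actual config keys:

```python
# Hypothetical sketch: the per-frame token budget depends on modality.
# "tokens_per_frame" and "model_type" are illustrative names, not
# LongVU's actual configuration keys.
def tokens_per_frame(model_type: str) -> int:
    # Video models compress each frame to at most 144 tokens;
    # image models keep the full 576 tokens per image.
    if model_type == "video":
        return 144
    elif model_type == "image":
        return 576
    raise ValueError(f"unknown model_type: {model_type!r}")

tokens_per_frame("video")  # 144
tokens_per_frame("image")  # 576
```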
Thanks for your response! I noticed that in the provided llama-3.2-1B video model, the token length is set to 576. Is this intentional?
@yanlai00 Thanks for raising this concern. It should be 144. We have corrected it in LongVU_Llama3_2_1B.
I got the error "AttributeError: 'VisionCrossAttentionLayer' object has no attribute 'pos_embed_0'" when doing supervised fine-tuning on a subset of the video dataset you provided, using the LongVU_Llama3_2_3B_img checkpoint on 1 node with 8 GPUs. Any insights on how to resolve this are appreciated. Thanks.
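For context on how this kind of AttributeError can arise: if a layer creates attributes like `pos_embed_0` dynamically in its constructor, gated on a config value, then loading or running with a mismatched config (e.g. an image checkpoint with video fine-tuning settings) means the attribute was never created. A minimal sketch, assuming such conditional attribute registration (the class below is a stand-in, not LongVU's actual implementation):

```python
# Hypothetical sketch: attributes named pos_embed_0, pos_embed_1, ...
# only exist if the constructor creates them, so a config mismatch
# between checkpoint and training setup can raise AttributeError.
class VisionCrossAttentionLayer:
    def __init__(self, num_pos_embeds: int):
        # Dynamically register one positional embedding per index.
        for i in range(num_pos_embeds):
            setattr(self, f"pos_embed_{i}", [0.0])

# Constructed with zero embeddings, the attribute is simply absent:
layer = VisionCrossAttentionLayer(num_pos_embeds=0)
hasattr(layer, "pos_embed_0")  # False; accessing it raises AttributeError

# Constructed with a matching config, the attribute exists:
layer_ok = VisionCrossAttentionLayer(num_pos_embeds=1)
hasattr(layer_ok, "pos_embed_0")  # True
```

In practice this suggests checking that the checkpoint (image vs. video) matches the training config, and that the token count is set consistently with the modality, before fine-tuning.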