Error for video SFT #14

yanlai00 · 2024-11-06T20:02:06Z

I got the error "AttributeError: 'VisionCrossAttentionLayer' object has no attribute 'pos_embed_0'" when doing supervised fine-tuning on a subset of the video dataset you provided, using the LongVU_Llama3_2_3B_img checkpoint on 1 node with 8 GPUs. Any insights on how to resolve this would be appreciated. Thanks.

xiaoqian-shen · 2024-11-07T06:07:53Z

Hi, you need to change the token length from 576 to 144, as described in the README, because we use at most 144 tokens to represent each video frame.

zzzz123-0708 · 2024-11-07T07:28:21Z

> Hi, you need to change the token from 576 to 144 as written in README here. Since we use maximum 144 tokens to represent each video frame.
The default config for the model I downloaded is 144. Do I need to change it to 576 when performing inference?

xiaoqian-shen · 2024-11-07T11:52:43Z

For the video model, please set it to 144 for both training and inference; for the image model, please set it to 576.
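For anyone scripting this change, here is a minimal sketch of how the setting could be updated in a checkpoint's config.json. The values 144 and 576 come from this thread; the field name `image_token_len` and the helper `set_token_len` are assumptions made for illustration, so check your own config.json for the actual key before using it.

```python
import json
from pathlib import Path

# Values stated in this thread: 144 for video models, 576 for image models.
TOKEN_LEN = {"video": 144, "image": 576}


def set_token_len(config_path, model_type, key="image_token_len"):
    """Set the token length in a checkpoint's config.json.

    `key` is an assumed field name, not confirmed by this thread.
    Returns the value that was stored before the update.
    """
    if model_type not in TOKEN_LEN:
        raise ValueError(f"model_type must be one of {sorted(TOKEN_LEN)}")

    path = Path(config_path)
    config = json.loads(path.read_text())

    # Fail loudly instead of silently adding a field the model never reads.
    if key not in config:
        raise KeyError(f"{key!r} not found in {path}")

    previous = config[key]
    config[key] = TOKEN_LEN[model_type]
    path.write_text(json.dumps(config, indent=2))
    return previous
```

Calling `set_token_len("path/to/config.json", "video")` before video SFT would switch an image checkpoint's value from 576 to 144, and the returned previous value makes it easy to log what was changed.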

yanlai00 · 2024-11-20T19:28:01Z

Thanks for your response! I noticed that in the provided llama-3.2-1B video model, the token length is set to 576. Is this intentional?

xiaoqian-shen · 2024-11-21T20:53:51Z

@yanlai00 Thanks for raising this concern. It should be 144. We have corrected it in LongVU_Llama3_2_1B.
