long videos inference error #17

Open
cmhhw opened this issue Nov 8, 2024 · 14 comments

@cmhhw

cmhhw commented Nov 8, 2024

Hello, when I load the model for inference, the results on short videos meet expectations, but on long videos (the example provided in the project) the output is a series of special characters.

This is my example:

[screenshot]

@cmhhw
Author

cmhhw commented Nov 8, 2024

When I run inference.py, the output is just a long string of exclamation marks:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!...

Why?

@xiaoqian-shen
Collaborator

xiaoqian-shen commented Nov 10, 2024

Hi, @cmhhw I have locally tested the model on an A100 80G and it gives accurate output.

[screenshot]

  1. There is no inference.py in the repo. By inference.py, are you referring to the code in the README?
  2. Could you double-check the conda environment?
  3. Is there any memory overflow issue?
  4. Are you using the checkpoint we provided, LongVU_Qwen2_7B?

@beatriceadel

beatriceadel commented Nov 11, 2024

[screenshot]
I am also facing an error when using the inference code in the README. I tried running inference on some short videos, asking LongVU to describe them, but these characters show up instead of the actual descriptions; in most cases only '!' was output. Could you give me some insight into why this might happen?

@cmhhw
Author

cmhhw commented Nov 11, 2024

Hi, @cmhhw I have locally tested the model on an A100 80G and it gives accurate output. [screenshot]

  1. There is no inference.py in the repo. By inference.py, are you referring to the code in the README?
  2. Could you double-check the conda environment?
  3. Is there any memory overflow issue?
  4. Are you using the checkpoint we provided, LongVU_Qwen2_7B?

Yes, I use the code provided under "click for quick inference code," and the weights loaded are LongVU_Qwen2_7B, but I still encounter this issue. My environment: torch==2.5.0, python==3.10.15, CUDA 12.4, running on an A100. Since my GPU has 40GB of VRAM, I noticed that during inference the model is automatically split across multiple GPUs. What should I do to avoid this?

@xiaoqian-shen
Collaborator

Hi, @cmhhw, I am using torch==2.1.2 as shown in the conda env requirements.txt. You can set model.to('cuda:0').
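
For reference, a minimal sketch of loading the checkpoint and keeping the whole model on one GPU (the import path and checkpoint directory are assumptions based on the repo layout and the paths mentioned later in this thread; the load_pretrained_model signature follows the call shown further down):

    from longvu.builder import load_pretrained_model  # import path assumed

    # Load LongVU_Qwen2_7B and pin every module to a single device so the
    # weights are not automatically sharded across multiple GPUs.
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        "./checkpoints/longvu_qwen",  # assumed local path to the LongVU_Qwen2_7B weights
        None,
        "cambrian_qwen",
    )
    model.to("cuda:0")
    model.eval()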

@xiaoqian-shen
Collaborator

@beatriceadel are you using the correct tokenizer as we provided in LongVU_Qwen2_7B?

@cmhhw
Author

cmhhw commented Nov 15, 2024 via email

@beatriceadel

@beatriceadel are you using the correct tokenizer as we provided in LongVU_Qwen2_7B?

Yes, I double-checked and I did use the tokenizer provided on Hugging Face.

@xiaoqian-shen
Collaborator

Thank you very much. Following your suggestions, I have resolved the issue of garbled output on long videos. However, I ran into some questions while reading the "quick inference code". Here is the code:

    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    fps = float(vr.get_avg_fps())
    frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
    video = []
    for frame_index in frame_indices:
        img = vr[frame_index].asnumpy()
        video.append(img)

This code retrieves certain frame indices from the original video and appends them to the video list, which is then used as the model input. I am confused about how this method can help the model understand the entire video. Doesn't it lose a lot of information?

In this code we sample the video at 1 fps, which is already dense sampling compared with most previous baselines, which are limited to uniformly sampling ~64 frames.
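
For intuition, a back-of-the-envelope comparison of the frame budget (the clip length and frame rate below are made up purely for illustration):

    # Hypothetical 10-minute clip recorded at 30 fps.
    duration_s = 10 * 60                      # 600 seconds
    native_fps = 30
    total_frames = duration_s * native_fps    # 18,000 decoded frames in the file

    frames_at_1fps = duration_s               # the README sampling keeps ~1 frame per second -> ~600 frames
    frames_uniform_64 = 64                    # typical fixed budget of uniformly sampled frames

    print(frames_at_1fps, frames_uniform_64)  # 600 vs 64 frames fed to the model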

@tcm03

tcm03 commented Nov 16, 2024


Hi. I'm having the same problem with the model outputting a bunch of exclamation marks, like you, after I installed a newer version of PyTorch than the one in requirements.txt. I tried using the exact original requirements.txt (torch==2.1.2) but encountered an error about the method register_fake() not existing in torch.library.

UPDATE
I tried torch==2.1.2 with torchvision==0.16.2 and successfully avoided the error above. However, when I run this command to run inference with the model:

data_path = '/kaggle/input/entube/EnTube'
model_path = './checkpoints/longvu_qwen'
model_name = 'cambrian_qwen'
version = 'qwen'

!python -m EnTube.eval --data_path $data_path --model_path $model_path --model_name $model_name --version $version

the model still outputs !!!!!...
The only modification (due to GPU RAM limit) to the code in inference.py is the load_8bit=True argument in load_pretrained_model():

tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, None, model_name, load_8bit=True
    )

@xiaoqian-shen
Collaborator

Ok, I see. We have never tested with 8-bit. Maybe you need to switch to a GPU with more VRAM and run inference in float16.
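
A minimal sketch of that change, relative to the snippet above: drop the 8-bit flag so the weights load in the default half precision (whether the builder exposes an explicit dtype argument is not confirmed here, so only the flag is removed; expect roughly twice the 8-bit VRAM footprint for a 7B model):

    # Same call as above, but without load_8bit=True: the weights stay in the
    # default float16, which needs more VRAM but avoids the untested 8-bit path.
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, None, model_name
    )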

@yanlai00

Had the same issue where the model only produces exclamation marks on the provided llama-3B checkpoint, but things work fine on the provided llama-1B checkpoint.

@yanlai00

yanlai00 commented Nov 20, 2024

Actually, doing inference with image_token_len=576 seems to resolve the issue on the provided llama-3B checkpoint. I would like to hear more from the authors on how to set image_token_len (and other related config params) at training and inference time.

Update: the eval numbers are still much worse than the ones reported in the paper, though. I was only able to get 37.85 on MLVU for llama-3B (instead of 55.9 in the paper). Its capability on long videos still seems limited.

@xiaoqian-shen
Collaborator

@yanlai00 You should set image_token_len=144 for the video model. And please make sure you are using the video model LongVU_Llama3_2_3B instead of LongVU_Llama3_2_3B_img for video inference.
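
A hypothetical sketch of applying that setting before running video inference, assuming image_token_len is exposed on the loaded model's config (the exact attribute location is an assumption, not confirmed by this thread):

    # Hypothetical: set the value recommended above for the video model.
    model.config.image_token_len = 144  # 576 was tried above; 144 is the recommendation for LongVU_Llama3_2_3B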
