Why do the comparative experiments for I2V not use the same initial frame? #19

Open
Alex-ui1 opened this issue Aug 30, 2024 · 3 comments

Comments

@Alex-ui1

FancyVideo appears to be an I2V model, yet the comparison with other I2V models in Figure 4 does not use the same initial frame. Wouldn't this be considered somewhat unfair?

@MaAo (Collaborator) commented Sep 1, 2024

Hello, thank you for your interest in our work. That's a great question.
Our approach can be understood as a Text → Image → Video pipeline in which the Text-to-Image step is optional, so it also functions as an Image-to-Video workflow.
Not all methods illustrated in Figure 4 support image-to-video generation (e.g., AnimateDiff). We therefore generated the videos from the same text input to ensure a consistent basis for comparison.

@Alex-ui1 (Author) commented Sep 1, 2024

Thank you very much for your thoughtful and patient response! I greatly appreciate it. However, I would like to mention that, according to the paper, FancyVideo uses images generated by SDXL as the first frame. This approach evidently enhances image quality and also influences Text-Video Alignment. In my view, this is a significant factor contributing to FancyVideo's advantage on these two metrics.

I wonder if it might be perceived as somewhat unfair to compare it quantitatively and qualitatively with other I2V methods (such as DynamiCrafter, Gen-2, and Pika) without using the same initial frame. Additionally, as far as I understand, AnimateDiff can perform the I2V task by incorporating SparseCtrl. Thank you for considering my perspective!

@MaAo (Collaborator) commented Sep 2, 2024

Hello,

I am very pleased to receive your response and look forward to discussing this matter further.

  1. Our experiments have demonstrated that a higher-quality initial frame leads to improved video quality, motion, and consistency. This is why we used SDXL to generate the first frame in our paper. Currently, we recommend using a more advanced Text-to-Image (T2I) model, such as FLUX, to create the first frame, followed by I2V processing with FancyVideo (a sketch of this workflow follows after this list).

  2. In our paper, the I2V experiments are based on identical first-frame inputs generated by SDXL. Due to space constraints, we included only the metrics from the manual evaluations (see Figure 5) and omitted qualitative visual comparisons.
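
For reference, here is a minimal sketch of the T2I → I2V workflow described in point 1. The SDXL call uses the standard Hugging Face diffusers API; the prompt, the num_frames value, and the run_fancyvideo_i2v function are illustrative placeholders only, since FancyVideo's actual inference entry point is not shown in this thread.

```python
# Minimal sketch of the recommended T2I -> I2V workflow.
# SDXL is loaded via the real Hugging Face diffusers API; the
# run_fancyvideo_i2v function below is a hypothetical placeholder,
# not FancyVideo's actual inference interface.
import torch
from diffusers import StableDiffusionXLPipeline


def run_fancyvideo_i2v(image, prompt, num_frames=16):
    """Placeholder: substitute the repo's real image-to-video inference here."""
    raise NotImplementedError("plug in FancyVideo's I2V inference")


prompt = "a corgi running along a beach at sunset"  # illustrative prompt

# Step 1 (T2I): generate a high-quality first frame with SDXL
# (or a stronger T2I model such as FLUX).
t2i = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
first_frame = t2i(prompt=prompt).images[0]

# Step 2 (I2V): animate the generated frame, conditioned on the same prompt.
video_frames = run_fancyvideo_i2v(image=first_frame, prompt=prompt, num_frames=16)
```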
