Why do the comparative experiments for I2V not use the same initial frame? #19
Comments
Hello, thank you for your interest in our work. That's a great question.

Thank you very much for your thoughtful and patient response! I greatly appreciate it. However, I would like to point out that, according to the paper, FancyVideo uses images generated by SDXL as the first frame. This approach evidently enhances image quality and also influences Text-Video Alignment, and in my view it is a significant factor behind FancyVideo's advantage on these two metrics. It might therefore be seen as somewhat unfair to compare it quantitatively and qualitatively with other I2V methods (such as DynamiCrafter, Gen-2, and Pika) without using the same initial frame. Additionally, as far as I understand, AnimateDiff can also perform the I2V task by incorporating SparseCtrl. Thank you for considering my perspective!
Hello, I am very pleased to receive your response and look forward to discussing this matter further.

It seems that FancyVideo is an I2V model, yet in Figure 4 it does not use the same initial frame when being compared with other I2V models. Wouldn't this be somewhat unfair?