Text-to-video-generation-model

The damo-vilab text-to-video model leverages a multi-stage diffusion architecture to transform textual descriptions into corresponding video sequences.

Exclusively designed for English input, the model comprises three integral sub-networks (a short sketch of how they surface in code follows this list):

  • Text feature extraction model
  • Text feature-to-video latent space diffusion model
  • Video latent space to video visual space model
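
The diffusers port of this model exposes each sub-network as a pipeline attribute. A minimal sketch, assuming the "damo-vilab/text-to-video-ms-1.7b" checkpoint and the diffusers library:

```python
import torch
from diffusers import DiffusionPipeline

# Load the multi-stage pipeline in half precision (assumes a CUDA-capable GPU).
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)

# The three sub-networks listed above map onto pipeline components:
print(type(pipe.text_encoder).__name__)  # text feature extraction (CLIP text encoder)
print(type(pipe.unet).__name__)          # text-feature-to-video latent space diffusion (UNet3D)
print(type(pipe.vae).__name__)           # video latent space -> video visual space decoder
```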


The model employs a UNet3D structure for its diffusion process, generating videos iteratively by denoising a video-shaped sample of pure Gaussian noise. It also applies across diverse scenarios, as it can reason about and generate videos from arbitrary English text descriptions.
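
Continuing the sketch above, generation is a single call; `num_inference_steps` controls how many denoising iterations run (the prompt below is a hypothetical example):

```python
pipe.enable_model_cpu_offload()  # optional: trades speed for lower GPU memory use

prompt = "A panda eating bamboo on a rock"  # hypothetical example prompt
result = pipe(prompt, num_inference_steps=25)

# Depending on the diffusers version, `result.frames` is either a flat list of
# frames or a batch of per-prompt frame lists (newer versions: frames[0]).
video_frames = result.frames[0]
```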

The generated output is saved in MP4 format to a caller-supplied save path. VLC media player handles playback seamlessly; note that some other media players may not render the generated content correctly.
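
A minimal sketch of saving the result, assuming the `video_frames` list from the previous sketch and diffusers' export helper:

```python
from diffusers.utils import export_to_video

# Writes an MP4 to the given save path; VLC plays the default codec reliably.
video_path = export_to_video(video_frames, output_video_path="output.mp4")
print(video_path)
```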

TEXT FEATURE EXTRACTION MODEL

The initial stage employs advanced natural language processing (NLP) techniques to meticulously extract nuanced features from the input text, laying a robust foundation for subsequent video generation.
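
In the diffusers port, this stage is the pipeline's tokenizer plus CLIP text encoder. A hedged sketch, reusing the `pipe` object from the earlier sketches:

```python
import torch

text_inputs = pipe.tokenizer(
    "A panda eating bamboo on a rock",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    # Last hidden state: one feature vector per token, used to condition diffusion.
    text_features = pipe.text_encoder(text_inputs.input_ids.to(pipe.device))[0]
print(text_features.shape)  # (batch, sequence_length, hidden_dim)
```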

LATENT SPACE DIFFUSION MODEL

The model's core strength lies in its multi-stage diffusion process, a unique approach that iteratively transforms text features into videos by denoising a pure-Gaussian-noise video. The incorporation of a UNet3D structure adds efficiency to this diffusion process.
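
Schematically, one iteration of that denoising loop looks as follows (a sketch in diffusers conventions; `latents`, `t`, and `text_features` are assumed to come from the surrounding sampling loop):

```python
# The UNet3D predicts the noise present in the current video latents, and the
# scheduler uses that prediction to produce slightly less noisy latents.
noise_pred = pipe.unet(
    latents,  # shape (batch, channels, frames, height, width)
    t,        # current diffusion timestep
    encoder_hidden_states=text_features,
).sample
latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample
```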

VIDEO VISUAL SPACE MODEL

The final stage elegantly translates the refined latent space into the video visual space, resulting in a coherent and visually compelling output that faithfully reflects the essence of the input text.
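
A hedged sketch of this decoding stage, assuming the `latents` and `pipe` names from the earlier sketches: the frame axis is folded into the batch axis because the VAE decoder works on single images.

```python
latents = latents / pipe.vae.config.scaling_factor  # undo latent scaling

b, c, f, h, w = latents.shape
# (batch, channels, frames, h, w) -> (batch * frames, channels, h, w)
flat = latents.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
images = pipe.vae.decode(flat).sample            # RGB frames in pixel space
video = images.reshape(b, f, *images.shape[1:])  # (batch, frames, 3, H, W)
```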

GENERATION FROM TEXT TO VIDEOS

The model tackles the intricate task of turning descriptions into videos by proposing a multi-step approach that synergizes Natural Language Processing (NLP) and Computer Vision (CV) components, exemplifying a holistic understanding of both domains.

  1. NLP PART: The user-provided text undergoes a meticulous journey, including segmentation into sentences, entity extraction, and engagement with a Named Entity Module, showcasing a nuanced understanding of textual structure (a sketch of these steps follows this list).
  2. CV PART: The Computer Vision (CV) part encompasses the complete spectrum, from collecting text for video generation to model selection (e.g., CRAFT, TFGAN, GODIVA), dataset division, training, testing, and optimization. This approach ensures meaningful and contextually rich video creation.
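
The README does not name a specific NLP toolkit; a minimal sketch of the segmentation and entity-extraction steps, assuming spaCy:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (an assumption)
doc = nlp("A panda eats bamboo in a Chengdu reserve while light rain falls.")

sentences = [sent.text for sent in doc.sents]            # sentence segmentation
entities = [(ent.text, ent.label_) for ent in doc.ents]  # named-entity extraction
print(sentences)
print(entities)  # e.g. [('Chengdu', 'GPE')]
```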
