We design and implement Open-Sora, an initiative dedicated to efficiently producing high-quality video. We hope to make the model, tools and all details accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video generation. With Open-Sora, our goal is to foster innovation, creativity, and inclusivity within the field of content creation.
- [2024.04.25] 🤗 We released the Gradio demo for Open-Sora on Hugging Face Spaces.
- [2024.04.25] 🔥 We released Open-Sora 1.1, which supports 2s~15s, 144p to 720p, any-aspect-ratio text-to-image, text-to-video, image-to-video, video-to-video, and infinite-time generation. In addition, a full video processing pipeline is released. [checkpoints] [report]
- [2024.03.18] We released Open-Sora 1.0, a fully open-source project for video generation. Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with acceleration, inference, and more. Our model can produce 2s 512x512 videos with only 3 days of training. [checkpoints] [blog] [report]
- [2024.03.04] Open-Sora provides training with 46% cost reduction. [blog]
🔥 You can experience Open-Sora on our 🤗 Gradio application on Hugging Face. More samples are available in our Gallery.
2s 240×426 | 2s 240×426 |
---|---|
2s 426×240 | 4s 480×854 |
---|---|
16s 320×320 | 16s 224×448 | 2s 426×240 |
---|---|---|
OpenSora 1.0 Demo
Videos are downsampled to .gif for display. Click for original videos. Prompts are trimmed for display; see here for full prompts.
- 📍 Open-Sora 1.1 released. Model weights are available here. It is trained on videos of 0s~15s, 144p to 720p, and various aspect ratios. See our report 1.1 for more discussions.
- 🔧 Data processing pipeline v1.1 is released. An automatic processing pipeline from raw videos to (text, video clip) pairs is provided, including scene cutting $\rightarrow$ filtering (aesthetic, optical flow, OCR, etc.) $\rightarrow$ captioning $\rightarrow$ managing. With this tool, you can easily build your video dataset.
- ✅ Improved ST-DiT architecture includes RoPE positional encoding, QK norm, longer text length, etc.
- ✅ Support training with any resolution, aspect ratio, and duration (including images).
- ✅ Support image and video conditioning and video editing, and thus support animating images, connecting videos, etc.
- 📍 Open-Sora 1.0 released. Model weights are available here. With only 400K video clips and 200 H800 days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos. See our report 1.0 for more discussions.
- ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each stage.
- ✅ Support training acceleration including accelerated transformer, faster T5 and VAE, and sequence parallelism. Open-Sora improves training speed by 55% when training on 64x512x512 videos. Details are in acceleration.md.
- 🔧 Data preprocessing pipeline v1.0, including downloading, video cutting, and captioning tools. Our data collection plan can be found at datasets.md.
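The processing stages listed above (scene cutting, filtering, captioning, managing) can be pictured as a chain of functions over clip records. The following is only a minimal sketch of that idea: the `Clip` record, its score fields, the thresholds, and the placeholder captioner are all hypothetical and do not correspond to the actual tools in this repository.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Clip:
    """Hypothetical record for one clip produced by scene cutting."""
    path: str
    aesthetic: float      # assumed aesthetic score, higher is better
    flow: float           # assumed optical-flow (motion) score
    caption: str = ""


def filter_clips(clips: Iterable[Clip], min_aesthetic: float, min_flow: float) -> List[Clip]:
    """Keep clips whose scores reach both (illustrative) thresholds."""
    return [c for c in clips if c.aesthetic >= min_aesthetic and c.flow >= min_flow]


def caption_clips(clips: Iterable[Clip], captioner: Callable[[str], str]) -> List[Clip]:
    """Attach a caption to every clip using the supplied captioning function."""
    return [Clip(c.path, c.aesthetic, c.flow, captioner(c.path)) for c in clips]


def to_pairs(clips: Iterable[Clip]) -> List[Tuple[str, str]]:
    """Produce the final (text, video clip) pairs."""
    return [(c.caption, c.path) for c in clips]


# Toy input standing in for the output of scene cutting and scoring.
raw = [Clip("a.mp4", 5.2, 1.4), Clip("b.mp4", 3.1, 2.0), Clip("c.mp4", 6.0, 0.1)]
kept = filter_clips(raw, min_aesthetic=4.5, min_flow=0.5)
pairs = to_pairs(caption_clips(kept, captioner=lambda p: f"placeholder caption for {p}"))
print(pairs)
```

In this toy run only `a.mp4` passes both thresholds; the real pipeline uses model-based scorers and captioners instead of the stand-ins here.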
- ✅ We find that the VQ-VAE from VideoGPT has low quality and thus adopt a better VAE from Stability-AI. We also find that patching in the time dimension deteriorates the quality. See our report for more discussions.
- ✅ We investigate different architectures including DiT, Latte, and our proposed STDiT. Our STDiT achieves a better trade-off between quality and speed. See our report for more discussions.
- ✅ Support CLIP and T5 text conditioning.
- ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See commands.md for more instructions.
- ✅ Support inference with official weights from DiT, Latte, and PixArt.
- ✅ Refactor the codebase. See structure.md to learn the project structure and how to use the config files.
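The idea of viewing images as one-frame videos, mentioned in the list above, amounts to giving every image a time axis of length 1 so that images and videos share one input layout. The sketch below shows this on shapes only; the `(C, T, H, W)` layout and the helper name `as_video_shape` are assumptions for illustration, not the project's actual data-loading code.

```python
from typing import Tuple


def as_video_shape(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Return a (C, T, H, W) shape; an image (C, H, W) becomes a video with T = 1."""
    if len(shape) == 3:          # image: (C, H, W)
        c, h, w = shape
        return (c, 1, h, w)
    if len(shape) == 4:          # already a video: (C, T, H, W)
        return tuple(shape)
    raise ValueError(f"expected 3 or 4 dimensions, got {len(shape)}")


image_shape = as_video_shape((3, 256, 256))       # image gains a time axis of length 1
video_shape = as_video_shape((3, 16, 256, 256))   # 16-frame video is left unchanged
print(image_shape, video_shape)
```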
- Train a Video-VAE and adapt our model to the new VAE. [WIP]
- Scale model parameters and dataset size. [WIP]
- Incorporate a better scheduler, e.g., rectified flow in SD3. [WIP]
- Evaluation pipeline.
- Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, etc.). See the dataset for more information.
- Support image and video conditioning.
- Support variable aspect ratios, resolutions, durations.
- Installation
- Model Weights
- Inference
- Data Processing
- Training
- Evaluation
- Contribution
- Acknowledgement
Other useful documents and links are listed below.
- Report: report 1.1, report 1.0, acceleration.md
- Repo structure: structure.md
- Config file explanation: config.md
- Useful commands: commands.md
- Data processing pipeline and dataset: datasets.md
- Each data processing tool's README: dataset conventions and management, scene cutting, scoring, caption
- Evaluation: eval
- Gallery: gallery
# create a virtual env
conda create -n opensora python=3.10
# activate virtual environment
conda activate opensora
# install torch
# the command below is for CUDA 12.1, choose install commands from
# https://pytorch.org/get-started/locally/ based on your own CUDA version
pip install torch torchvision
# install flash attention (optional)
# set enable_flashattn=False in config to avoid using flash attention
pip install packaging ninja
pip install flash-attn --no-build-isolation
# install apex (optional)
# set enable_layernorm_kernel=False in config to avoid using apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
# install xformers
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .
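Since flash attention and apex are optional, it can help to check which of them is importable before deciding on `enable_flashattn` and `enable_layernorm_kernel`. The snippet below is a small sketch using only the standard library; the helper name `optional_kernel_flags` is hypothetical, and where the two flags sit inside a given config file is not shown here.

```python
import importlib.util


def optional_kernel_flags() -> dict:
    """Suggest values for the two optional-kernel flags based on installed packages."""
    # find_spec returns None when a top-level package cannot be found.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    has_apex = importlib.util.find_spec("apex") is not None
    return {
        "enable_flashattn": has_flash_attn,        # set False in config if not installed
        "enable_layernorm_kernel": has_apex,       # set False in config if apex is missing
    }


flags = optional_kernel_flags()
print(flags)
```

A `False` value means the corresponding flag should be set to `False` in the config, as described in the install comments above.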
Run the following command to build a docker image from Dockerfile provided.
docker build -t opensora ./docker
Run the following command to start the docker container in interactive mode.
docker run -ti --gpus all -v {MOUNT_DIR}:/data opensora
- create a virtual env
# create a virtual env
conda create -n opensora python=3.10
# activate virtual environment
conda activate opensora
# install torch
pip install torch==2.1.0 torchvision==0.16.0
- create NPU env
Please refer to the "PyTorch Framework Training Environment Preparation" guide (《Pytorch框架训练环境准备》).
# activate cann env
source ${cann_install_path}/ascend-toolkit/set_env.sh
# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .
Run the following command to build a docker image from Dockerfile provided.
docker build -t opensora ./docker
Run the following command to start the docker container in interactive mode.
docker run -ti --gpus all -v {MOUNT_DIR}:/data opensora
Resolution | Model Size | Data | #iterations | Batch Size | URL |
---|---|---|---|---|---|
mainly 144p & 240p | 700M | 10M videos + 2M images | 100k | dynamic | 🔗 |
144p to 720p | 700M | 500K HQ videos + 1M images | 4k | dynamic | 🔗 |
See our report 1.1 for more information.
Resolution | Model Size | Data | #iterations | Batch Size | GPU days (H800) | URL |
---|---|---|---|---|---|---|
16×512×512 | 700M | 20K HQ | 20k | 2×64 | 35 | 🔗 |
16×256×256 | 700M | 20K HQ | 24k | 8×64 | 45 | 🔗 |
16×256×256 | 700M | 366K | 80k | 8×64 | 117 | 🔗 |
Training orders: 16x256x256
Our model's weights are partially initialized from PixArt-α. The number of parameters is 724M. More information about training can be found in our report. More about the dataset can be found in datasets.md. HQ means high quality.
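As a quick consistency check, the GPU days listed in the table above add up to roughly the 200 H800 days quoted for Open-Sora 1.0. The figures are copied from the table; the dictionary keys are only descriptive labels.

```python
# GPU days (H800) per checkpoint, taken from the Open-Sora 1.0 table above.
gpu_days = {
    "16x256x256, 366K": 117,
    "16x256x256, 20K HQ": 45,
    "16x512x512, 20K HQ": 35,
}

total_gpu_days = sum(gpu_days.values())
print(total_gpu_days)  # 197
```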
🔥 You can experience Open-Sora on our 🤗 Gradio application on Hugging Face online.
If you want to deploy Gradio locally, this repository also provides a Gradio application. Use the following commands to start an interactive web application and experience video generation with Open-Sora.
pip install gradio spaces
python gradio/app.py
This will launch a Gradio application on your localhost. If you want to know more about the Gradio application, you can refer to the README file.
Since Open-Sora 1.1 supports inference with dynamic input size, you can pass the input size as an argument.
# text to video
python scripts/inference.py configs/opensora-v1-1/inference/sample.py --prompt "A beautiful sunset over the city" --num-frames 32 --image-size 480 854
See here for more instructions including text-to-image, image-to-video, video-to-video, and infinite time generation.
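When generating several videos at different sizes, it can be convenient to assemble the command above programmatically. The following sketch builds the same argument list using only the script path, config path, and flags shown in the command above. The helper name `build_inference_cmd` is hypothetical, and treating the two `--image-size` values as height followed by width is an assumption based on the `480 854` example.

```python
import shlex
from typing import List


def build_inference_cmd(prompt: str, num_frames: int, height: int, width: int) -> List[str]:
    """Assemble the text-to-video command shown above as an argument list."""
    return [
        "python", "scripts/inference.py",
        "configs/opensora-v1-1/inference/sample.py",
        "--prompt", prompt,
        "--num-frames", str(num_frames),
        # Assumed order: height first, then width.
        "--image-size", str(height), str(width),
    ]


cmd = build_inference_cmd("A beautiful sunset over the city", 32, 480, 854)
print(shlex.join(cmd))   # shell-quoted form, safe to copy into a terminal
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids quoting problems with prompts that contain spaces.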
We have also provided an offline inference script. Run the following commands to generate samples; the required model weights will be downloaded automatically. To change the sampling prompts, modify the txt file passed to --prompt-path. See here to customize the configuration.
# Sample 16x512x512 (20s/sample, 100 time steps, 24 GB memory)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x512x512.py --ckpt-path OpenSora-v1-HQ-16x512x512.pth --prompt-path ./assets/texts/t2v_samples.txt
# Sample 16x256x256 (5s/sample, 100 time steps, 22 GB memory)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path OpenSora-v1-HQ-16x256x256.pth --prompt-path ./assets/texts/t2v_samples.txt
# Sample 64x512x512 (40s/sample, 100 time steps)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt
# Sample 64x512x512 with sequence parallelism (30s/sample, 100 time steps)
# sequence parallelism is enabled automatically when nproc_per_node is larger than 1
torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt
The speed is tested on H800 GPUs. For inference with other models, see here for more instructions.
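The sequence parallelism mentioned in the commands above rests on splitting one long token sequence across devices so that each device handles only a slice. The sketch below illustrates just that splitting step in plain Python. It is a conceptual illustration under simplifying assumptions, not how Open-Sora or ColossalAI implements it; real implementations also exchange data between devices so that attention can see the full sequence. The function name `shard_sequence` is hypothetical.

```python
from typing import List, Sequence


def shard_sequence(tokens: Sequence[int], world_size: int) -> List[List[int]]:
    """Split a token sequence into `world_size` contiguous, near-equal shards."""
    base, extra = divmod(len(tokens), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)   # first `extra` ranks take one more token
        shards.append(list(tokens[start:start + size]))
        start += size
    return shards


tokens = list(range(10))
shards = shard_sequence(tokens, world_size=2)      # one shard per process
print(shards)
```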
To lower the memory usage, set a smaller vae.micro_batch_size in the config (at the cost of slightly lower sampling speed).
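The trade-off behind a smaller micro-batch size can be sketched as follows: the input is processed in chunks, so peak memory depends on the chunk size rather than the full batch, while the number of calls grows. This is a generic illustration with a toy stand-in for the VAE decoder; the function names are hypothetical and the actual VAE code in this repository may differ.

```python
from typing import Callable, List, Sequence


def decode_in_micro_batches(
    latents: Sequence[float],
    decode: Callable[[Sequence[float]], List[float]],
    micro_batch_size: int,
) -> List[float]:
    """Decode `latents` chunk by chunk so only one chunk is processed at a time."""
    if micro_batch_size < 1:
        raise ValueError("micro_batch_size must be at least 1")
    outputs: List[float] = []
    for start in range(0, len(latents), micro_batch_size):
        chunk = latents[start:start + micro_batch_size]
        outputs.extend(decode(chunk))       # peak memory scales with len(chunk)
    return outputs


calls: List[int] = []


def toy_decode(chunk: Sequence[float]) -> List[float]:
    """Toy stand-in for a decoder; records the chunk sizes it receives."""
    calls.append(len(chunk))
    return [x * 2 for x in chunk]


frames = decode_in_micro_batches([1, 2, 3, 4, 5], toy_decode, micro_batch_size=2)
print(frames, calls)
```

With a micro-batch size of 2, five inputs take three decoder calls instead of one, which is where the slightly lower sampling speed comes from.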
High-quality data is crucial for training good generation models. To this end, we establish a complete pipeline for data processing, which could seamlessly convert raw videos to high-quality video-text pairs. The pipeline is shown below. For detailed information, please refer to data processing. Also check out the datasets we use.
Once you prepare the data in a csv file, run the following commands to launch training on a single node or on multiple nodes.
# one node
torchrun --standalone --nproc_per_node 8 scripts/train.py \
configs/opensora-v1-1/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
# multiple nodes
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
configs/opensora-v1-1/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
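The csv file passed to --data-path can be written with Python's standard `csv` module. The sketch below is illustrative only: the column names `path` and `text` and the example file paths are assumptions, so check datasets.md for the columns the training script actually expects.

```python
import csv
import io

# Hypothetical rows: column names and file paths are assumptions for illustration.
rows = [
    {"path": "/data/clips/0001.mp4", "text": "A beautiful sunset over the city"},
    {"path": "/data/clips/0002.mp4", "text": "Waves rolling onto a sandy beach"},
]


def write_csv(rows, fileobj):
    """Write one header line followed by one line per (video path, caption) pair."""
    writer = csv.DictWriter(fileobj, fieldnames=["path", "text"])
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()          # in-memory file; use open("train.csv", "w", newline="") on disk
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```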
Once you prepare the data in a csv file, run the following commands to launch training on a single node.
# 1 GPU, 16x256x256
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x256.py --data-path YOUR_CSV_PATH
# 8 GPUs, 64x512x512
torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
To launch training on multiple nodes, prepare a hostfile according to ColossalAI, and run the following commands.
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
For training other models and advanced usage, see here for more instructions.
See here for more instructions.
Thanks go to these wonderful contributors:
If you wish to contribute to this project, please refer to the Contribution Guideline.
- ColossalAI: A powerful large model parallel acceleration and optimization system.
- DiT: Scalable Diffusion Models with Transformers.
- OpenDiT: An acceleration for DiT training. We adopt valuable acceleration strategies for the training process from OpenDiT.
- PixArt: An open-source DiT-based text-to-image model.
- Latte: An attempt to efficiently train DiT for video.
- StabilityAI VAE: A powerful image VAE model.
- CLIP: A powerful text-image embedding model.
- T5: A powerful text encoder.
- LLaVA: A powerful image captioning model based on Mistral-7B and Yi-34B.
We are grateful for their exceptional work and generous contribution to open source.