Yuedong Chen · Chuanxia Zheng · Haofei Xu · Bohan Zhuang · Andrea Vedaldi · Tat-Jen Cham · Jianfei Cai
To get started, create a conda virtual environment using Python 3.10+ and install the requirements:
conda create -n mvsplat360 python=3.10
conda activate mvsplat360
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 xformers==0.0.25.post1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
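After installation, it can be useful to confirm that the pinned packages resolved to the expected versions. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of this repository, and the pins are copied from the install command above. Wheels from the CUDA index typically carry a local suffix such as `+cu118`, so the sketch strips it before comparing.

```python
from importlib import metadata

# Versions copied from the pip command above.
PINS = {
    "torch": "2.2.2",
    "torchvision": "0.17.2",
    "torchaudio": "2.2.2",
    "xformers": "0.0.25.post1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop a local version suffix such as "+cu118" before comparing.
        ok = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, ok)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(PINS).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```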
This project mainly uses the DL3DV and RealEstate10K datasets.
The dataset structure aligns with our previous work, MVSplat. You may refer to the script convert_dl3dv.py for converting the DL3DV-10K dataset to the torch chunks used in this project.
You might also want to check out DepthSplat's DATASETS.md, which provides detailed instructions on pre-processing DL3DV and RealEstate10K for use here (both projects build on the pixelSplat code base).
A pre-processed tiny subset of DL3DV (containing 5 scenes) is provided here for quick reference. To use it, simply download it and unzip it to datasets/dl3dv_tiny.
To render novel views:
- Get the pre-trained model dl3dv_480p.ckpt and save it to the checkpoints/ directory.
- Run the following:
# dl3dv; requires at least 22G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 \
wandb.name=dl3dv_480P_ctx5_tgt56 \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt
- The rendered novel views will be stored under outputs/test/{wandb.name}.
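For scripted runs (for example, looping over several dataset roots), the test command above can be assembled in Python. This is only a sketch: `build_test_command` and `expected_output_dir` are hypothetical helper names, not part of this repository, and the overrides are exactly those shown in the command above.

```python
import shlex
from pathlib import Path


def build_test_command(experiment, wandb_name, dataset_root, checkpoint):
    """Assemble the argument list for a test-mode run of src.main."""
    return [
        "python", "-m", "src.main",
        f"+experiment={experiment}",
        f"wandb.name={wandb_name}",
        "mode=test",
        "dataset/view_sampler=evaluation",
        f"dataset.roots=[{dataset_root}]",
        f"checkpointing.load={checkpoint}",
    ]


def expected_output_dir(wandb_name):
    """Directory where rendered views are stored, per the note above."""
    return Path("outputs") / "test" / wandb_name


cmd = build_test_command(
    experiment="dl3dv_mvsplat360",
    wandb_name="dl3dv_480P_ctx5_tgt56",
    dataset_root="datasets/dl3dv_tiny",
    checkpoint="checkpoints/dl3dv_480p.ckpt",
)

if __name__ == "__main__":
    # Print a shell-safe version of the command; to execute it instead,
    # pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
    print("views expected under:", expected_output_dir("dl3dv_480P_ctx5_tgt56"))
```

Passing the arguments as a list avoids shell quoting problems with the square brackets in `dataset.roots=[...]`.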
To evaluate quantitative performance, refer to compute_dl3dv_metrics.py.
To render videos from a pre-trained model, run the following:
# dl3dv; requires at least 38G VRAM
python -m src.main +experiment=dl3dv_mvsplat360_video \
wandb.name=dl3dv_480P_ctx5_tgt56_video \
mode=test \
dataset/view_sampler=evaluation \
dataset.roots=[datasets/dl3dv_tiny] \
checkpointing.load=checkpoints/dl3dv_480p.ckpt
- Download the pre-trained encoder weights from MVSplat and save them to checkpoints/re10k.ckpt.
- Download the pre-trained SVD weights from generative-models and save them to checkpoints/svd.safetensors.
- Run the following:
# train mvsplat360; requires at least 80G VRAM
python -m src.main +experiment=dl3dv_mvsplat360 dataset.roots=[datasets/dl3dv]
- Alternatively, you can fine-tune from our released model by appending checkpointing.load=checkpoints/dl3dv_480p.ckpt and checkpointing.resume=false to the above command.
- You can also set up your wandb account here for logging. Have fun.
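Since training needs both downloaded weight files in place, a quick pre-flight check can save a failed launch. The sketch below assumes it is run from the repository root; `missing_checkpoints` is a hypothetical helper, and the two paths are the ones named in the steps above.

```python
from pathlib import Path

# Weight files named in the training steps above.
REQUIRED = ("checkpoints/re10k.ckpt", "checkpoints/svd.safetensors")


def missing_checkpoints(root=".", required=REQUIRED):
    """Return the required files that do not exist under root."""
    root = Path(root)
    return [p for p in required if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing pre-trained weights:")
        for path in missing:
            print("  -", path)
    else:
        print("All pre-trained weights found.")
```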
The camera intrinsic matrices are normalized (the first row is divided by image width, and the second row is divided by image height). More details are at this comment.
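As an illustration of the normalization described above, the following NumPy sketch converts a pixel-unit intrinsic matrix to the normalized form and back. The function names and the example values (a 640x480 image with a 500-pixel focal length) are assumptions for illustration, not code from this repository.

```python
import numpy as np


def normalize_intrinsics(K, width, height):
    """Divide the first row by image width and the second row by height."""
    K = np.array(K, dtype=np.float64)  # copies, so the input is untouched
    K[0, :] /= width
    K[1, :] /= height
    return K


def denormalize_intrinsics(K_norm, width, height):
    """Inverse of normalize_intrinsics: recover pixel-unit intrinsics."""
    K = np.array(K_norm, dtype=np.float64)
    K[0, :] *= width
    K[1, :] *= height
    return K


if __name__ == "__main__":
    # Hypothetical pinhole camera with the principal point at the image center.
    K_pixels = np.array([
        [500.0, 0.0, 320.0],
        [0.0, 500.0, 240.0],
        [0.0, 0.0, 1.0],
    ])
    print(normalize_intrinsics(K_pixels, width=640, height=480))
```

With this convention a principal point at the image center maps to (0.5, 0.5), and the matrix stays valid when the image is resized.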
The camera extrinsic matrices are OpenCV-style camera-to-world matrices (+X right, +Y down, +Z forward, i.e., the camera looks along +Z into the screen).
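To make the extrinsic convention concrete, the sketch below reads the camera center and viewing direction from a 4x4 camera-to-world matrix and inverts it to a world-to-camera matrix. It assumes the rotation block is orthonormal; the function names and the example pose are hypothetical and not taken from this repository.

```python
import numpy as np


def camera_center_and_forward(c2w):
    """Camera position and viewing direction in world coordinates."""
    c2w = np.asarray(c2w, dtype=np.float64)
    center = c2w[:3, 3]   # translation column is the camera center
    forward = c2w[:3, 2]  # third rotation column is the camera's +Z axis
    return center, forward


def camera_to_world_point(c2w, point_cam):
    """Map a 3D point from camera coordinates to world coordinates."""
    c2w = np.asarray(c2w, dtype=np.float64)
    p = np.append(np.asarray(point_cam, dtype=np.float64), 1.0)
    return (c2w @ p)[:3]


def invert_pose(c2w):
    """World-to-camera matrix, assuming an orthonormal rotation block."""
    c2w = np.asarray(c2w, dtype=np.float64)
    R, t = c2w[:3, :3], c2w[:3, 3]
    w2c = np.eye(4)
    w2c[:3, :3] = R.T
    w2c[:3, 3] = -R.T @ t
    return w2c


if __name__ == "__main__":
    # Hypothetical pose: camera at (1, 2, 3), rotated 90 degrees about Y.
    pose = np.eye(4)
    pose[:3, :3] = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]
    pose[:3, 3] = [1, 2, 3]
    center, forward = camera_center_and_forward(pose)
    print("center:", center, "forward:", forward)
```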
@inproceedings{chen2024mvsplat360,
title = {MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views},
author = {Chen, Yuedong and Zheng, Chuanxia and Xu, Haofei and Zhuang, Bohan and Vedaldi, Andrea and Cham, Tat-Jen and Cai, Jianfei},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2024},
}
The project is based on MVSplat, pixelSplat, UniMatch and generative-models. Many thanks to these projects for their excellent contributions!