<!DOCTYPE html>
<html>
<head>
<script>
window.dataLayer = window.dataLayer || [];
</script>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, shrink-to-fit=no">
<title>AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.5.0/css/bootstrap.min.css">
<link href='https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,500,600' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="data/assets/css/styles.css">
<link rel="apple-touch-icon" sizes="180x180" href="apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="data/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="data/favicon-16x16.png">
<link rel="manifest" href="site.webmanifest">
<meta property="og:site_name" content="AC3D" />
<meta property="og:type" content="video.other" />
<meta property="og:title" content="AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers" />
<meta property="og:description" content="AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers" />
<meta property="og:url" content="" />
</head>
<body>
<div class="highlight-clean" style="padding-bottom: 0px; padding-top: 20px;">
<div class="container" style="max-width: 1024px; margin-bottom: 20px">
<h1 class="text-center" style="font-size:33px;"><b>AC3D</b>: Analyzing and Improving 3D Camera Control <br> in Video Diffusion Transformers</h1>
</div>
</div>
<div class="container" style="max-width: 1024px; margin-bottom: 20px;">
<div class="authors" style="font-size:6px;">
<a href="https://sherwinbahmani.github.io/">
Sherwin Bahmani<sup>*1,2,3</sup>
</a>
<a href="https://universome.github.io/">
Ivan Skorokhodov<sup>*3</sup>
</a>
<a href="https://guochengqian.github.io/">
Guocheng Qian<sup>3</sup>
</a>
<a href="https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website/">
Aliaksandr Siarohin<sup>3</sup>
</a>
<br>
</div>
<div class="authors">
<a href="https://www.willimenapace.com/">
Willi Menapace<sup>3</sup>
</a>
<a href="https://taiya.github.io/">
Andrea Tagliasacchi<sup>1,4</sup>
</a>
<a href="https://davidlindell.com/">
David B. Lindell<sup>1,2</sup>
</a>
<a href="http://www.stulyakov.com/">
Sergey Tulyakov<sup>3</sup>
</a>
</div>
<div class="affiliations" style="max-width: 1024px; margin-bottom: 10px">
<span><sup>1</sup>University of Toronto </span>
<span><sup>2</sup>Vector Institute </span>
<span><sup>3</sup>Snap Inc. </span>
<span><sup>4</sup>SFU </span>
</div>
<div class="affiliations" style="max-width: 1024px; margin-bottom: 10px; margin-top: 10px">
<span><sup>*</sup> equal contribution </span>
</div>
<div class="container" style="max-width: 1024px; margin-bottom: 10px">
<h1 class="text-center" style="font-size:22px;">arXiv 2024</h1>
</div>
</div>
<div id="container">
<div class="buttons" style="margin-top: 8px; margin-bottom: 8px;">
<a class="btn btn-light" role="button" href="https://arxiv.org/abs/2411.18673">
<svg style="width:24px;height:24px;margin-left:-12px;margin-right:12px" viewBox="0 0 24 24">
<path fill="currentColor" d="M16 0H8C6.9 0 6 .9 6 2V18C6 19.1 6.9 20 8 20H20C21.1 20 22 19.1 22 18V6L16 0M20 18H8V2H15V7H20V18M4 4V22H20V24H4C2.9 24 2 23.1 2 22V4H4M10 10V12H18V10H10M10 14V16H15V14H10Z"></path>
</svg>arXiv
</a>
</div>
</div>
<hr class="divider" />
<div class="container" style="max-width: 768px;">
<div class="compositional captioned_videos">
<video class="video lazy" autoplay loop playsinline muted>
<source data-src="data/videos/teaser.mp4" type="video/mp4">
</video>
<h6>"Three fluffy sheep sit side by side at a rustic wooden table, each eagerly digging into their bowls of spaghetti."</h6>
</div>
</div>
<hr class="divider" />
<div class="container" style="max-width: 768px;">
<div class="row">
<div class="col-md-12">
<div class="row">
<div class="col-sm-12">
<h2>Abstract</h2>
</div>
</div>
<p>
In this work, we analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust the train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and that only a sub-portion of their layers contains the camera information. This suggests limiting the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a 4x reduction in training parameters, improved training speed, and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate camera motion from scene motion and improves the dynamics of generated pose-conditioned videos. We combine these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.
</p>
</div>
</div>
</div>
<hr class="divider" />
<div class="container" style="max-width: 768px;">
<div class="row">
<div class="col-sm-12">
<h2>Method</h2>
</div>
</div>
<div class="row captioned_videos">
<div class="col-md-12">
<p>
The VDiT-CC model, with ControlNet camera conditioning, is built on top of VDiT. Video synthesis is performed by the large 4,096-dimensional DiT-XL blocks of the frozen VDiT backbone, while VDiT-CC only processes and injects the camera information through lightweight 128-dimensional DiT-XS blocks (FC stands for fully-connected layers).
</p>
<img src="data/method.png" alt="architecture" style="width: 100%">
</div>
</div>
</div>
<hr class="divider" />
<div class="container" style="max-width: 768px;">
<div class="row">
<div class="col-sm-12">
<h2>Our Results</h2>
<h6>We visualize a sequence of 8 different camera trajectories (40 seconds total) shared across all prompts.</h6>
</div>
</div>
</div>
<div class="row captioned_videos" style="max-width: 768px;margin-left: auto;margin-right: auto;">
<div class="col-xs-12" style="width: 100%; text-align: center;">
<div class="video-compare-container2" style="width: 100%; display: flex; justify-content: center; align-items: center;">
<video class="video lazy" id="combined-video" loop playsinline autoPlay muted src="data/videos/ours/1_2.mp4" onplay="resizeAndPlay(this)" style="max-width: 100%; height: auto;"></video>
</div>
<div style="display: flex; justify-content: space-between; width: 100%; margin-top: 0px; margin-bottom: 20px;">
<h6 class="caption" style="font-size:11px; width: 48%; padding-right: 10px;">In a sophisticated art studio, a cat wearing a beret sits at an easel, delicately painting on a tiny canvas.</h6>
<h6 class="caption" style="font-size:11px; width: 48%; padding-left: 10px;">In a futuristic kitchen, an astronaut expertly cooks with a pan over a small, controlled flame. There is a pond with a group of curious ducks that swim nearby.</h6>
</div>
</div>
<div class="col-xs-12" style="width: 100%; text-align: center;">
<div class="video-compare-container2" style="width: 100%; display: flex; justify-content: center; align-items: center;">
<video class="video lazy" id="combined-video" loop playsinline autoPlay muted src="data/videos/ours/3_4.mp4" onplay="resizeAndPlay(this)" style="max-width: 100%; height: auto;"></video>
</div>
<div style="display: flex; justify-content: space-between; width: 100%; margin-top: 0px; margin-bottom: 20px;">
<h6 class="caption" style="font-size:11px; width: 48%; padding-right: 10px;">A teddy bear diligently washes dishes in a cozy kitchen.</h6>
<h6 class="caption" style="font-size:11px; width: 48%; padding-left: 10px;">A golden retriever, sitting on the sand at a tropical beach, eagerly devours an ice cream cone. The sun sets in the background, casting a golden hue over the calm waves.</h6>
</div>
</div>
<div class="col-xs-12" style="width: 100%; text-align: center;">
<div class="video-compare-container2" style="width: 100%; display: flex; justify-content: center; align-items: center;">
<video class="video lazy" id="combined-video" loop playsinline autoPlay muted src="data/videos/ours/5_6.mp4" onplay="resizeAndPlay(this)" style="max-width: 100%; height: auto;"></video>
</div>
<div style="display: flex; justify-content: space-between; width: 100%; margin-top: 0px; margin-bottom: 20px;">
<h6 class="caption" style="font-size:11px; width: 48%; padding-right: 10px;">A squirrel sits contentedly on a park bench, nibbling on a juicy burger with its tiny paws. The park around it is filled with trees and flowers in full bloom, and a few curious birds watch from nearby branches.</h6>
<h6 class="caption" style="font-size:11px; width: 48%; padding-left: 10px;">An otter, expertly operating an espresso machine in a cozy, warmly lit café, moves its tiny paws with great precision as it grinds fresh coffee beans and steams milk.</h6>
</div>
</div>
<div class="col-xs-12" style="width: 100%; text-align: center;">
<div class="video-compare-container2" style="width: 100%; display: flex; justify-content: center; align-items: center;">
<video class="video lazy" id="combined-video" loop playsinline autoPlay muted src="data/videos/ours/7_8.mp4" onplay="resizeAndPlay(this)" style="max-width: 100%; height: auto;"></video>
</div>
<div style="display: flex; justify-content: space-between; width: 100%; margin-top: 0px; margin-bottom: 20px;">
<h6 class="caption" style="font-size:11px; width: 48%; padding-right: 10px;">In a chic urban kitchen, a cat wearing a small chef's hat expertly kneads dough on a sleek marble countertop.</h6>
<h6 class="caption" style="font-size:11px; width: 48%; padding-left: 10px;">An astronaut cooking with a pan in the kitchen.</h6>
</div>
</div>
<div class="col-xs-12" style="width: 100%; text-align: center;">
<div class="video-compare-container2" style="width: 100%; display: flex; justify-content: center; align-items: center;">
<video class="video lazy" id="combined-video" loop playsinline autoPlay muted src="data/videos/ours/9_10.mp4" onplay="resizeAndPlay(this)" style="max-width: 100%; height: auto;"></video>
</div>
<div style="display: flex; justify-content: space-between; width: 100%; margin-top: 0px; margin-bottom: 20px;">
<h6 class="caption" style="font-size:11px; width: 48%; padding-right: 10px;">A cyborg koala, wearing a pair of headphones and standing in front of a high-tech turntable, DJs on a rooftop in a futuristic, neon-lit Tokyo. The rain falls in sheets around it, creating a shimmering effect as it mixes beats.</h6>
<h6 class="caption" style="font-size:11px; width: 48%; padding-left: 10px;">Cats, dressed in formal attire, sit around an elaborate chessboard, each pondering their next strategic move in the tense match.</h6>
</div>
</div>
<div class="col-xs-12" style="width: 100%; text-align: center;">
<div class="video-compare-container2" style="width: 100%; display: flex; justify-content: center; align-items: center;">
<video class="video lazy" id="combined-video" loop playsinline autoPlay muted src="data/videos/ours/11_12.mp4" onplay="resizeAndPlay(this)" style="max-width: 100%; height: auto;"></video>
</div>
<div style="display: flex; justify-content: space-between; width: 100%; margin-top: 0px; margin-bottom: 0px;">
<h6 class="caption" style="font-size:11px; width: 48%; padding-right: 10px;">Amidst the ruined remnants of a once-thriving city, a lone robot scavenger sifts through the debris, its metallic fingers reaching through broken concrete and twisted metal in search of valuable salvage.</h6>
<h6 class="caption" style="font-size:11px; width: 48%; padding-left: 10px;">A mouse dressed in Renaissance attire, holding a slice of cheese delicately between its paws and eating it.</h6>
</div>
</div>
</div>
<hr class="divider" />
<div class="container" style="max-width: 768px;">
<div class="row">
<div class="col-md-12">
<h2>Citation</h2>
<code>
@article{bahmani2024ac3d,<br>
author = {Bahmani, Sherwin and Skorokhodov, Ivan and Qian, Guocheng and Siarohin, Aliaksandr and Menapace, Willi and Tagliasacchi, Andrea and Lindell, David B. and Tulyakov, Sergey},<br>
title = {AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers},<br>
journal = {arXiv preprint arXiv:2411.18673},<br>
year = {2024},<br>
}</code></div>
</div>
</div>
<hr class="divider" />
<div class="container" style="max-width: 768px;">
<footer>
<p>Website template from <a href="https://dreamfusion3d.github.io/">DreamFusion</a> and <a href="https://mv-dream.github.io/">MVDream</a>. We thank the authors for the open-source code.</p>
</footer>
</div>
<script src="data/assets/js/yall.js"></script>
<script>
yall(
{
observeChanges: true
}
);
</script>
</body>
</html>