Abstract
Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. We introduce DELTA, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or sparse tracking, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy. Furthermore, we explore the impact of depth representation on tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks. Our method provides a robust solution for applications requiring fine-grained, long-term motion tracking in 3D space.
Motivation
Existing motion prediction methods are limited to short-term, sparse predictions and often fail to deliver accurate 3D motion estimates, while optimization-based approaches require substantial time to process a single video. Ours is the first method capable of efficiently tracking every pixel in 3D space over hundreds of frames from monocular videos, and it achieves state-of-the-art accuracy on 3D tracking benchmarks.