Transformers can process video efficiently using joint space-time attention. ("Is Space-Time Attention All You Need for Video Understanding?",
https://arxiv.org/pdf/2102.05095)
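A minimal sketch of what joint space-time attention means, under simplifying assumptions: the patches of all frames are flattened into one sequence of T * N tokens, and every token attends to every other token across both space and time. The function name `joint_space_time_attention` is hypothetical, and the sketch uses a single head with the tokens themselves as queries, keys, and values (no learned projections), so it illustrates the mechanism rather than the cited paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_space_time_attention(video_tokens):
    """Self-attention over all patches of all frames at once.

    video_tokens: nested list shaped [T][N][D]
                  (T frames, N patches per frame, D features per patch).
    Returns a nested list with the same [T][N][D] shape.
    Assumption: queries = keys = values = the input tokens (single head,
    no learned projection matrices).
    """
    num_frames = len(video_tokens)
    num_patches = len(video_tokens[0])

    # Flatten time and space into a single sequence of T * N tokens.
    tokens = [patch for frame in video_tokens for patch in frame]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    attended = []
    for query in tokens:
        # Scaled dot-product score against every token in every frame.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all tokens.
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )

    # Restore the [T][N][D] layout.
    return [
        attended[t * num_patches:(t + 1) * num_patches] for t in range(num_frames)
    ]
```

Because each query is scored against all T * N tokens, the score matrix has (T * N)^2 entries, so the sequence length grows with both the number of frames and the number of patches per frame.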
RMSE against the ground-truth (GT) trajectory on KITTI. Performance improves as more frames are used.
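One common way to compute a trajectory RMSE is sketched below. It assumes the estimated and GT trajectories are already time-synchronized and expressed in the same coordinate frame, and that each pose is reduced to a position tuple such as (x, y, z). The helper name `trajectory_rmse` is hypothetical, and the exact metric behind the numbers above may differ.

```python
import math


def trajectory_rmse(estimated, ground_truth):
    """Root-mean-square position error between two trajectories.

    estimated, ground_truth: equal-length sequences of position tuples,
    one per frame, assumed to be aligned in time and coordinate frame.
    """
    if len(estimated) != len(ground_truth):
        raise ValueError("trajectories must have the same number of poses")
    if not estimated:
        raise ValueError("trajectories must not be empty")

    # Squared Euclidean distance between corresponding positions.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pos, gt_pos))
        for est_pos, gt_pos in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, `trajectory_rmse([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 3, 4)])` averages squared errors of 0 and 25 and returns the square root of 12.5.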