CoMotion: Concurrent Multi-person 3D Motion

Part of International Conference on Representation Learning 2025 (ICLR 2025) Conference

Bibtex Paper Supplemental

Authors

Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan Richter, Vladlen Koltun

Abstract

We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time.