VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control


Out-of-Distribution Camera Trajectories

We apply different translations, rotations, and their combinations to the same initial scenes. This demonstrates the ability to handle a large variety of user-defined camera trajectories and directional changes. We use the same seed for all videos and do not cherry-pick any results.
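As an illustration of how such combined trajectories might be specified, the following is a minimal sketch that chains translation and rotation segments into per-frame 4x4 camera poses. The helper names (`yaw_rotation`, `make_segment`, `chain_segments`), the axis and sign conventions, and the linear interpolation are all assumptions made for this example; the page does not state how the trajectories shown here were actually constructed or fed to the model.

```python
import numpy as np

def yaw_rotation(angle_rad):
    """3x3 rotation about the vertical (y) axis. Axis choice is an assumption."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def make_segment(num_frames, translation=(0.0, 0.0, 0.0), yaw_deg=0.0):
    """Per-frame 4x4 poses interpolated linearly from identity to the target."""
    poses = []
    for t in np.linspace(0.0, 1.0, num_frames):
        pose = np.eye(4)
        pose[:3, :3] = yaw_rotation(np.deg2rad(yaw_deg) * t)
        pose[:3, 3] = np.asarray(translation, dtype=float) * t
        poses.append(pose)
    return np.stack(poses)  # shape (num_frames, 4, 4)

def chain_segments(*segments):
    """Concatenate segments so each one starts at the previous segment's final pose.

    The pose at each segment boundary appears twice (end of one, start of the next).
    """
    out = []
    start = np.eye(4)
    for seg in segments:
        out.append(start @ seg)   # (4,4) @ (N,4,4) broadcasts to (N,4,4)
        start = start @ seg[-1]
    return np.concatenate(out)

# Hypothetical "Translation Right, then Rotation Anticlockwise" trajectory.
# Whether +x means "right" and positive yaw means "anticlockwise" depends on
# the camera convention in use; both are assumed here.
trajectory = chain_segments(
    make_segment(8, translation=(1.0, 0.0, 0.0)),
    make_segment(8, yaw_deg=30.0),
)
print(trajectory.shape)
```

Building each motion as its own segment keeps single moves (for example, a pure translation) and composite moves (translate, then rotate) in the same representation, which makes it straightforward to reuse the same initial scene and seed across many trajectories.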

Rotation Around Clockwise

Rotation Around Anticlockwise

Rotation Clockwise (No Translation)

Rotation Anticlockwise (No Translation)

Zoom Out, then Up

Translation Right, then Rotation Anticlockwise

Translation Left, then Rotation Clockwise

Translation Left

Translation Right

Translation Up

Translation Down

Vanilla DiT Results

We use a pre-trained vanilla DiT model in the latent space of CogVideoX and fine-tune it for camera control with our mechanism. Our approach generalizes to other transformer architectures and pipelines.

Rotation Around Anticlockwise

Zoom Out, then Up