We apply different translations, rotations, and their combinations to the same initial scenes. This demonstrates the ability to handle a large variety of user-defined cameras and directional changes. We use the same seed for all videos and do not cherry-pick any results.
Rotation Around Clockwise
Rotation Around Anticlockwise
Rotation Clockwise (No Translation)
Rotation Anticlockwise (No Translation)
Zoom Out, then Up
Translation Right, then Rotation Anticlockwise
Translation Left, then Rotation Clockwise
Translation Left
Translation Right
Translation Up
Translation Down
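For readers who want to reproduce trajectories like those listed above, the following is a minimal sketch of how a combined motion such as "Translation Right, then Rotation Anticlockwise" could be written as a sequence of per-frame camera poses. It is an illustration under stated assumptions, not the method's actual input format: the helper names (`yaw_pose`, `segment`, `trajectory`), the 4x4 camera-to-world convention, the choice of the y axis as the vertical axis, the sign used for "anticlockwise", and the magnitudes (1 unit, 30 degrees, 8 frames per segment) are all hypothetical.

```python
import math

def yaw_pose(angle_deg, tx=0.0, ty=0.0, tz=0.0):
    """4x4 pose: rotation about the vertical (y) axis plus a translation.
    The camera-to-world convention and the rotation sign are assumptions."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [
        [c,   0.0, s,   tx],
        [0.0, 1.0, 0.0, ty],
        [-s,  0.0, c,   tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def segment(num_frames, start, end):
    """Linearly interpolate (yaw_deg, tx, ty, tz) from start to end, inclusive."""
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        frames.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return frames

def trajectory(segments):
    """Chain segments of (num_frames, end_state); each one starts where
    the previous one ended. The first pose is the identity."""
    state = (0.0, 0.0, 0.0, 0.0)
    params = [state]
    for num_frames, end in segments:
        # Drop the first interpolated frame: it duplicates the current state.
        params.extend(segment(num_frames + 1, state, end)[1:])
        state = end
    return [yaw_pose(*p) for p in params]

# Hypothetical "Translation Right, then Rotation Anticlockwise" trajectory.
poses = trajectory([
    (8, (0.0, 1.0, 0.0, 0.0)),   # move 1 unit to the right over 8 frames
    (8, (30.0, 1.0, 0.0, 0.0)),  # then yaw by 30 degrees over 8 frames
])
```

Other entries in the list follow the same pattern under these assumptions: a pure translation uses a single segment with zero yaw, and a rotation without translation uses a single segment that changes only the yaw angle.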
Vanilla DiT Results
We take a pre-trained vanilla DiT model that operates in the latent space of CogVideoX and fine-tune it for camera control with our mechanism. These results suggest that our approach generalizes to other transformer architectures and pipelines.