CamDirector: Towards Long-Term Coherent Video Trajectory Editing

CVPR 2026
1McMaster University, 2University of Toronto, 3The University of Hong Kong, 4McGill University, 5Concordia University, 6MBZUAI
* Equal contribution

We present a novel video trajectory editing framework that re-renders a given video along a user-specified camera trajectory, producing aesthetically pleasing, cinematic camera movements.

[Video comparisons: Source · RecamMaster · TrajectoryCrafter · GT · Gen3C · CamDirector]



Abstract

Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos.

Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models.

To address these issues, we introduce a new VTE framework that:

  • Explicitly aggregates information across the entire source video via a hybrid warping scheme. Static regions are progressively fused into a world cache and then rendered to the target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement.
  • Processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence.

Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.

Method

Overview of CamDirector

Overview of CamDirector. Left: The hybrid warping scheme leverages the entire source video to construct coarse frames by processing dynamic and static regions separately, providing a global reference of the original scene content.
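The hybrid warping step above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the dynamic/static mask, the fusion weight, and the function name are all assumptions, and warping to the target pose is omitted (identity) to keep the sketch short.

```python
import numpy as np

def hybrid_warp(frame, dyn_mask, world_cache):
    """Toy sketch of the hybrid warping fusion (all names illustrative).

    frame:       H x W x 3 source frame
    dyn_mask:    H x W boolean mask of dynamic regions
    world_cache: H x W x 3 running cache of static scene content
    """
    # Progressively fuse static pixels into the world cache
    # (a simple running average stands in for the real fusion rule).
    static = ~dyn_mask
    world_cache[static] = 0.5 * world_cache[static] + 0.5 * frame[static]
    # Coarse frame: dynamic pixels come straight from the (warped) source
    # frame, static pixels are rendered from the fused world cache.
    coarse = np.where(dyn_mask[..., None], frame, world_cache)
    return coarse, world_cache
```

The key property the sketch preserves is that static content accumulates across the whole source video via the cache, while dynamic content is taken per-frame, so the two can be composited into a globally consistent coarse frame.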

Right: The CCDM conditions the generation on the coarse video via ControlNet, while source-frame tokens are concatenated with target tokens as inputs to the base T2V model to provide reliable motion and appearance priors.
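The input layout described above can be summarized as a shape sketch. This is only an illustration of the token arrangement, assuming flattened (N, D) token matrices; the function name and the axis choice are hypothetical, not the paper's code.

```python
import numpy as np

def ccdm_inputs(src_tokens, tgt_tokens, coarse_cond):
    """Illustrative input layout for the conditional diffusion model.

    Source-frame tokens are concatenated with the (noised) target tokens
    along the sequence axis for the base T2V backbone, while the coarse
    video conditions generation through a ControlNet-style side branch.
    """
    backbone_in = np.concatenate([src_tokens, tgt_tokens], axis=0)  # (2N, D)
    control_in = coarse_cond  # fed to the ControlNet branch, not the backbone
    return backbone_in, control_in
```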

Illustration of history-guided autoregressive generation

History-guided autoregressive generation. In each iteration, the T previously generated frames serve as history to guide the synthesis of the next T frames, together with the corresponding 2T source frames, which are used to produce the coarse frames and provide the original scene context.
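The autoregressive loop can be sketched as follows. This is a schematic under stated assumptions: `generate_segment` is a hypothetical stand-in for one pass of the diffusion model, and the first segment simply runs without history.

```python
def autoregressive_edit(source_frames, T, generate_segment):
    """Sketch of history-guided autoregressive generation (illustrative API).

    Each step conditions on the last T generated frames (history) plus up to
    2T corresponding source frames, and emits the next T target frames.
    """
    outputs = []
    for start in range(0, len(source_frames), T):
        history = outputs[-T:]  # last T generated frames (empty on first step)
        # Source frames aligned with the history window and the new window.
        src = source_frames[max(0, start - T):start + T]
        outputs.extend(generate_segment(history, src))
    return outputs
```

Because each segment sees both its history and the aligned source frames, errors do not need to be corrected from scratch at every step, which is what enables long-term temporal coherence.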

Illustration of progressive world cache update

Progressive world cache update. Whenever a new segment is generated, we evenly sample C frames as anchors, where the newly inpainted regions are merged into the world cache. The updated regions are highlighted in red.
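The anchor-based cache update can be sketched as below. This is a minimal illustration, assuming the world cache is stored in image space and that per-frame boolean masks mark newly inpainted pixels; the real cache is a 3D structure rendered to poses, which the sketch deliberately omits.

```python
import numpy as np

def update_world_cache(segment, inpaint_masks, cache, C):
    """Sketch of the progressive world cache update (names illustrative).

    From a newly generated segment, C evenly spaced frames serve as anchors;
    pixels marked as newly inpainted are merged into the cache so that later
    segments reinforce already inpainted content.
    """
    # Evenly sample C anchor frames from the segment.
    idx = np.linspace(0, len(segment) - 1, C).round().astype(int)
    for i in idx:
        m = inpaint_masks[i]          # H x W bool: newly inpainted pixels
        cache[m] = segment[i][m]      # merge inpainted regions into the cache
    return cache
```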

iPhone-PTZ Benchmark

iPhone-PTZ includes ten diverse scenes featuring a broad spectrum of camera motions, such as dolly, pan, and orbit, as well as significantly larger fields of view.

Each scene contains two synchronized 1280 × 720 videos, with durations ranging from 5 to 12 seconds, captured using identical iPhone 14 Plus devices:

  • Casual Setting: Recorded by casual users under handheld settings to simulate amateur footage.
  • Professional Setting: Recorded by professional operators using a DJI Osmo Mobile 7P PTZ to introduce cinematic camera motions as the ground truth for trajectory editing.

BibTeX

@article{yin2026camdirector,
  title={CamDirector: Towards Long-Term Coherent Video Trajectory Editing}, 
  author={Kejia Yin and Zhihao Shi and Weilin Wan and Yuhongze Zhou and Yuanhao Yu and Xinxin Zuo and Qiang Sun and Juwei Lu},
  journal={arXiv preprint arXiv:2603.02256},
  year={2026}
}