CamDirector: Towards Long-Term Coherent Video Trajectory Editing

CVPR 2026
1McMaster University, 2University of Toronto, 3The University of Hong Kong, 4McGill University, 5Concordia University, 6MBZUAI
* Equal contribution

We present a novel video trajectory editing framework that re-renders a given video along a user-specified camera trajectory, producing aesthetically pleasing, cinematic camera movements.

[Video comparisons: Source · RecamMaster · TrajectoryCrafter · GT · Gen3C · CamDirector]



Abstract

Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos.

Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models.

To address these issues, we introduce a new VTE framework that:

  • Explicitly aggregates information across the entire source video via a hybrid warping scheme. Static regions are progressively fused into a world cache and then rendered to the target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement.
  • Processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence.

Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.

Method

Overview of CamDirector

Overview of CamDirector. Left: The hybrid warping scheme leverages the entire source video to construct coarse frames by processing dynamic and static regions separately, providing a global reference of the original scene content.
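The hybrid warping step above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the dynamic/static mask, the fusion weight, and the function name are all assumptions, and warping to the target pose is omitted (identity) to keep the sketch short.

```python
import numpy as np

def hybrid_warp(frame, dyn_mask, world_cache):
    """Toy sketch of the hybrid warping fusion (all names illustrative).

    frame:       H x W x 3 source frame
    dyn_mask:    H x W boolean mask of dynamic regions
    world_cache: H x W x 3 running cache of static scene content
    """
    # Progressively fuse static pixels into the world cache
    # (a simple running average stands in for the real fusion rule).
    static = ~dyn_mask
    world_cache[static] = 0.5 * world_cache[static] + 0.5 * frame[static]
    # Coarse frame: dynamic pixels come straight from the (warped) source
    # frame, static pixels are rendered from the fused world cache.
    coarse = np.where(dyn_mask[..., None], frame, world_cache)
    return coarse, world_cache
```

The key property the sketch preserves is that static content accumulates across the whole source video via the cache, while dynamic content is taken per-frame, so the two can be composited into a globally consistent coarse frame.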

Right: The CCDM conditions the generation on the coarse video via ControlNet, while source-frame tokens are concatenated with target tokens as inputs to the base T2V model to provide reliable motion and appearance priors.
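The input layout described above can be summarized as a shape sketch. This is only an illustration of the token arrangement, assuming flattened (N, D) token matrices; the function name and the axis choice are hypothetical, not the paper's code.

```python
import numpy as np

def ccdm_inputs(src_tokens, tgt_tokens, coarse_cond):
    """Illustrative input layout for the conditional diffusion model.

    Source-frame tokens are concatenated with the (noised) target tokens
    along the sequence axis for the base T2V backbone, while the coarse
    video conditions generation through a ControlNet-style side branch.
    """
    backbone_in = np.concatenate([src_tokens, tgt_tokens], axis=0)  # (2N, D)
    control_in = coarse_cond  # fed to the ControlNet branch, not the backbone
    return backbone_in, control_in
```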

Illustration of history-guided autoregressive generation

History-guided autoregressive generation. In each iteration, the T previously generated frames serve as history to guide the synthesis of the next T frames, together with the corresponding 2T source frames, which are used to produce the coarse frames and provide the original scene context.
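The autoregressive loop can be sketched as follows. This is a schematic under stated assumptions: `generate_segment` is a hypothetical stand-in for one pass of the diffusion model, and the first segment simply runs without history.

```python
def autoregressive_edit(source_frames, T, generate_segment):
    """Sketch of history-guided autoregressive generation (illustrative API).

    Each step conditions on the last T generated frames (history) plus up to
    2T corresponding source frames, and emits the next T target frames.
    """
    outputs = []
    for start in range(0, len(source_frames), T):
        history = outputs[-T:]  # last T generated frames (empty on first step)
        # Source frames aligned with the history window and the new window.
        src = source_frames[max(0, start - T):start + T]
        outputs.extend(generate_segment(history, src))
    return outputs
```

Because each segment sees both its history and the aligned source frames, errors do not need to be corrected from scratch at every step, which is what enables long-term temporal coherence.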

Illustration of progressive world cache update

Progressive world cache update. Whenever a new segment is generated, we evenly sample C frames as anchors, where the newly inpainted regions are merged into the world cache. The updated regions are highlighted in red.
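The anchor-based cache update can be sketched as below. This is a minimal illustration, assuming the world cache is stored in image space and that per-frame boolean masks mark newly inpainted pixels; the real cache is a 3D structure rendered to poses, which the sketch deliberately omits.

```python
import numpy as np

def update_world_cache(segment, inpaint_masks, cache, C):
    """Sketch of the progressive world cache update (names illustrative).

    From a newly generated segment, C evenly spaced frames serve as anchors;
    pixels marked as newly inpainted are merged into the cache so that later
    segments reinforce already inpainted content.
    """
    # Evenly sample C anchor frames from the segment.
    idx = np.linspace(0, len(segment) - 1, C).round().astype(int)
    for i in idx:
        m = inpaint_masks[i]          # H x W bool: newly inpainted pixels
        cache[m] = segment[i][m]      # merge inpainted regions into the cache
    return cache
```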

iPhone-PTZ Benchmark

iPhone-PTZ includes ten diverse scenes featuring a broad spectrum of camera motions, such as dolly, pan, and orbit, as well as significantly larger fields of view.

Each scene contains two synchronized 1280 × 720 videos, with durations ranging from 5 to 12 seconds, captured using identical iPhone 14 Plus devices:

  • Casual Setting: Recorded by casual users under handheld settings to simulate amateur footage.
  • Professional Setting: Recorded by professional operators using a DJI Osmo Mobile 7P PTZ to introduce cinematic camera motions as the ground truth for trajectory editing.

BibTeX

@article{yin2026camdirector,
  title={CamDirector: Towards Long-Term Coherent Video Trajectory Editing}, 
  author={Kejia Yin and Zhihao Shi and Weilin Wan and Yuhongze Zhou and Yuanhao Yu and Xinxin Zuo and Qiang Sun and Juwei Lu},
  journal={arXiv preprint arXiv:2603.02256},
  year={2026}
}