Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos.
Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models.
To address these issues, we introduce CamDirector, a new VTE framework that (i) builds coarse target frames with a hybrid warping scheme that processes the dynamic and static regions of the entire source video separately, (ii) conditions the CCDM generation on these coarse frames via ControlNet while concatenating source-frame tokens with target tokens to provide reliable motion and appearance priors, and (iii) scales to long videos through history-guided autoregressive generation with a progressively updated world cache.
Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, on which our method achieves state-of-the-art performance with fewer parameters than prior approaches.
Overview of CamDirector. Left: The hybrid warping scheme leverages the entire source video to construct coarse frames by processing dynamic and static regions separately, providing a global reference of the original scene content.
Right: The CCDM conditions the generation on the coarse video via ControlNet, while source-frame tokens are concatenated with target tokens as inputs to the base T2V model to provide reliable motion and appearance priors.
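The hybrid warping in the left panel can be sketched as follows. This is a minimal illustration only: the per-frame dynamic-region masks and the pose-dependent `warp_fn` are hypothetical placeholders, and the paper's actual warping operators and aggregation are not specified in the caption.

```python
import numpy as np

def hybrid_warp(src_frames, dyn_masks, warp_fn, tgt_poses):
    """Build coarse target frames: static pixels aggregate over the whole
    source video (global scene reference); dynamic pixels come from the
    time-aligned source frame only."""
    coarse = []
    for t, pose in enumerate(tgt_poses):
        # Static background: warp every source frame into the target view
        # and average, so the entire source video contributes context.
        static_stack = [warp_fn(f * (1 - m[..., None]), pose)
                        for f, m in zip(src_frames, dyn_masks)]
        static = np.mean(static_stack, axis=0)
        # Dynamic content: warp only the corresponding source frame.
        dynamic = warp_fn(src_frames[t] * dyn_masks[t][..., None], pose)
        coarse.append(static + dynamic)
    return np.stack(coarse)
```

With an identity `warp_fn` and empty dynamic masks, this reduces to a temporal average of the source frames, which makes the "global reference" role of the static branch explicit.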
History-guided autoregressive generation. In each iteration, the T☆ most recently generated frames serve as history to guide synthesis of the next T frames, and the corresponding T☆ + T source frames are provided as input to produce the coarse frames and supply original scene context.
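The sliding-window indexing behind this history-guided scheme can be sketched as below; the function name and tuple layout are illustrative, not the paper's API.

```python
def autoregressive_windows(num_frames, T, T_hist):
    """Yield (history, target, source) index ranges, each a half-open
    (start, end) pair. The first segment has no history; later segments
    reuse the last T_hist generated frames as guidance, and the source
    window covers the corresponding T_hist + T source frames."""
    windows = []
    start = 0
    while start < num_frames:
        hist = (max(0, start - T_hist), start)
        tgt = (start, min(start + T, num_frames))
        src = (hist[0], tgt[1])  # corresponding source-frame span
        windows.append((hist, tgt, src))
        start += T
    return windows
```

For example, a 10-frame edit with T = 4 and T☆ = 2 produces three segments whose history windows overlap the previous segment's output by two frames.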
Progressive world cache update. Whenever a new segment is generated, we evenly sample C frames as anchors, where the newly inpainted regions are merged into the world cache. The updated regions are highlighted in red.
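A sketch of the anchor-sampling step in this cache update is given below. The cache layout (a dict mapping frame indices to anchor images) and the per-pixel inpaint masks are assumptions for illustration; the paper's cache representation is not described in the caption.

```python
import numpy as np

def update_world_cache(cache, segment, seg_start, inpaint_masks, C):
    """Evenly sample C anchor frames from a newly generated segment and
    merge them into the world cache. Where an anchor index is already
    cached, only the newly inpainted pixels (mask == 1) overwrite the
    cached content; otherwise the anchor is added as-is."""
    idx = np.linspace(0, len(segment) - 1, num=C).round().astype(int)
    for i in idx:
        key = seg_start + int(i)
        if key in cache:
            m = inpaint_masks[i][..., None].astype(bool)
            cache[key] = np.where(m, segment[i], cache[key])
        else:
            cache[key] = segment[i]
    return cache
```

Even sampling via `np.linspace` guarantees the first and last frames of each segment become anchors, so consecutive segments share boundary anchors in the cache.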
iPhone-PTZ includes ten diverse scenes featuring a broad spectrum of camera motions, such as dolly, pan, and orbit, along with significantly larger variations in field of view.
Each scene contains two synchronized 1280 × 720 videos, with durations ranging from 5 to 12 seconds, captured using identical iPhone 14 Plus devices:
@article{yin2026camdirector,
title={CamDirector: Towards Long-Term Coherent Video Trajectory Editing},
author={Kejia Yin and Zhihao Shi and Weilin Wan and Yuhongze Zhou and Yuanhao Yu and Xinxin Zuo and Qiang Sun and Juwei Lu},
journal={arXiv preprint arXiv:2603.02256},
year={2026}
}