
Abstract

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independently controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
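To make the group-and-role idea concrete, the sketch below tags each stream of control tokens with two additive embeddings: a group embedding identifying which subject the signal belongs to, and a role embedding identifying the signal type. All names, table sizes, and the random initialization are illustrative assumptions, not the paper's implementation; in practice the tables would be learned jointly with the DiT.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
num_subjects = 2   # subjects in the scene (assumed)
num_roles = 3      # assumed roles: 0=reference image, 1=bbox, 2=trajectory

# Embedding tables are learnable in practice; random here for illustration.
group_table = rng.normal(size=(num_subjects, d_model))
role_table = rng.normal(size=(num_roles, d_model))

def tag_tokens(tokens, subject_id, role_id):
    """Anchor a stream of control tokens to one subject and one signal type
    by adding its group and role embeddings (hypothetical sketch)."""
    return tokens + group_table[subject_id] + role_table[role_id]

# Two subjects, each with a 5-token trajectory stream; after tagging, the
# model can attribute each trajectory to its subject unambiguously.
traj_a = tag_tokens(rng.normal(size=(5, d_model)), subject_id=0, role_id=2)
traj_b = tag_tokens(rng.normal(size=(5, d_model)), subject_id=1, role_id=2)
```

Because the tags are additive, the same injection point in the network serves every signal type, and attention layers can separate subjects by their distinct group offsets.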

Overall Framework of DreamVideo-Omni


Overview of \(\textbf{DreamVideo-Omni}\). In Stage 1, the framework introduces an all-in-one video DiT that incorporates reference images, bounding boxes, and trajectories for multi-subject customization and omni-motion control. Stage 2 further enhances identity fidelity via the proposed latent identity reward feedback learning mechanism, which uses a latent identity reward model to evaluate intermediate latents directly, bypassing the expensive VAE decoder for faster training.
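The Stage-2 shortcut can be sketched as follows: a small reward head scores identity preservation directly on an intermediate video latent against a reference-subject embedding, so no VAE decode is needed inside the training loop. The projection matrices, pooling, and cosine score here are hypothetical stand-ins for the trained latent identity reward model, chosen only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(1)
d_latent = 32

# Hypothetical frozen reward head: project the pooled video latent and the
# reference-subject embedding into a shared space and score their similarity.
W_latent = rng.normal(size=(d_latent, 16))
W_id = rng.normal(size=(d_latent, 16))

def identity_reward(latent, id_embed):
    """Score identity preservation directly in latent space,
    skipping the VAE decode step (illustrative stand-in)."""
    a = latent.mean(axis=0) @ W_latent   # pool latent tokens, then project
    b = id_embed @ W_id
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

latent = rng.normal(size=(10, d_latent))  # intermediate denoised latent tokens
id_embed = rng.normal(size=(d_latent,))   # reference-subject identity embedding
reward = identity_reward(latent, id_embed)
loss = -reward   # reward feedback: maximizing reward = minimizing its negative
```

Evaluating rewards on latents rather than decoded frames is what makes the feedback loop cheap enough to run at training time; the decoded-pixel path would add a full VAE forward pass per sample.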

Reference

      
@article{wei2026dreamvideo_omni,
  title={DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning},
  author={Wei, Yujie and Liu, Xinyu and Zhang, Shiwei and Yuan, Hangjie and Xing, Jinbo and Chen, Zhekai and Wang, Xiang and Qiu, Haonan and Zhao, Rui and Feng, Yutong and Chu, Ruihang and Zhang, Yingya and Guo, Yike and Liu, Xihui and Shan, Hongming},
  journal={arXiv preprint arXiv:2603.12257},
  year={2026}
}