TLDR: Reg-DPO is a new framework for improving video generation quality. It introduces GT-Pair for automatic, high-quality preference data creation without manual annotation. It also uses SFT regularization to stabilize DPO training and incorporates multiple memory optimization techniques, enabling efficient training of large video models. Experiments show it consistently produces superior video quality for both image-to-video and text-to-video tasks.
Video generation, a rapidly evolving field in artificial intelligence, faces significant hurdles when it comes to producing high-quality, realistic, and stable video content. While Direct Preference Optimization (DPO) has emerged as a promising technique to enhance video quality, its application to complex video tasks, especially with large-scale models, has been limited by challenges in data construction, training stability, and substantial memory consumption.
Researchers from ByteDance and Shanghai Jiao Tong University have introduced a novel framework called Reg-DPO, which aims to overcome these limitations. Their work, detailed in the paper “Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation,” presents a systematic approach to making DPO more efficient and effective for video generation.
Addressing Data Challenges with GT-Pair
One of the primary obstacles in DPO training for video generation is the high cost and difficulty of creating high-quality preference data. Traditional methods often rely on human annotation or complex automatic evaluators, which are expensive and time-consuming, especially for videos. To tackle this, the team developed the GT-Pair strategy.
GT-Pair automatically constructs high-quality preference pairs by using real, “ground-truth” videos as positive examples and videos generated by the model itself as negative examples. This innovative approach eliminates the need for any external human or automated annotation, making data construction significantly more efficient and scalable. The real videos inherently possess superior visual quality, temporal consistency, and semantic completeness compared to generated ones, creating a clear distinction for the model to learn from. This method ensures high data quality, low cost, and strong discriminability, leading to more effective training.
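As a concrete illustration, a minimal sketch of this pairing logic might look like the following. The function name `generate_video` and the dictionary layout are assumptions made for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch of GT-Pair construction (not the authors' code).
# `generate_video` stands in for sampling from the current video model.

def build_gt_pairs(samples, generate_video):
    """samples: iterable of (prompt, ground_truth_video) tuples."""
    pairs = []
    for prompt, gt_video in samples:
        generated = generate_video(prompt)   # model output acts as the negative
        pairs.append({
            "prompt": prompt,
            "chosen": gt_video,      # real video: higher quality by construction
            "rejected": generated,
        })
    return pairs
```

Because the chosen side of every pair is a real video, the quality gap between the two sides comes for free, which is what keeps the construction cheap and annotation-free.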
Enhancing Training Stability with Reg-DPO
Standard DPO, while powerful, can suffer from intrinsic instability during training. It primarily focuses on the relative difference between preferred and non-preferred samples, without directly supervising the overall distribution of generated samples. This can lead to rapid convergence, pronounced distribution shifts, and even model collapse, where the generated videos become blurry or contain artifacts.
To mitigate this, the researchers introduced Reg-DPO, which incorporates a Supervised Fine-Tuning (SFT) loss as a regularization term into the DPO objective. This SFT regularization provides an explicit constraint on positive samples, ensuring that the model consistently moves towards generating high-quality outputs. By dynamically weighting this regularization term, Reg-DPO balances preference learning with maintaining distribution consistency, leading to significantly enhanced training stability and improved generation fidelity. Experiments showed that Reg-DPO prevents the performance degradation and visual artifacts seen in vanilla DPO, producing consistently clearer and higher-quality videos.
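Conceptually, the objective combines the standard DPO preference term with a weighted SFT loss on the positive (ground-truth) side, roughly L = L_DPO + λ·L_SFT. The PyTorch sketch below shows this generic form; note that for diffusion-based video models the log-probability terms are typically replaced by per-sample denoising losses, and the paper's dynamic weighting schedule for `lam` is not reproduced here, so treat this as an assumed illustration rather than the exact formulation.

```python
import torch.nn.functional as F

def reg_dpo_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected,
                 sft_loss_chosen, beta=0.1, lam=1.0):
    """Generic DPO loss with an SFT regularization term (illustrative).

    logp_*          : log-likelihoods of chosen/rejected samples under the trainable model
    ref_logp_*      : the same quantities under the frozen reference model
    sft_loss_chosen : supervised loss (e.g. denoising MSE) on the chosen sample
    lam             : regularization weight (dynamically scheduled in the paper)
    """
    # Preference term: push the policy to rank chosen above rejected,
    # measured relative to the frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    dpo_term = -F.logsigmoid(margin).mean()

    # SFT regularization: explicit supervision toward the positive samples,
    # which counteracts distribution drift and collapse.
    return dpo_term + lam * sft_loss_chosen.mean()
```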
Optimizing Memory for Large Models
Training large video generation models (often exceeding 10 billion parameters) with DPO presents immense memory challenges. Video inputs are multi-frame, and DPO requires a frozen reference model alongside the trainable one, leading to extremely high GPU memory usage and frequent “out-of-memory” errors.
The team implemented a comprehensive memory optimization scheme built on the Fully Sharded Data Parallel (FSDP) framework, combined with several complementary techniques:

- Flash Attention for efficient attention computation
- Context parallelism to distribute work along the sequence dimension
- A fully parallelized pair-computation strategy
- Prompt pre-encoding to reduce runtime overhead
- Model offloading for frozen modules
- Refined computational-graph and memory-reclamation optimizations

Together, these measures achieved nearly three times the effective training capacity of FSDP alone, enabling stable training of ultra-large video models on high-resolution, long-sequence videos.
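As a rough illustration of the FSDP-plus-offloading part of such a recipe, the sketch below wraps a trainable policy model and a frozen reference model with PyTorch FSDP, sharding both and offloading the reference model's parameters to CPU. The `VideoDiT` class, the bf16 policy, and the overall structure are assumptions for illustration; context parallelism, Flash Attention, prompt pre-encoding, and the pair-parallel strategy are not shown.

```python
# Rough sketch: FSDP sharding plus CPU offload for the frozen reference model.
# `VideoDiT` is a placeholder for the actual video backbone.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
    CPUOffload,
)

def setup_models(VideoDiT):
    dist.init_process_group("nccl")
    bf16 = MixedPrecision(param_dtype=torch.bfloat16,
                          reduce_dtype=torch.bfloat16,
                          buffer_dtype=torch.bfloat16)

    # Trainable policy model: parameters, gradients, and optimizer state
    # are sharded across all GPUs.
    policy = FSDP(VideoDiT(),
                  sharding_strategy=ShardingStrategy.FULL_SHARD,
                  mixed_precision=bf16)

    # Frozen DPO reference model: no gradients, parameters offloaded to CPU
    # and brought onto the GPU only for its forward pass.
    reference = FSDP(VideoDiT().requires_grad_(False),
                     sharding_strategy=ShardingStrategy.FULL_SHARD,
                     mixed_precision=bf16,
                     cpu_offload=CPUOffload(offload_params=True))
    return policy, reference
```

Offloading only the frozen module keeps the trainable model's throughput largely intact while reclaiming the memory that the second copy of the network would otherwise occupy.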
Superior Performance Across Video Tasks
Extensive experiments were conducted on both Image-to-Video (I2V) and Text-to-Video (T2V) tasks across multiple datasets. The results consistently demonstrated that Reg-DPO, combined with GT-Pair data construction, significantly outperforms existing approaches. Evaluations using both human assessments (GSB, i.e., Good/Same/Bad comparisons) and automated metrics (VBench) confirmed superior video generation quality: better prompt adherence, enhanced visual consistency, fewer near-static videos that exhibit only micro-motion, improved generation stability, and greater physical plausibility.
In conclusion, Reg-DPO offers a robust and efficient framework for advancing video generation. By innovatively addressing data construction, algorithmic stability, and memory optimization, this research paves the way for creating more realistic, stable, and high-quality video content with large-scale generative models.


