UniVST: A Unified Framework for Training-free Localized Video Style Transfer

1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing,
Ministry of Education of China, Xiamen University, China.

2 Kunlun Skywork AI.

TL;DR: We present UniVST, a unified framework for localized video style transfer based on diffusion models. It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos.


Abstract

This paper presents UniVST, a unified framework for localized video style transfer based on diffusion models. It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos. The contributions of this paper are: (1) A point-matching mask propagation strategy that leverages the feature maps from DDIM inversion. This streamlines the model's architecture by obviating the need for separate tracking models. (2) A training-free AdaIN-guided localized video stylization mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding-window consistent smoothing scheme that harnesses optical flow in the pixel representation and refines the predicted noise to update the latent space. This significantly enhances temporal consistency and diminishes artifacts in the stylized video. UniVST has been validated as superior to existing methods in both quantitative and qualitative evaluations. It adeptly addresses the challenge of preserving the primary object's style while ensuring temporal consistency and detail preservation. Our code is available on GitHub.
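
For reference, AdaIN (adaptive instance normalization) re-normalizes content features x with the channel-wise statistics of style features y. The standard definition is given below for context; the exact guidance formulation used inside UniVST may differ:

$$\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$

where \\(\mu(\cdot)\\) and \\(\sigma(\cdot)\\) denote the mean and standard deviation computed per channel over spatial positions.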

Method

We propose UniVST, a unified framework for training-free localized video style transfer based on diffusion models. UniVST first applies DDIM inversion to the original video and the style image to obtain their initial noise, and integrates Point-Matching Mask Propagation to generate masks for the object regions. It then performs AdaIN-Guided Localized Video Stylization with a three-branch architecture for information interaction. Moreover, Sliding-Window Consistent Smoothing is incorporated into the denoising process, enhancing temporal consistency in the latent space. The overall framework is illustrated as follows:

[Figure: overall framework of UniVST]
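
As a rough sketch of the point-matching idea (not the paper's implementation), a first-frame object mask can be propagated frame by frame by matching every spatial location in the next frame to its most similar location in the previous frame, using per-frame feature maps such as those cached during DDIM inversion, and copying the matched location's mask value. The feature shapes, the frame-to-frame matching scheme, and the helper name propagate_mask below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def propagate_mask(feat_prev, feat_next, mask_prev):
    """Propagate a binary mask to the next frame by nearest-neighbor feature
    matching (a simplified stand-in for point matching on inversion features).

    feat_prev, feat_next: (C, H, W) feature maps; mask_prev: (H, W) in {0, 1}.
    """
    c, h, w = feat_prev.shape
    prev = F.normalize(feat_prev.reshape(c, -1), dim=0)  # (C, H*W), unit-norm columns
    nxt = F.normalize(feat_next.reshape(c, -1), dim=0)   # (C, H*W)
    sim = nxt.t() @ prev                                  # (H*W, H*W) cosine similarities
    match = sim.argmax(dim=1)                             # best-matching previous-frame point
    return mask_prev.reshape(-1)[match].reshape(h, w)     # copy matched mask values

# Toy usage: propagate a first-frame mask through random per-frame features.
feats = [torch.randn(16, 32, 32) for _ in range(4)]       # hypothetical inversion features
masks = [(torch.rand(32, 32) > 0.5).float()]               # hypothetical first-frame mask
for t in range(1, len(feats)):
    masks.append(propagate_mask(feats[t - 1], feats[t], masks[-1]))
print(len(masks), masks[-1].shape)  # 4 torch.Size([32, 32])
```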

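The latent-level part of the localized stylization can be sketched in a similarly simplified way: apply AdaIN to pull the video latents toward the style statistics, then blend stylized and original latents under the propagated mask so that only the intended region changes. The snippet below is a minimal sketch under assumed shapes and an assumed mask convention (mask = 1 marks the primary object kept unstylized); it does not cover the attention-level interaction.

```python
import torch

def adain(content, style, eps=1e-5):
    """Re-normalize content features with the channel-wise statistics of the style features."""
    c_mean = content.mean(dim=(-2, -1), keepdim=True)
    c_std = content.std(dim=(-2, -1), keepdim=True) + eps
    s_mean = style.mean(dim=(-2, -1), keepdim=True)
    s_std = style.std(dim=(-2, -1), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

def localized_blend(content_latent, stylized_latent, object_mask):
    """Keep the original latent inside the object mask, stylize everything else.
    (Foreground-preserving convention is an assumption; flip the mask to stylize the object instead.)"""
    return object_mask * content_latent + (1.0 - object_mask) * stylized_latent

# Toy usage with random latents shaped (frames, channels, H, W).
f, c, h, w = 8, 4, 64, 64
content = torch.randn(f, c, h, w)
style = torch.randn(1, c, h, w)                    # latent of a single style image
mask = (torch.rand(f, 1, h, w) > 0.5).float()      # propagated per-frame object mask
edited = localized_blend(content, adain(content, style), mask)
print(edited.shape)  # torch.Size([8, 4, 64, 64])
```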

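Finally, the sliding-window idea behind the consistency scheme can be illustrated by warping each frame's neighbors into its coordinate frame with optical flow and averaging them within the window. The paper applies this in the pixel representation and then refines the predicted noise to update the latent space; the sketch below shows only the flow-aligned window averaging, with zero flow standing in for a real optical-flow estimate, and all names (warp, sliding_window_smooth) are illustrative.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp a frame (1, C, H, W) with a dense flow field (1, 2, H, W) in (dx, dy) order."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W) pixel coordinates
    coords = base + flow
    # Normalize sampling coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # (1, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def sliding_window_smooth(frames, flows, radius=2):
    """Average each frame with its flow-aligned neighbors inside a sliding window."""
    smoothed = []
    for i in range(len(frames)):
        acc, count = torch.zeros_like(frames[i]), 0
        for j in range(max(0, i - radius), min(len(frames), i + radius + 1)):
            aligned = frames[j] if j == i else warp(frames[j], flows[(j, i)])
            acc, count = acc + aligned, count + 1
        smoothed.append(acc / count)
    return smoothed

# Toy usage: 4 frames; zero flow stands in for a real optical-flow estimate.
frames = [torch.randn(1, 3, 32, 32) for _ in range(4)]
flows = {(j, i): torch.zeros(1, 2, 32, 32) for i in range(4) for j in range(4)}
out = sliding_window_smooth(frames, flows)
print(len(out), out[0].shape)  # 4 torch.Size([1, 3, 32, 32])
```
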
Experimental Results

Single-Step Video Style Transfer

Multi-Step Video Style Transfer

Comparison

With Open-Source Models

With Commercial Models

BibTeX

@article{song2024univst,
        title={UniVST: A Unified Framework for Training-free Localized Video Style Transfer},
        author={Song, Quanjian and Lin, Mingbao and Zhan, Wengyi and Yan, Shuicheng and Cao, Liujuan and Ji, Rongrong},
        journal={arXiv preprint arXiv:2410.20084},
        year={2024}
}