VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

1University of California, Santa Cruz, 2Snap Research, 3KAUST, 4University of Texas at Dallas

Abstract

Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal VIdeo Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, the foundation of VIA is a novel test-time editing adaptation method, which adapts a pre-trained image editing model to improve consistency between potential editing directions and the text instruction, and adapts masked latent variables for precise local control. Furthermore, to maintain global consistency over the video sequence, we introduce spatiotemporal adaptation, which adapts consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects. Extensive experiments demonstrate that, compared to baseline methods, our VIA approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that VIA can achieve consistent editing of minute-long videos, unlocking the potential for advanced editing tasks over long video sequences.

Long Video Editing

VIA is capable of instruction-based editing for videos that are minutes long. In addition, it inherits all the editing abilities of existing image editing models: style transfer, reasoning-based editing, background modification, object insertion, object swapping, etc.

Original Video

Night scene with street light

Snowy landscape

Add a sense of nostalgia

Work of Post-impressionism

Original Video

Driving on a river in a forest

Japanese woodblock print

Driving toward sun during sunset time

Apply a winter theme with tiny snow-capped peaks

Precise Editing

VIA accurately identifies the editing area and preserves the non-target pixels from the input video while faithfully carrying out the edit. This also enables many advanced video editing tasks. For example, whereas other methods can only stylize the whole frame, our model can achieve local stylization.

Original Video

Change to lion

Change to wolf

Make it black

Add some snow

In Van Gogh style

Add snow to the dog

Dog in Van Gogh style

Method Overview

Overview of our VIA framework. For local consistency, Test-time Editing Adaptation finetunes the editing model with augmented editing pairs to ensure editing directions consistent with the text instruction, and Local Latent Adaptation achieves precise editing control and preserves non-target pixels from the input video. For global consistency, Spatiotemporal Adaptation collects key attention variables and applies them across all frames.
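The two consistency mechanisms above can be illustrated with a minimal sketch. This is not the paper's implementation: the functions, the use of plain lists as stand-ins for latents and attention maps, and the nearest-key-frame assignment rule are all hypothetical simplifications, chosen only to make the masked-blending and key-frame-propagation ideas concrete.

```python
def local_latent_adaptation(edited, source, mask):
    """Blend edited values into the source only where the mask is on,
    preserving non-target pixels (hypothetical stand-in for VIA's
    masked latent update; real latents are tensors, not scalars)."""
    return [m * e + (1.0 - m) * s for e, s, m in zip(edited, source, mask)]

def spatiotemporal_adaptation(values, key_indices):
    """Collect per-frame variables (stand-ins for attention maps) at key
    frames and reuse the nearest key frame's variables for every frame,
    a hypothetical propagation scheme for global consistency."""
    out = []
    for t in range(len(values)):
        nearest = min(key_indices, key=lambda k: abs(k - t))
        out.append(values[nearest])
    return out
```

Under this toy model, pixels with mask value 0 come entirely from the source video, and every frame shares its attention variables with a key frame, so the editing effect stays uniform across the sequence.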

Human Evaluation

We compare our model with five previous open-source methods along three aspects. 'Tie' indicates the two models are on par with each other.

BibTeX

@article{gu2024via,
      title={VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing},
      author={Gu, Jing and Fang, Yuwei and Skorokhodov, Ivan and Wonka, Peter and Du, Xinya and Tulyakov, Sergey and Wang, Xin Eric},
      journal={arXiv preprint arXiv:2406.12831},
      year={2024}
}