Rui Nie1, 2

Jinxiao Lin2

1Beihang University   2AISphere Tech

*Equal Contribution   Corresponding Author

Code (Coming Soon)

Example videos generated by our proposed TrackGo. Given an initial frame, users specify the target moving object(s) or part(s) using free-form masks and indicate the desired movement trajectory with arrows. TrackGo is capable of generating subsequent video frames with precise control. It can handle complex scenarios that involve multiple objects, fine-grained object parts, and sophisticated movement trajectories.

Abstract

Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users a flexible and precise mechanism for manipulating video content. To implement this control, we also propose the TrackAdapter, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention maps of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics, including FVD, FID, and ObjMC.


Video (about 1 min)



Methodology


Our method consists of two parts: an image-to-video (I2V) model and the TrackAdapter.

Top: Pipeline of point trajectory generation. The user's inputs are split into masks and trajectory vectors, where each mask corresponds to one trajectory vector. For each mask area, K × s points are randomly sampled. The trajectory vector is then divided by the number of frames to obtain the relative displacement 𝒯 of each point between adjacent frames, and this information is combined to construct the point trajectories. Bottom: Overview of TrackGo. Built on an image-to-video diffusion model, TrackGo generates videos from the user input 𝐈 and the latent input 𝐳ₜ. Through the point trajectory generation pipeline, point trajectories 𝐏 are obtained from 𝐈. The point trajectories 𝐏 are then passed through the encoder ℰ and injected into the model via the TrackAdapter. The Architecture of TrackAdapter panel describes the computation performed by the TrackAdapter.
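
To make the top pipeline concrete, here is a minimal NumPy sketch of how masks and arrows could be turned into per-frame point trajectories 𝐏. The function name generate_point_trajectories, the points_per_mask argument, and the uniform sampling strategy are our own placeholders under stated assumptions, not the released implementation.

import numpy as np

def generate_point_trajectories(masks, arrows, num_frames, points_per_mask=16, seed=0):
    """Sketch of the point-trajectory pipeline: sample points inside each user mask
    and move them along the paired arrow, split evenly over the video frames.
    Returns an array of shape (num_masks * points_per_mask, num_frames, 2)."""
    rng = np.random.default_rng(seed)
    trajectories = []
    for mask, arrow in zip(masks, arrows):            # one trajectory vector per mask
        ys, xs = np.nonzero(mask)                     # pixel coordinates inside the mask
        idx = rng.choice(len(xs), size=points_per_mask, replace=True)  # K x s points in the figure
        starts = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)
        # relative per-frame displacement (the 𝒯 of the caption): arrow split over the frames
        step = np.asarray(arrow, dtype=np.float32) / (num_frames - 1)
        for p in starts:
            trajectories.append(np.stack([p + t * step for t in range(num_frames)]))
    return np.stack(trajectories)

For the bottom half, the figure describes the TrackAdapter as a lightweight module attached to the temporal self-attention layers. As a rough, non-authoritative PyTorch sketch of that idea: a zero-initialized attention branch that attends from the frame tokens to the encoded trajectory features and adds its output to the frozen layer's result. The class name, the cross-attention formulation, and the residual combination are assumptions; the actual computation is the one shown in the figure.

import torch.nn as nn

class TrackAdapterSketch(nn.Module):
    """Hypothetical lightweight branch attached to a temporal self-attention layer."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)   # zero-init so the pretrained model is unchanged at the start
        nn.init.zeros_(self.out.bias)

    def forward(self, hidden, traj_feat, base_out):
        # hidden:    (B, T, dim) temporal tokens entering the frozen attention layer
        # traj_feat: (B, T, dim) encoded point trajectories from the encoder ℰ
        # base_out:  (B, T, dim) output of the original temporal self-attention
        branch, _ = self.attn(hidden, traj_feat, traj_feat)   # attend to trajectory features
        return base_out + self.out(branch)                    # residual injection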

Camera Motion


TrackGo can also produce camera-motion effects. By simply selecting the entire frame as the motion area, the whole scene can be made to move along a specified trajectory, emulating camera movement.
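
For example, reusing the hypothetical generate_point_trajectories sketch from the Methodology section, camera-like motion would amount to passing a mask that covers the whole frame together with a single arrow (the resolution and frame count below are assumed, not prescribed):

import numpy as np

H, W, num_frames = 576, 1024, 14                   # assumed SVD-style resolution and clip length
full_frame_mask = np.ones((H, W), dtype=bool)      # the entire screen is the motion area
pan_right = (120.0, 0.0)                           # arrow: total (dx, dy) displacement in pixels

camera_trajectories = generate_point_trajectories(
    [full_frame_mask], [pan_right], num_frames=num_frames)
# Every sampled point drifts to the right, which the model renders as a camera pan.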


Experiments


We compare our approach with DragNUWA and DragAnything on the VIPSeg dataset and our internal dataset, using FVD, FID, and ObjMC as evaluation metrics.


                          ----------- VIPSeg -----------   ------ Internal dataset ------
Method         Base Arch    FVD ↓     FID ↓     ObjMC ↓      FVD ↓     FID ↓     ObjMC ↓
DragNUWA       SVD          321.31    30.15     298.98       178.37    38.07     129.80
DragAnything   SVD          294.91    28.16     236.02       169.73    32.85     133.89
TrackGo        SVD          248.27    25.60     191.15       136.11    29.19      79.52
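
ObjMC is commonly reported as the mean distance between point trajectories tracked in the generated video and the ground-truth trajectories. The snippet below follows that common definition as an illustration; it is not necessarily the exact evaluation script behind the numbers above.

import numpy as np

def objmc(pred_tracks, gt_tracks):
    """Mean Euclidean distance between predicted and ground-truth point
    trajectories, each of shape (num_points, num_frames, 2). Lower is better."""
    return float(np.linalg.norm(pred_tracks - gt_tracks, axis=-1).mean())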


Citation

@article{zhou2024trackgo,
  title={TrackGo: A Flexible and Efficient Method for Controllable Video Generation},
  author={Zhou, Haitao and Wang, Chuang and Nie, Rui and Lin, Jinxiao and Yu, Dongdong and Yu, Qian and Wang, Changhu},
  journal={arXiv preprint},
  year={2024}
}