Motion meets Attention: Video Motion Prompts

Videos contain rich spatio-temporal information. Traditional methods for extracting motion, used in tasks such as action recognition, often rely on visual content rather than precise motion features. We refer to this behavior as 'blind motion extraction'; it captures motions of interest inefficiently because it lacks motion-guided cues.

Recently, attention mechanisms have enhanced many computer vision tasks by effectively highlighting salient visual areas. Inspired by this, we propose using a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to activate and modulate motion signals derived from frame differencing maps. This approach generates a sequence of attention maps that enhance the processing of motion-related video content. To ensure temporal continuity and smoothness of the attention maps, we apply pair-wise temporal attention variation regularization to remove unwanted motions (e.g., noise) while preserving important ones. We then take the Hadamard product of each attention map with its corresponding original video frame to highlight the evolving motions of interest over time. These highlighted motions, termed video motion prompts, are subsequently used as inputs to the model instead of the original video frames. We formalize this process as a motion prompt layer and incorporate the regularization term into the loss function to learn better motion prompts.
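The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `motion_prompts` and `temporal_variation`, the per-map max normalization, the fixed `slope` and `shift` values (which would be learnable parameters during training), the choice of the later frame in each pair, and the mean-squared form of the regularizer are all hypothetical choices made for this example.

```python
import numpy as np

def motion_prompts(video, slope=5.0, shift=0.5):
    """Sketch of a motion prompt layer.

    video: float array of shape (T, H, W, C) with values in [0, 1].
    slope, shift: fixed here for illustration; learnable in training.
    """
    # Frame differencing: absolute change between consecutive frames,
    # averaged over color channels -> shape (T-1, H, W).
    diff = np.abs(video[1:] - video[:-1]).mean(axis=-1)
    # Assumption: rescale each differencing map to [0, 1].
    peak = diff.max(axis=(1, 2), keepdims=True)
    diff = diff / np.maximum(peak, 1e-8)
    # Sigmoid with slope and shift turns motion magnitude into attention.
    attn = 1.0 / (1.0 + np.exp(-slope * (diff - shift)))
    # Hadamard product with the original frames (assumption: the later
    # frame of each pair), broadcast over the channel axis.
    prompts = attn[..., None] * video[1:]
    return attn, prompts

def temporal_variation(attn):
    """Mean squared change between consecutive attention maps."""
    return float(np.mean((attn[1:] - attn[:-1]) ** 2))

# Example on random data standing in for a short clip.
rng = np.random.default_rng(0)
video = rng.random((8, 16, 16, 3))
attn, prompts = motion_prompts(video)
penalty = temporal_variation(attn)
```

A larger `slope` makes the attention closer to a hard threshold at `shift`, while a smaller one keeps more low-magnitude motion; the penalty is zero when the attention maps do not change over time.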

This layer serves as an adapter between the model and the video data, bridging the gap between traditional 'blind motion extraction' and the extraction of relevant motions of interest.

Overview of the motion prompt layer. A learnable Power Normalization (PN) function \( f(\cdot) \) modulates motion, controlling how motion is enhanced or dampened in each frame differencing map \( \mathbf{D} \) to highlight relevant movements. The resulting attention maps are multiplied element-wise (\(\odot\)) with the original video frames to produce video motion prompts. We introduce a temporal attention variation regularization term that yields smoother attention maps and, in turn, better motion prompts. This layer can be inserted between the video input and backbones such as TimeSformer, serving as an adapter. Training optimizes both the motion prompt layer and the backbone network using a generic loss function, e.g., cross-entropy, together with the new regularization term.
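One way to write the pipeline in symbols is given below. It is a sketch, not the exact formulation: the slope \( a \), shift \( b \), frame \( \mathbf{F}_t \), prompt \( \mathbf{P}_t \), weight \( \lambda \), the frame index paired with each map, and the squared Frobenius form of the regularizer are notation assumed here for illustration. With \( T \) frames there are \( T-1 \) differencing maps and \( T-2 \) consecutive pairs of attention maps.

\[
\mathbf{A}_t = f(\mathbf{D}_t) = \frac{1}{1 + \exp\!\big(-a\,(\mathbf{D}_t - b)\big)}, \qquad
\mathbf{P}_t = \mathbf{A}_t \odot \mathbf{F}_t,
\]

\[
\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \frac{1}{T-2} \sum_{t=1}^{T-2} \big\lVert \mathbf{A}_{t+1} - \mathbf{A}_t \big\rVert_F^2 .
\]

Here the sigmoid is applied element-wise, and \( \mathcal{L}_{\text{CE}} \) stands for the generic task loss (e.g., cross-entropy) computed by the backbone on the motion prompts.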

Comparison of PN functions on motion modulation in 3D

(a) Action run from HMDB-51.

(b) Action situp from HMDB-51.

(d) Action whip from MPII Cooking 2.

(e) Action (Balance Beam) leap forward with leg change from FineGym.

(f) Action (Uneven Bar) giant circle backward from FineGym.

(g) Action explosion from UCF-Crime.

(h) Action fighting from UCF-Crime.

We compare existing PN functions with ours on motion modulation. The first two columns show consecutive video frames, and the third column displays 3D surface plots of the corresponding frame differencing maps. The fourth to seventh columns show the outputs of Gamma, MaxExp, SigmE, and AsinhE as both attention maps and 3D surface plots. The last column shows outputs from our PN function, which focuses on different motions across video types, such as human actions, fine-grained actions, static and moving cameras, and anomaly detection. For UCF-Crime, we use the slope and shift learned on MPII Cooking 2, as both datasets are captured by static cameras.
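For readers unfamiliar with the baselines named in this caption, the sketch below writes out commonly cited closed forms of these PN functions next to a sigmoid with slope and shift. The exact parameterizations and parameter values used in the comparison are not given here, so the function names, default values, and forms below are assumptions for illustration only.

```python
import math

def gamma_pn(x, gamma=0.5):
    # Gamma: power-law compression, x in [0, 1].
    return x ** gamma

def maxexp_pn(x, eta=4.0):
    # MaxExp: 1 - (1 - x)^eta, x in [0, 1].
    return 1.0 - (1.0 - x) ** eta

def sigme_pn(x, eta=4.0):
    # SigmE: zero-centered sigmoid, maps 0 to 0.
    return 2.0 / (1.0 + math.exp(-eta * x)) - 1.0

def asinhe_pn(x, eta=4.0):
    # AsinhE: inverse hyperbolic sine; not bounded by 1 in this form.
    return math.asinh(eta * x)

def sigmoid_slope_shift(x, slope, shift):
    # Sigmoid with slope and shift (both learnable in the proposed layer).
    return 1.0 / (1.0 + math.exp(-slope * (x - shift)))

# Responses to a weak (0.05) and a strong (0.8) frame-difference value.
weak, strong = 0.05, 0.8
responses = {
    "Gamma": (gamma_pn(weak), gamma_pn(strong)),
    "MaxExp": (maxexp_pn(weak), maxexp_pn(strong)),
    "SigmE": (sigme_pn(weak), sigme_pn(strong)),
    "AsinhE": (asinhe_pn(weak), asinhe_pn(strong)),
    "Sigmoid(slope=20, shift=0.5)": (
        sigmoid_slope_shift(weak, 20.0, 0.5),
        sigmoid_slope_shift(strong, 20.0, 0.5),
    ),
}
```

In these forms the four fixed functions rise steeply near zero and so tend to amplify small differences, whereas a shifted sigmoid can push values below the shift toward zero. This is one plausible reading of why fixed PN functions pick up noisy patterns.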

Comparison of PN functions on motion modulation in 2D

(a) Action run from HMDB-51.

(b) Action situp from HMDB-51.

(d) Action whip from MPII Cooking 2.

(e) Action (Balance Beam) leap forward with leg change from FineGym.

(f) Action (Uneven Bar) giant circle backward from FineGym.

(g) Action explosion from UCF-Crime.

(h) Action fighting from UCF-Crime.

Visualizations include original consecutive frames (first two columns), frame differencing maps (third column), and pairs of attention maps and motion prompts for Gamma, MaxExp, SigmE, and AsinhE (fourth to eleventh columns). The last two columns display our attention maps and motion prompts. Our attention maps (i) depict clear motion regions and (ii) highlight motions of interest and/or contextual environments relevant to the motions, and our motion prompts capture rich motion patterns. Existing PN functions focus only on motions, often capturing noisy patterns without emphasizing context.

BibTeX

@inproceedings{
    chen2024motion,
    title={Motion meets Attention: Video Motion Prompts},
    author={Qixiang Chen and Lei Wang and Piotr Koniusz and Tom Gedeon},
    booktitle={The 16th Asian Conference on Machine Learning (Conference Track)},
    year={2024},
    url={https://openreview.net/forum?id=nIDAT99Vhb}
}

Learnable slope and shift (figure panels: FineGym: balance beam; HMDB-51: Drink; MPII Cooking 2: Whip; UCF-Crime: Explosion)

Effects of temporal attention variation regularization (figure panel: FineGym: uneven bar)