Overview of the motion prompt layer. A learnable Power Normalization (PN) function \( f(\cdot) \) modulates each frame differencing map \( \mathbf{D} \), enhancing or dampening motion to highlight relevant movements. The resulting attention maps are multiplied element-wise (\(\odot\)) with the original video frames to produce video motion prompts. We introduce a temporal attention variation regularization term that encourages smoother attention maps, and hence better motion prompts. The layer can be inserted between the video input and a backbone such as TimeSformer, serving as an adapter. Training jointly optimizes the motion prompt layer and the backbone using a generic loss function, e.g., cross-entropy, together with the new regularization term.
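The pipeline above can be sketched in a few lines. This is a minimal NumPy illustration, not the released implementation: the sigmoid-style form of \( f(\cdot) \) and the `slope`/`shift` parameter names are assumptions standing in for the layer's learnable parameters.

```python
import numpy as np

def power_norm(d, slope, shift):
    # Assumed sigmoid-style PN: maps frame-difference magnitudes to (0, 1).
    # `slope` and `shift` stand in for the layer's learnable parameters.
    return 1.0 / (1.0 + np.exp(-slope * (d - shift)))

def motion_prompts(frames, slope=5.0, shift=0.2):
    # frames: (T, H, W) grayscale video with values in [0, 1].
    diffs = np.abs(np.diff(frames, axis=0))  # frame differencing maps D
    attn = power_norm(diffs, slope, shift)   # attention maps f(D)
    prompts = attn * frames[1:]              # element-wise product with frames
    return attn, prompts

rng = np.random.default_rng(0)
frames = rng.random((4, 8, 8))
attn, prompts = motion_prompts(frames)
print(attn.shape, prompts.shape)  # (3, 8, 8) (3, 8, 8)
```

In a real adapter setting, `slope` and `shift` would be trainable tensors updated jointly with the backbone, and `prompts` would be fed to the backbone in place of the raw frames.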
Learnable slope and shift
Effects of temporal attention variation regularization
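A minimal sketch of a temporal variation penalty of this kind. The squared-difference form over consecutive attention maps is an assumption for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def temporal_attention_variation(attn):
    # attn: (T, H, W) stack of attention maps.
    # Penalizes frame-to-frame changes so attention varies smoothly over time.
    # Assumed squared-difference form; the actual regularizer may differ.
    return float(np.mean((attn[1:] - attn[:-1]) ** 2))

smooth = np.tile(np.linspace(0, 1, 16).reshape(4, 4), (3, 1, 1))  # identical maps
noisy = np.random.default_rng(0).random((3, 4, 4))
print(temporal_attention_variation(smooth))  # 0.0 — no temporal variation
print(temporal_attention_variation(noisy))   # > 0
```

Adding this term to the task loss discourages flickering attention, which is what yields the smoother maps shown here.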
Comparison of PN functions on motion modulation in 3D
We compare existing PN functions with ours on motion modulation. The first two columns show consecutive video frames; the third shows 3D surface plots of the corresponding frame differencing maps. The 4th to 7th columns show the output attention maps and their 3D surface plots for Gamma, MaxExp, SigmE, and AsinhE. The last column shows the output of our PN function, which adapts to different motions across video types, such as human actions, fine-grained actions, static and moving cameras, and anomaly detection. For UCF-Crime, we reuse the slope and shift learned on MPII Cooking 2, as both datasets are captured by static cameras.
Comparison of PN functions on motion modulation in 2D
Visualizations include the original consecutive frames (first two columns), frame differencing maps (third column), and pairs of attention maps and motion prompts for Gamma, MaxExp, SigmE, and AsinhE (fourth to eleventh columns). The last two columns display our attention maps and motion prompts. Our attention maps (i) depict clear motion regions and (ii) highlight motions of interest and/or the contextual environments relevant to those motions, and our motion prompts capture rich motion patterns. Existing PN functions focus only on motion, often capturing noisy patterns without emphasizing context.
@article{chen2024motion,
title={Motion meets Attention: Video Motion Prompts},
author={Chen, Qixiang and Wang, Lei and Koniusz, Piotr and Gedeon, Tom},
journal={arXiv preprint arXiv:2407.03179},
year={2024}
}