Journal of Applied Science and Engineering

Published by Tamkang University Press


Yingcun Wang1, Rong Wang2, Shizhong Liu3, and Zhi Zeng4

1School of Information Engineering, Weifang Vocational College, Weifang 261031, China

2School of International Business, Weifang Vocational College, Weifang 262737, China

3Weifang Vocational College, Weifang 261031, China

4Shandong Qiaotong Tianxia Network Technology Co., Ltd, Weifang 261061, China


 

Received: May 2, 2025
Accepted: July 13, 2025
Publication Date: July 30, 2025

Copyright © The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.


DOI: https://doi.org/10.6180/jase.202603_29(4).0009


Current action recognition research relies excessively on deterministic feature-correlation mechanisms and struggles with core challenges in video stream data, including spatiotemporal heterogeneity, action-category ambiguity, and cross-frame semantic discontinuity. To this end, this study proposes the Deep Robust Recognition Transformer (DAUfomer), which reconstructs the action recognition paradigm through three synergistic modules. The multi-granularity feature extraction module employs a Transformer with dual attention to extract low-dimensional, high-information-density spatiotemporal features from high-dimensional video streams, preserving local motion details while establishing global contextual correlations. The uncertainty-driven spatiotemporal aggregation module constructs a hybrid Gaussian-Dirichlet distribution model that transforms deterministic spatiotemporal attention into a probabilistic, learnable Bayesian network, enabling dynamic adaptation to data distribution shifts through latent-space uncertainty quantification. The proactive semantic enhancement architecture breaks the traditional causal constraint in temporal modeling with a bidirectional temporal distillation mechanism: it leverages latent semantic cues from future frames to construct cross-frame attention correlation graphs and enhances current action features via gated-recurrent-unit-based spatiotemporal context refinement. Finally, extensive experiments on two real-world datasets, including a 4.48% accuracy improvement over the second-best result on the MS Dance Action dataset, verify that DAUfomer establishes a new baseline for the action recognition task.
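To make the two probabilistic ideas in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: attention logits are drawn from a learned Gaussian so that their softmax-normalized samples serve as a simple stand-in for a Dirichlet-like distribution over attention weights, and a bidirectional GRU feeds future-frame cues back into current-frame features. All module and parameter names here (ProbabilisticAttention, FutureRefiner, n_samples, the variance head) are illustrative assumptions.

```python
# Minimal sketch (not the paper's released code) of uncertainty-driven
# attention and GRU-based future-context refinement.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProbabilisticAttention(nn.Module):
    """Attention whose logits are sampled from a learned Gaussian.

    Softmax over sampled logits yields weights on the simplex; this is one
    simple approximation of a Dirichlet-like distribution over attention
    weights (an assumption, not the paper's exact construction).
    """

    def __init__(self, dim: int, n_samples: int = 4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.log_var = nn.Linear(dim, dim)  # per-token uncertainty head
        self.n_samples = n_samples
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) spatiotemporal features
        q, k, v = self.q(x), self.k(x), self.v(x)
        mean_logits = q @ k.transpose(-2, -1) * self.scale
        # Heuristic: derive a positive logit std from the uncertainty head.
        std = (0.5 * self.log_var(x)).exp() @ k.transpose(-2, -1).abs() * self.scale
        out = 0.0
        for _ in range(self.n_samples):
            eps = torch.randn_like(mean_logits)  # reparameterization trick
            attn = F.softmax(mean_logits + eps * std, dim=-1)
            out = out + attn @ v
        return out / self.n_samples  # Monte Carlo average over samples


class FutureRefiner(nn.Module):
    """Refines current-frame features with future-frame context via a GRU."""

    def __init__(self, dim: int):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim); the backward GRU direction carries
        # future-frame semantic cues back to each current frame.
        ctx, _ = self.gru(frames)
        return frames + torch.sigmoid(self.gate(ctx)) * frames  # gated residual


if __name__ == "__main__":
    feats = torch.randn(2, 16, 64)  # 2 clips, 16 frames, 64-dim features
    feats = ProbabilisticAttention(64)(feats)
    print(FutureRefiner(64)(feats).shape)  # torch.Size([2, 16, 64])
```

Averaging over several sampled attention maps is what lets uncertainty in the latent space soften over-confident frame correlations; the gated residual keeps the future-context signal from overwriting the current frame's own features.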


Keywords: Multimedia action recognition; probability-driven aggregation; proactive semantic enhancement




    



 
