Sujun Zhou, Ruiwen Liu, Shenlei Shi, and Dongyan Wang
Zaozhuang Vocational College, Zaozhuang, Shandong, 277800, China
Received: November 7, 2025 Accepted: January 6, 2026 Publication Date: March 5, 2026
Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.
English-speaking anxiety significantly affects learners' communication, which makes automated diagnosis important. Existing multimodal approaches are limited in deep feature extraction and modality fusion. This study proposes MSATC, a co-attention-based multimodal fusion model. On the speech side, the S-ABHC model extracts deep acoustic features from MFCCs, spectrograms, and raw waveforms using BiGRU and HuBERT, while RoBERTa captures text sentiment. The co-attention module enables bidirectional, adaptive feature interaction between the two modalities, producing discriminative joint representations. Experiments on IEMOCAP show 77.41% weighted accuracy and 78.66% unweighted accuracy, outperforming existing models.
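To make the idea of bidirectional co-attention concrete, the following is a minimal NumPy sketch of one common formulation: a shared affinity matrix between speech frames and text tokens is normalized in both directions, so each modality attends over the other. It is not the authors' MSATC implementation. The function name `co_attention`, the projection matrices `w_s` and `w_t`, the feature dimensions, and the mean pooling are all assumptions chosen for illustration, and random arrays stand in for the S-ABHC and RoBERTa outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def co_attention(speech, text, w_s, w_t):
    """Hypothetical bidirectional co-attention between two feature sequences.

    speech: (Ts, d) speech-frame features (stand-in for acoustic features)
    text:   (Tt, d) token features (stand-in for text features)
    w_s, w_t: (d, k) projections into a shared space (assumed, randomly set here)
    """
    q_s = speech @ w_s                      # (Ts, k)
    q_t = text @ w_t                        # (Tt, k)
    scale = np.sqrt(q_s.shape[-1])
    affinity = (q_s @ q_t.T) / scale        # (Ts, Tt) pairwise similarity

    # Speech-to-text: each speech frame distributes attention over text tokens.
    s2t = softmax(affinity, axis=1)         # rows sum to 1
    # Text-to-speech: each text token distributes attention over speech frames.
    t2s = softmax(affinity.T, axis=1)       # rows sum to 1

    speech_ctx = s2t @ text                 # (Ts, d) text context per frame
    text_ctx = t2s @ speech                 # (Tt, d) speech context per token

    # Assumed fusion: concatenate each sequence with its context, mean-pool
    # over time, then join both pooled vectors into one representation.
    speech_vec = np.concatenate([speech, speech_ctx], axis=-1).mean(axis=0)
    text_vec = np.concatenate([text, text_ctx], axis=-1).mean(axis=0)
    joint = np.concatenate([speech_vec, text_vec])  # (4 * d,)
    return joint, s2t, t2s


rng = np.random.default_rng(0)
d, k = 16, 8
speech_feats = rng.normal(size=(50, d))     # 50 frames of placeholder features
text_feats = rng.normal(size=(12, d))       # 12 tokens of placeholder features
w_s = rng.normal(size=(d, k))
w_t = rng.normal(size=(d, k))

joint, s2t, t2s = co_attention(speech_feats, text_feats, w_s, w_t)
print(joint.shape, s2t.shape, t2s.shape)
```

In a trained model the projections would be learned and the joint vector would feed a classifier; here they are random, so only the shapes and the two-way normalization are meaningful.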