Yueping Wang
Department of Basic, Zhengzhou Institute of Science and Technology, Zhengzhou 450064, China
Received: July 20, 2025 Accepted: August 21, 2025 Publication Date: September 6, 2025
Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.
This paper proposes a fine-grained, text-guided cross-modal style transfer framework based on capsule networks, addressing the shortcomings of existing methods in multi-modal feature alignment, style decoupling, and fine-grained control. First, a text-image collaborative encoder is constructed that uses a dynamic-routing capsule network to encode text attributes into a set of interpretable style and content capsules, achieving explicit separation of style from content. Second, a cross-modal attention style-injection module is designed that uses the routing coefficients between capsules to map text style information precisely onto the image content representation, supporting fine-grained adjustment of local attributes such as color, texture, and brushstroke. Finally, contrastive-learning constraints are introduced to ensure the consistency and authenticity of the transferred style. Empirical results on public benchmarks demonstrate that the proposed approach achieves markedly better style controllability, content preservation, and visual fidelity than state-of-the-art competitors, opening a novel avenue for steerable cross-modal generation.
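To make the two core ideas in the abstract concrete, the following is a minimal PyTorch sketch of (1) a dynamic-routing capsule encoder that turns token-level text features into a small set of capsules, and (2) a style-injection step that uses the resulting routing coefficients to modulate image features. All module names, dimensions, and the AdaIN-style modulation used here are illustrative assumptions for exposition only, not the authors' implementation.

```python
# Sketch only: class names (TextCapsuleEncoder, StyleInjection), feature sizes,
# and the AdaIN-style modulation are assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def squash(s, dim=-1, eps=1e-8):
    """Standard capsule squashing non-linearity."""
    sq_norm = (s * s).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)


class TextCapsuleEncoder(nn.Module):
    """Routes token-level text features into a set of style/content capsules."""

    def __init__(self, token_dim=512, num_caps=8, caps_dim=64, routing_iters=3):
        super().__init__()
        self.num_caps, self.caps_dim, self.iters = num_caps, caps_dim, routing_iters
        self.u_hat = nn.Linear(token_dim, num_caps * caps_dim)  # per-capsule predictions

    def forward(self, tokens):                       # tokens: (B, T, token_dim)
        B, T, _ = tokens.shape
        u = self.u_hat(tokens).view(B, T, self.num_caps, self.caps_dim)
        b = tokens.new_zeros(B, T, self.num_caps)    # routing logits
        for _ in range(self.iters):
            c = F.softmax(b, dim=-1)                 # coupling (routing) coefficients
            s = (c.unsqueeze(-1) * u).sum(dim=1)     # (B, num_caps, caps_dim)
            v = squash(s)
            b = b + (u * v.unsqueeze(1)).sum(-1)     # agreement update
        return v, c                                  # capsules + routing coefficients


class StyleInjection(nn.Module):
    """Weights capsules by their routing coefficients and uses the resulting style
    code to scale/shift image features (an AdaIN-style stand-in for the paper's
    cross-modal attention injection)."""

    def __init__(self, caps_dim=64, img_channels=256):
        super().__init__()
        self.to_gamma = nn.Linear(caps_dim, img_channels)
        self.to_beta = nn.Linear(caps_dim, img_channels)

    def forward(self, img_feat, capsules, coupling):
        # img_feat: (B, C, H, W); capsules: (B, K, D); coupling: (B, T, K)
        w = coupling.mean(dim=1, keepdim=True)       # (B, 1, K) attention per capsule
        style = torch.bmm(w, capsules).squeeze(1)    # (B, D) weighted style code
        gamma = self.to_gamma(style)[:, :, None, None]
        beta = self.to_beta(style)[:, :, None, None]
        mu = img_feat.mean(dim=(2, 3), keepdim=True)
        sigma = img_feat.std(dim=(2, 3), keepdim=True) + 1e-6
        return gamma * (img_feat - mu) / sigma + beta


if __name__ == "__main__":
    text = torch.randn(2, 16, 512)                   # stand-in for CLIP-like token features
    image = torch.randn(2, 256, 32, 32)              # stand-in for an image feature map
    caps, coupling = TextCapsuleEncoder()(text)
    out = StyleInjection()(image, caps, coupling)
    print(out.shape)                                 # torch.Size([2, 256, 32, 32])
```

In this sketch the routing coefficients serve the role the abstract assigns to them: they decide how strongly each text token contributes to each capsule, and the same coefficients then weight the capsules when the style code is injected into the image representation. The contrastive-learning constraints mentioned in the abstract are omitted here.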