
Digital Solutions and Artificial Intelligence Technologies


Dynamic Model of Attention in Transformers

https://doi.org/10.26794/3033-7097-2025-1-4-35-42

Abstract

The attention mechanism is a key component of modern artificial neural networks designed to process data of various nature. The article examines the dynamics of attention using a continuous model. In this model, attention is described as the movement of interacting tokens. It is shown that, under suitable assumptions, attention is Lipschitz continuous. In particular, Lipschitz continuity may be ensured by token normalization. The dynamics of transformers is modelled by a system of differential equations, and Lipschitz continuity guarantees that a solution to this system exists. The purpose of the study is to investigate the behavior of the tokens that make up a prompt as the number of transformer layers increases without bound. For one-dimensional tokens, a qualitative description of the token trajectories and the dynamics of the attention matrix is given. It is shown that if a token leaves a fairly narrow corridor at some point (the width of the corridor is on the order of the logarithm of the prompt size), that token tends to infinity (positive or negative, depending on the boundary through which it exited). The research methodology is based on a continuous parameterization of the attention matrix: the common representation of transformer dynamics by difference equations is replaced by a representation using systems of ordinary differential equations. A huge number of publications are devoted to the description and study of transformers, but most of them do not contain accurate mathematical descriptions of the architecture. This article attempts to give a mathematically meaningful and at the same time fairly simple description of attention. Describing the dynamics of one-dimensional tokens is certainly much simpler than describing the dynamics of multidimensional tokens. Nevertheless, this description gives an idea of the behavior of transformers in a more general setting and creates a framework for future investigation.
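The continuous model described in the abstract is commonly written, for example in the work of Geshkovski et al. [12, 15], as a system of ordinary differential equations in which each token x_i(t) moves under softmax-weighted interaction with all the other tokens. The display below follows that literature; the notation Q, K, V for the query, key and value matrices is an assumption about the article's exact conventions:

\frac{\mathrm{d}x_i(t)}{\mathrm{d}t} \;=\; \sum_{j=1}^{n} \frac{\exp\bigl(\langle Q x_i(t),\, K x_j(t)\rangle\bigr)}{\sum_{k=1}^{n} \exp\bigl(\langle Q x_i(t),\, K x_k(t)\rangle\bigr)}\, V x_j(t), \qquad i = 1, \dots, n.

The softmax weights form the continuously parameterized attention matrix, and when the right-hand side is Lipschitz continuous in the tokens (for instance, after token normalization), the Picard–Lindelöf theorem guarantees existence and uniqueness of a solution.

A minimal numerical sketch of the one-dimensional case follows, under the assumption that the 1-d dynamics use the same softmax-weighted ODE with scalar query, key and value weights q, k, v (the model studied in the article may differ in detail). It is meant only to illustrate the qualitative behavior stated in the abstract: tokens that escape a narrow corridor drift off to plus or minus infinity while the rows of the attention matrix saturate.

import numpy as np

def euler_step(x, q, k, v, dt):
    """One explicit Euler step of dx_i/dt = sum_j softmax_j(q*x_i * k*x_j) * v*x_j."""
    logits = (q * x)[:, None] * (k * x)[None, :]    # pairwise attention scores
    logits -= logits.max(axis=1, keepdims=True)     # stabilise the softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)         # row-stochastic attention matrix
    return x + dt * attn @ (v * x), attn

rng = np.random.default_rng(0)
n = 32                                   # prompt length
x = rng.normal(scale=2.0, size=n)        # initial one-dimensional tokens
q = k = v = 1.0                          # illustrative scalar weights
dt, steps = 0.01, 2000

for _ in range(steps):
    x, attn = euler_step(x, q, k, v, dt)

print("corridor width ~ log(n):", np.log(n))
print("token range after integration:", x.min(), x.max())
print("mean row-maximum of the attention matrix:", attn.max(axis=1).mean())

Rerunning the sketch with larger n and different initial spreads gives a rough sense of how the escape behavior depends on the prompt size.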

About the Author

V. B. Gisin
Financial University under the Government of the Russian Federation
Russia

Vladimir B. Gisin — Cand. Sci. (Phys. and Math.), Professor of the Mathematics and Data Analysis Department of the Faculty of Information Technology and Big Data Analysis

Moscow



References

1. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I. Attention is all you need. In: Guyon I., Von Luxburg U., Bengio S., et al., eds. Advances in Neural Information Processing Systems. 2017;30:5998–6008. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

2. Rambelli G., Chersoni E., Testa D., Blache P., Lenci A. Neural generative models and the parallel architecture of language: A critical review and outlook. Topics in Cognitive Science. 2024;17(4):948–961. DOI: 10.1111/tops.12733

3. Turner R. E. An introduction to transformers. ArXiv preprint. 2023; arXiv:2304.10557. DOI: 10.48550/arXiv.2304.10557

4. Amatriain X., Sankar A., Bing J., Bodigutla P. K., Hazen T. J., Kazi M. Transformer models: an introduction and catalog. ArXiv preprint. 2023; arXiv:2302.07730. DOI: 10.48550/arXiv.2302.07730

5. He S., Sun G., Shen Z., Li A. What matters in transformers? Not all attention is needed. ArXiv preprint. 2024; arXiv:2406.15786. DOI: 10.48550/arXiv.2406.15786

6. Passi N., Raj M., Shelke N. A. A review on transformer models: applications, taxonomies, open issues and challenges. 4th Asian Conference on Innovation in Technology (ASIANCON). IEEE, 2024;1–6. DOI: 10.1109/ASIANCON62057.2024.10838047

7. Joshi S. Evaluation of Large Language Models: Review of Metrics, Applications, and Methodologies. Preprint. 2025; DOI: 10.20944/preprints202504.0369.v1

8. Sajun A. R., Zualkernan I., Sankalpa D. A historical survey of advances in transformer architectures. Applied Sciences. 2024;14(10):4316. DOI: 10.3390/app14104316

9. Canchila S., Meneses-Eraso C., Casanoves-Boix J., Cortés-Pellicer P., Castelló-Sirvent F. Natural Language Processing: An Overview of Models, Transformers and Applied Practices. Computer Science and Information Systems. 2024;21(3):1097–1145. DOI: 10.2298/CSIS230217031C

10. Ali A., Schnake T., Eberle O., Montavon G., Müller K. R., Wolf L. XAI for transformers: Better explanations through conservative propagation. International Conference on Machine Learning. Proceedings of Machine Learning Research (PMLR). 2022;435–451. DOI: 10.48550/arXiv.2202.07304

11. Dufter P., Schmitt M., Schütze H. Position information in transformers: An overview. Computational Linguistics. 2022;48(3):733–763. DOI: 10.1162/coli_a_00445

12. Geshkovski B., Letrouit C., Polyanskiy Y., Rigollet P. The emergence of clusters in self-attention dynamics. Advances in Neural Information Processing Systems. 2023;36:57026–57037. DOI: 10.48550/arXiv.2305.05465

13. Sander M. E., Ablin P., Blondel M., Peyré G. Sinkformers: Transformers with doubly stochastic attention. International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research. 2022;3515–3530. DOI: 10.48550/arXiv.2110.11773

14. Kim H., Papamakarios G., Mnih A. The Lipschitz constant of self-attention. International Conference on Machine Learning. Proceedings of Machine Learning Research. 2021;5562–5571. DOI: 10.48550/arXiv.2006.04710

15. Geshkovski B., Letrouit C., Polyanskiy Y., Rigollet P. A mathematical perspective on transformers. Bulletin of the American Mathematical Society. 2025;62(3):427–479. DOI: 10.1090/bull/1863

16. Lu Y., Li Z., He D., et al. Understanding and improving transformer from a multi-particle dynamic system point of view. ArXiv preprint. 2019; arXiv:1906.02762. DOI: 10.48550/arXiv.1906.02762



For citations:


Gisin V.B. Dynamic Model of Attention in Transformers. Digital Solutions and Artificial Intelligence Technologies. 2025;1(4):35-42. (In Russ.) https://doi.org/10.26794/3033-7097-2025-1-4-35-42




This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 3033-7097 (Online)