
dsait

Digital Solutions and Artificial Intelligence Technologies

3033-7097

Financial University under the Government of the Russian Federation

10.26794/3033-7097-2025-1-4-35-42

dsait-29

Research Article

MATHEMATICAL MODELING, NUMERICAL METHODS AND SOFTWARE PACKAGES

Dynamic Model of Attention in Transformers

https://orcid.org/0000-0002-7269-0587

Gisin V. B.

Vladimir B. Gisin, Cand. Sci. (Phys. and Math.), Professor, Department of Mathematics and Data Analysis, Faculty of Information Technology and Big Data Analysis

Moscow

vgisin@fa.ru

Financial University under the Government of the Russian Federation

2025

23.01.2026

Vol. 1, No. 4, pp. 35–42

2026

Gisin V. B.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://www.digitarin.ru/jour/article/view/29

The attention mechanism is the foundation of transformers, a key component of modern artificial neural networks used to process data of various kinds. This article studies a dynamic model of the attention mechanism in which attention is described as the motion of interacting tokens. It is shown that, under suitable assumptions, attention is Lipschitz continuous; in particular, token normalization ensures Lipschitz continuity. The dynamics of a transformer are modelled by a system of ordinary differential equations, and Lipschitz continuity guarantees that this system has a solution. The purpose of the study is to investigate the behavior of the tokens that make up a prompt as the number of transformer layers grows without bound. For one-dimensional tokens, a qualitative description of the token trajectories and of the dynamics of the attention matrix is given. It is shown that if a token leaves a fairly narrow corridor at some moment (the corridor width is of the order of the logarithm of the prompt size), then the token subsequently tends to infinity (positive or negative, depending on which boundary it crossed). The methodology is based on a continuous parameterization of the attention matrix: the common representation of transformer dynamics by difference equations is replaced by systems of ordinary differential equations. A huge number of publications describe and study transformers, but most of them do not give precise mathematical descriptions of the architecture. This article attempts to give a mathematically precise yet fairly simple description of transformer dynamics. The dynamics of one-dimensional tokens are certainly much simpler than those of multidimensional tokens. Nevertheless, they give an idea of how transformers behave in more general settings and provide a framework of precise statements for further investigation.
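The continuous-time picture described in the abstract can be illustrated with a minimal numerical sketch. The abstract does not state the equations, so the right-hand side used here, dx_i/dt = sum_j A_ij(x) x_j with A_ij = softmax_j(x_i x_j), is an assumption borrowed from the common self-attention dynamics formulation with identity query, key, and value weights; it is not necessarily the exact system studied in the article. The function names, step size, and initial tokens are likewise hypothetical.

```python
import math

def attention_matrix(x):
    """Row-stochastic attention matrix for 1-D tokens.

    Assumed form: A[i][j] = softmax over j of x[i] * x[j]
    (identity query/key weights; an illustration, not the article's definition).
    """
    rows = []
    for xi in x:
        scores = [xi * xj for xj in x]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def velocity(x):
    """Right-hand side of the assumed ODE: dx_i/dt = sum_j A_ij(x) * x_j."""
    A = attention_matrix(x)
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def simulate(x0, dt=0.01, steps=300):
    """Explicit Euler integration; each step plays the role of one thin layer."""
    x = list(x0)
    for _ in range(steps):
        v = velocity(x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 1-D prompt of five tokens.
tokens = [-1.5, -0.2, 0.1, 0.4, 2.0]
final = simulate(tokens)
```

Under these assumptions, comparing `final` with `tokens` shows the outermost tokens moving further away from the origin on their respective sides, which is the kind of escape behavior the abstract describes qualitatively. The sketch does not check the logarithmic corridor width and should not be read as a verification of the article's result.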

artificial intelligence; neural network; transformer; attention mechanism; trajectory; interaction of tokens


The author declares no conflict of interest.