<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">dsait</journal-id><journal-title-group><journal-title xml:lang="ru">Цифровые решения и технологии искусственного интеллекта</journal-title><trans-title-group xml:lang="en"><trans-title>Digital Solutions and Artificial Intelligence Technologies</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">3033-7097</issn><publisher><publisher-name>Финансовый университет при Правительстве Российской Федерации</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.26794/3033-7097-2025-1-4-35-42</article-id><article-id custom-type="elpub" pub-id-type="custom">dsait-29</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>МАТЕМАТИЧЕСКОЕ МОДЕЛИРОВАНИЕ, ЧИСЛЕННЫЕ МЕТОДЫ И КОМПЛЕКСЫ ПРОГРАММ</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>MATHEMATICAL MODELING, NUMERICAL METHODS AND SOFTWARE PACKAGES</subject></subj-group></article-categories><title-group><article-title>Динамическая модель внимания в трансформерах</article-title><trans-title-group xml:lang="en"><trans-title>Dynamic Model of Attention in Transformers</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-7269-0587</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Гисин</surname><given-names>В. Б.</given-names></name><name name-style="western" xml:lang="en"><surname>Gisin</surname><given-names>V. 
B.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Владимир Борисович Гисин — кандидат физико-математических наук, профессор, профессор кафедры математики и анализа данных факультета информационных технологий и анализа больших данных</p><p>Москва</p></bio><bio xml:lang="en"><p>Vladimir B. Gisin, Cand. Sci. (Phys. and Math.), Professor of the Mathematics and Data Analysis Department of the Faculty of Information Technology and Big Data Analysis</p><p>Moscow</p></bio><email xlink:type="simple">vgisin@fa.ru</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Финансовый университет при Правительстве Российской Федерации</institution></aff><aff xml:lang="en"><institution>Financial University under the Government of the Russian Federation</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>23</day><month>01</month><year>2026</year></pub-date><volume>1</volume><issue>4</issue><fpage>35</fpage><lpage>42</lpage><permissions><copyright-statement>Copyright &#x00A9; Гисин В.Б., 2026</copyright-statement><copyright-year>2026</copyright-year><copyright-holder xml:lang="ru">Гисин В.Б.</copyright-holder><copyright-holder xml:lang="en">Gisin V.B.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri 
xlink:href="https://www.digitarin.ru/jour/article/view/29">https://www.digitarin.ru/jour/article/view/29</self-uri><abstract><p>Механизм внимания является основой трансформеров, ключевого компонента современных искусственных нейронных сетей, используемых при работе с данными различной природы. В статье изучается динамическая модель механизма внимания. В рамках этой модели внимание описывается как движение взаимодействующих токенов. Показано, что при подходящих предположениях внимание непрерывно по Липшицу. В частности, непрерывность по Липшицу обеспечивает нормирование токенов. Это служит основанием для исследования решений систем дифференциальных уравнений, описывающих динамику трансформеров. Целью исследования является изучение особенностей поведения токенов, составляющих промт, при неограниченном увеличении числа слоев трансформера. В одномерном случае приведено качественное описание траекторий токенов и динамики матрицы внимания. Показано, что если токен в некоторый момент времени выходит за границу достаточно узкого коридора (ширины порядка логарифма размера промта), то этот токен в дальнейшем стремится к бесконечности (положительной или отрицательной в зависимости от того, через какую границу произошел выход). Методология исследования базируется на непрерывной параметризации матрицы внимания. Распространенное представление динамики трансформеров разностными уравнениями заменено представлением с помощью систем обыкновенных дифференциальных уравнений. Описанию и изучению трансформеров посвящено огромное число публикаций, но большинство из них не содержат точных математических описаний архитектуры. В этой статье сделана попытка дать математически точное и при этом достаточно простое описание динамики трансформеров. Динамика токенов в одномерном случае, безусловно, значительно проще, чем динамика многомерных токенов. 
Тем не менее она дает представление о поведении трансформеров и в более общей ситуации создает каркас из точных формулировок.</p></abstract><trans-abstract xml:lang="en"><p>The attention mechanism is a key component of modern artificial neural networks designed to process data of various kinds. The article examines the dynamics of attention using a continuous model. In this model, attention is described as the movement of interacting tokens. It is shown that, under suitable assumptions, attention is Lipschitz continuous. In particular, Lipschitz continuity may be ensured by token normalization. The dynamics of transformers is modelled by a system of differential equations. Lipschitz continuity guarantees that there exists a solution to this system. The purpose of the study is to investigate the behavior of the tokens that make up a prompt as the number of transformer layers increases without bound. For one-dimensional tokens, a qualitative description of the trajectories of tokens and the dynamics of the attention matrix is given. It is shown that if a token leaves a fairly narrow corridor at some point in time (the width of the corridor is on the order of the logarithm of the prompt size), the token subsequently tends to infinity (positive or negative, depending on which boundary it crossed). The research methodology is based on a continuous parameterization of the attention matrix. The common representation of transformer dynamics by difference equations is replaced by a representation using systems of ordinary differential equations. A huge number of publications are devoted to the description and study of transformers, but most of them do not contain precise mathematical descriptions of the architecture. This article attempts to give a mathematically precise and at the same time fairly simple description of attention. The description of the dynamics of one-dimensional tokens is certainly much simpler than in the multidimensional case. 
Nevertheless, this description gives an idea of the behavior of transformers in a more general situation and creates a framework for future investigation.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>искусственный интеллект</kwd><kwd>нейронная сеть</kwd><kwd>трансформер</kwd><kwd>механизм внимания</kwd><kwd>траектория</kwd><kwd>взаимодействие токенов</kwd></kwd-group><kwd-group xml:lang="en"><kwd>artificial intelligence</kwd><kwd>neural network</kwd><kwd>transformer</kwd><kwd>mechanism of attention</kwd><kwd>trajectory</kwd><kwd>interaction of tokens</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I. Attention is all you need. In: Guyon I., Von Luxburg U., Bengio S., et al., eds. Advances in Neural Information Processing Systems. 2017;30:5998–6008. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf</mixed-citation><mixed-citation xml:lang="en">Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I. Attention is all you need. In: Guyon I., Von Luxburg U., Bengio S., et al., eds. Advances in Neural Information Processing Systems. 2017;30:5998–6008. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Rambelli G., Chersoni E., Testa D., Blache P., Lenci A. Neural generative models and the parallel architecture of language: A critical review and outlook. Topics in Cognitive Science. 2024;17(4):948–961. DOI: 10.1111/tops.12733</mixed-citation><mixed-citation xml:lang="en">Rambelli G., Chersoni E., Testa D., Blache P., Lenci A. Neural generative models and the parallel architecture of language: A critical review and outlook. 
Topics in Cognitive Science. 2024;17(4):948–961. DOI: 10.1111/tops.12733</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Turner R. E. An introduction to transformers. ArXiv preprint. 2023; arXiv:2304.10557. DOI: 10.48550/arXiv.2304.10557</mixed-citation><mixed-citation xml:lang="en">Turner R. E. An introduction to transformers. ArXiv preprint. 2023; arXiv:2304.10557. DOI: 10.48550/arXiv.2304.10557</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Amatriain X., Sankar A., Bing J., Bodigutla P. K., Hazen T. J., Kazi M. Transformer models: an introduction and catalog. ArXiv preprint. 2023; arXiv:2302.07730. DOI: 10.48550/arXiv.2302.07730</mixed-citation><mixed-citation xml:lang="en">Amatriain X., Sankar A., Bing J., Bodigutla P. K., Hazen T. J., Kazi M. Transformer models: an introduction and catalog. ArXiv preprint. 2023; arXiv:2302.07730. DOI: 10.48550/arXiv.2302.07730</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">He S., Sun G., Shen Z., Li A. What matters in transformers? Not all attention is needed. ArXiv preprint. 2024; arXiv:2406.15786. DOI: 10.48550/arXiv.2406.15786</mixed-citation><mixed-citation xml:lang="en">He S., Sun G., Shen Z., Li A. What matters in transformers? Not all attention is needed. ArXiv preprint. 2024; arXiv:2406.15786. DOI: 10.48550/arXiv.2406.15786</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Passi N., Raj M., Shelke N. A. A review on transformer models: applications, taxonomies, open issues and challenges. 4th Asian Conference on Innovation in Technology (ASIANCON). IEEE, 2024;1–6. DOI: 10.1109/ASIANCON62057.2024.10838047</mixed-citation><mixed-citation xml:lang="en">Passi N., Raj M., Shelke N. A. 
A review on transformer models: applications, taxonomies, open issues and challenges. 4th Asian Conference on Innovation in Technology (ASIANCON). IEEE, 2024;1–6. DOI: 10.1109/ASIANCON62057.2024.10838047</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Joshi S. Evaluation of Large Language Models: Review of Metrics, Applications, and Methodologies. Preprint. 2025; DOI: 10.20944/preprints202504.0369.v1</mixed-citation><mixed-citation xml:lang="en">Joshi S. Evaluation of Large Language Models: Review of Metrics, Applications, and Methodologies. Preprint. 2025; DOI: 10.20944/preprints202504.0369.v1</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Sajun A. R., Zualkernan I., Sankalpa D. A historical survey of advances in transformer architectures. Applied Sciences. 2024;14(10):4316. DOI: 10.3390/app14104316</mixed-citation><mixed-citation xml:lang="en">Sajun A. R., Zualkernan I., Sankalpa D. A historical survey of advances in transformer architectures. Applied Sciences. 2024;14(10):4316. DOI: 10.3390/app14104316</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Canchila S., Meneses-Eraso C., Casanoves-Boix J., Cortés-Pellicer P., Castelló-Sirvent F. Natural Language Processing: An Overview of Models, Transformers and Applied Practices. Computer Science and Information Systems. 2024;21(3):1097–1145. DOI: 10.2298/CSIS230217031C</mixed-citation><mixed-citation xml:lang="en">Canchila S., Meneses-Eraso C., Casanoves-Boix J., Cortés-Pellicer P., Castelló-Sirvent F. Natural Language Processing: An Overview of Models, Transformers and Applied Practices. Computer Science and Information Systems. 2024;21(3):1097–1145. 
DOI: 10.2298/CSIS230217031C</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Ali A., Schnake T., Eberle O., Montavon G., Müller K. R., Wolf L. XAI for transformers: Better explanations through conservative propagation. International Conference on Machine Learning. Proceedings of Machine Learning Research (PMLR). 2022;435–451. DOI: 10.48550/arXiv.2202.07304</mixed-citation><mixed-citation xml:lang="en">Ali A., Schnake T., Eberle O., Montavon G., Müller K. R., Wolf L. XAI for transformers: Better explanations through conservative propagation. International Conference on Machine Learning. Proceedings of Machine Learning Research (PMLR). 2022;435–451. DOI: 10.48550/arXiv.2202.07304</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Dufter P., Schmitt M., Schütze H. Position information in transformers: An overview. Computational Linguistics. 2022;48(3):733–763. DOI: 10.1162/coli_a_00445</mixed-citation><mixed-citation xml:lang="en">Dufter P., Schmitt M., Schütze H. Position information in transformers: An overview. Computational Linguistics. 2022;48(3):733–763. DOI: 10.1162/coli_a_00445</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Geshkovski B., Letrouit C., Polyanskiy Y., Rigollet P. The emergence of clusters in self-attention dynamics. Advances in Neural Information Processing Systems. 2023;36:57026–57037. DOI: 10.48550/arXiv.2305.05465</mixed-citation><mixed-citation xml:lang="en">Geshkovski B., Letrouit C., Polyanskiy Y., Rigollet P. The emergence of clusters in self-attention dynamics. Advances in Neural Information Processing Systems. 2023;36:57026–57037. 
DOI: 10.48550/arXiv.2305.05465</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Sander M. E., Ablin P., Blondel M., Peyré G. Sinkformers: Transformers with doubly stochastic attention. International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, 2022:3515–3530. DOI: 10.48550/arXiv.2110.11773</mixed-citation><mixed-citation xml:lang="en">Sander M. E., Ablin P., Blondel M., Peyré G. Sinkformers: Transformers with doubly stochastic attention. International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, 2022:3515–3530. DOI: 10.48550/arXiv.2110.11773</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Kim H., Papamakarios G., Mnih A. The Lipschitz constant of self-attention. International Conference on Machine Learning. Proceedings of Machine Learning Research. 2021;5562–5571. DOI: 10.48550/arXiv.2006.04710</mixed-citation><mixed-citation xml:lang="en">Kim H., Papamakarios G., Mnih A. The Lipschitz constant of self-attention. International Conference on Machine Learning. Proceedings of Machine Learning Research. 2021;5562–5571. DOI: 10.48550/arXiv.2006.04710</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Geshkovski B., Letrouit C., Polyanskiy Y., Rigollet P. A mathematical perspective on transformers. Bulletin of the American Mathematical Society. 2025;62(3):427–479. DOI: 10.1090/bull/1863</mixed-citation><mixed-citation xml:lang="en">Geshkovski B., Letrouit C., Polyanskiy Y., Rigollet P. A mathematical perspective on transformers. Bulletin of the American Mathematical Society. 2025;62(3):427–479. 
DOI: 10.1090/bull/1863</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Lu Y., Li Z., He D., et al. Understanding and improving transformer from a multi-particle dynamic system point of view. ArXiv preprint. 2019; arXiv:1906.02762. DOI: 10.48550/arXiv.1906.02762</mixed-citation><mixed-citation xml:lang="en">Lu Y., Li Z., He D., et al. Understanding and improving transformer from a multi-particle dynamic system point of view. ArXiv preprint. 2019; arXiv:1906.02762. DOI: 10.48550/arXiv.1906.02762</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The author declares that there are no conflicts of interest.</p></fn></fn-group></back></article>
