Multimodal Telegram Bot Based on LLM Orchestrator: Architecture, Economics of Limits and Impact on User Experience
Abstract
Multimodal chatbots on the Telegram platform, orchestrated by a Large Language Model (LLM), fuse text, image and speech processing and thereby close the gap in natural multi-channel interaction.
Purpose. To design and analyse such a chatbot architecture, to identify resource constraints (the "economics of limits") and to evaluate their impact on user experience.
Method. Following a literature review (2023–2025), a prototype was built in Python on the Telegram Bot API around a GPT‑4-class LLM orchestrator with computer vision, ASR/TTS and Retrieval-Augmented Generation (RAG) modules. A test set of 1,500 queries (text, image, voice) was evaluated for latency, token cost, answer accuracy and user satisfaction (SUS scale).
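For illustration, a minimal sketch of the modality-routing layer described above, written with the python-telegram-bot library; the article names only Python and the Telegram Bot API, so the library choice and the helpers ask_llm, describe_image and transcribe_voice are assumptions, not the authors' code:

```python
# Minimal sketch: route Telegram text, photo and voice messages to the
# corresponding modules before the LLM orchestrator step. All helpers below
# are illustrative stubs, not the prototype's actual implementation.
from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, MessageHandler, filters


async def ask_llm(prompt: str) -> str:
    # Stand-in for the GPT-4-class orchestrator call (RAG lookup, tool choice, etc.).
    return f"(LLM answer to: {prompt})"


async def describe_image(file_id: str) -> str:
    # Stand-in for the computer-vision module that captions the image.
    return f"(caption for image {file_id})"


async def transcribe_voice(file_id: str) -> str:
    # Stand-in for the ASR module that converts a voice note to text.
    return f"(transcript of voice message {file_id})"


async def on_text(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    await update.message.reply_text(await ask_llm(update.message.text))


async def on_photo(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    caption = await describe_image(update.message.photo[-1].file_id)
    await update.message.reply_text(await ask_llm(caption))


async def on_voice(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    transcript = await transcribe_voice(update.message.voice.file_id)
    await update.message.reply_text(await ask_llm(transcript))


app = ApplicationBuilder().token("TELEGRAM_BOT_TOKEN").build()
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, on_text))
app.add_handler(MessageHandler(filters.PHOTO, on_photo))
app.add_handler(MessageHandler(filters.VOICE, on_voice))
app.run_polling()
```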
Results. Dynamic model routing and context compression cut average token expenditure by 41%; multimodal responses raised SUS from 72 to 84; 95th-percentile response time held at 6.8 s. A hybrid knowledge store reduced hallucinations by 36%.
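The routing and compression gains reported above can be illustrated with a simple heuristic. The sketch below is only an assumption-laden illustration: the model names, token budget and complexity rule are placeholders, not parameters taken from the study.

```python
# Hedged sketch of "dynamic model routing" and "context compression":
# thresholds, model tiers and the complexity heuristic are illustrative assumptions.
from dataclasses import dataclass, field

CHEAP_MODEL = "small-llm"        # assumed low-cost tier
STRONG_MODEL = "gpt-4-class"     # assumed high-capability tier
MAX_CONTEXT_TOKENS = 3000        # illustrative context budget


def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token.
    return max(1, len(text) // 4)


def route_model(query: str, has_image: bool, has_voice: bool) -> str:
    # Long or multimodal queries go to the stronger (more expensive) model.
    if has_image or has_voice or estimate_tokens(query) > 200:
        return STRONG_MODEL
    return CHEAP_MODEL


@dataclass
class Dialogue:
    turns: list[str] = field(default_factory=list)

    def compressed_context(self) -> list[str]:
        # Keep the most recent turns that fit the budget; in a fuller
        # implementation the older turns would be replaced by a short summary.
        kept, used = [], 0
        for turn in reversed(self.turns):
            cost = estimate_tokens(turn)
            if used + cost > MAX_CONTEXT_TOKENS:
                kept.append("[summary of earlier conversation]")
                break
            kept.append(turn)
            used += cost
        return list(reversed(kept))
```

Under such a heuristic, short text-only queries stay on the cheaper tier and the prompt never exceeds the token budget, which is the kind of mechanism behind the reported reduction in token expenditure.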
Conclusion. Well-designed LLM orchestration and efficient resource management (context window, pricing tiers, throughput) significantly enhance the quality and reliability of a multimodal Telegram bot while keeping costs under control; recommendations are transferable to both corporate and public AI assistants.
About the Authors
D. A. Zaitsev
Russian Federation
Denis A. Zaitsev - Student of the Faculty of International Economic Relations
Moscow
A. V. Prudnikov
Russian Federation
Andrey V. Prudnikov - Student of the Faculty of International Economic Relations
Moscow
M. B. Khripunova
Russian Federation
Marina B. Khripunova - Cand. Sci. (Phys. and Math.), Assoc. Prof., Assoc. Prof. of the Department of Mathematics and Data Analysis
Moscow
L. A. Shmeleva
Russian Federation
Lyudmila A. Shmeleva - Cand. Sci. (Econ.), Assoc. Prof., Assoc. Prof. of the Department of Operational and Industry Management, Faculty of Higher School of Management
Moscow
For citations:
Zaitsev D.A., Prudnikov A.V., Khripunova M.B., Shmeleva L.A. Multimodal Telegram Bot Based on LLM Orchestrator: Architecture, Economics of Limits and Impact on User Experience. Digital Solutions and Artificial Intelligence Technologies. 2025;1(2):6-17. (In Russ.)
