Modern Tokenization Methods for Text Processing in the Financial Domain
https://doi.org/10.26794/3033-7097-2025-1-3-19-29
Abstract
The paper discusses tokenization as a key step in textual data processing, particularly in the financial domain. Current tokenization techniques are analyzed with examples from recent research, together with their impact on the performance of NLP models. The study shows that subword tokenization algorithms (BPE, WordPiece, Unigram) have become the standard for language models owing to their flexibility and text compression efficiency. We discuss the limitations of these approaches (BPE and WordPiece tend to over-segment text, Unigram requires a more complex training procedure, and character-level tokenization produces excessively long sequences), the resulting constraints on input sequence length in language models, and methods to overcome them, including text partitioning, hierarchical processing, and extrapolation of pre-trained Transformer models to long inputs. For financial data, it is recommended to use domain-specific tokenizers or to perform additional training on specialized corpora, which is confirmed by the successful experience of BloombergGPT. Special attention is paid to the problem of processing long texts, for which three approaches are considered: text partitioning, hierarchical processing, and extrapolation of transformer models. In conclusion, the importance of tokenization for financial analytics is emphasized, where the quality of text processing directly affects decision-making. Tokenization methods continue to develop in parallel with improvements in NLP models, which makes this stage of text processing a critical component of modern analytical systems.
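The two Python sketches below are illustrative only and are not taken from the paper. The first follows the BPE merge-learning procedure described by Sennrich et al. [6] on a hypothetical toy corpus; the second shows the simplest form of text partitioning, splitting a token sequence into overlapping windows that fit a fixed context limit, in the spirit of chunking approaches for long documents [4]. All function names, parameters, and sample data are assumptions introduced for illustration.

```python
# Illustrative sketch: learning BPE merge rules on a tiny, hypothetical
# financial-text corpus, following the procedure of Sennrich et al. [6].
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs across the word vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge the chosen symbol pair into a single symbol in every word."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is represented as space-separated characters plus an
# end-of-word marker, so merges never cross word boundaries.
corpus = "interest rate rises rate cut interest income".split()
vocab = Counter(" ".join(list(w)) + " </w>" for w in corpus)

merges = []
for _ in range(10):                      # merge budget ~ target vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)                            # learned merge rules (symbol pairs)


# Illustrative sketch (hypothetical helper): text partitioning for long inputs,
# yielding overlapping token windows no longer than a model's context limit.
def chunk_tokens(token_ids, max_len=512, stride=64):
    """Yield overlapping windows of at most max_len tokens."""
    step = max_len - stride
    for start in range(0, max(len(token_ids) - stride, 1), step):
        yield token_ids[start:start + max_len]

windows = list(chunk_tokens(list(range(1300)), max_len=512, stride=64))
print([len(w) for w in windows])         # [512, 512, 404]
```

In a chunking pipeline each window is classified separately and the per-window predictions are then aggregated (for example by averaging), which is one way to apply a fixed-context model such as BERT to long financial documents [4].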
About the Authors
E. F. Boltachev
Russian Federation
Eldar F. Boltachev — PhD (Tech.), Assoc. Prof., Department of Artificial Intelligence, Faculty of Information Technology and Big Data Analysis; Center for Digital Transformation and Artificial Intelligence
Moscow
M. P. Farhadov
Russian Federation
Mais P. Farhadov — PhD (Tech.), Senior Researcher, Head of the “Ergatic Systems” Laboratory
Moscow
A. I. Tyulyakov
Russian Federation
Alexander I. Tyulyakov — student of the Faculty of Information Technology and Big Data Analysis
Moscow
References
1. Pankratova M. D., Skovpen T. N. NLP Models Using Neural Networks for Sentiment Analysis of News. Analytical Technologies in the Social Sphere: Theory and Practice. 2023:97–107. URL: https://www.elibrary.ru/ctabku
2. Markov A. K., Semenochkin D. O., Kravets A. G., Yanovsky T. A. Comparative Analysis of Applied Natural Language Processing Technologies for Improving the Quality of Digital Document Classification. International Journal of Information Technologies. 2024;12(3):66–77. URL: https://www.elibrary.ru/tubosi
3. Araci D. FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models. arXiv preprint arXiv:1908.10063. 2019:7. DOI: 10.48550/arXiv.1908.10063
4. Jaiswal A., Milios E. Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT. arXiv preprint arXiv:2310.20558. 2023:13. DOI: 10.48550/arXiv.2310.20558
5. Condevaux C., Harispe S. LSG Attention: Extrapolation of Pretrained Transformers to Long Sequences. Advances in Knowledge Discovery and Data Mining (PAKDD). 2023;13935 LNCS:443–454. DOI: 10.48550/arXiv.2210.15497
6. Sennrich R., Haddow B., Birch A. Neural Machine Translation of Rare Words with Subword Units. Proceedings of ACL. 2016:1715–1725. DOI: 10.48550/arXiv.1508.07909
7. Bostrom K., Durrett G. Byte Pair Encoding is Suboptimal for Language Model Pretraining. Proceedings of EMNLP. 2020:461–466. DOI: 10.18653/v1/2020.findings-emnlp.414
8. Wu S., et al. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564. 2023:63. DOI: 10.48550/arXiv.2303.17564
9. Yang Z., et al. Hierarchical Attention Networks for Document Classification. Proceedings of NAACL. 2016:1480–1489. DOI: 10.18653/v1/N16-1174
10. Beltagy I., Peters M. E., Cohan A. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150. 2020:12. DOI: 10.48550/arXiv.2004.05150
11. Zaheer M., et al. Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems (NeurIPS). 2020;33:17283–17297. DOI: 10.48550/arXiv.2007.14062
12. Dai Z., et al. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of ACL. 2019:2978–2988. DOI: 10.48550/arXiv.1901.02860
For citations:
Boltachev E.F., Farhadov M.P., Tyulyakov A.I. Modern Tokenization Methods for Text Processing in the Financial Domain. Digital Solutions and Artificial Intelligence Technologies. 2025;1(3):19-29. (In Russ.) https://doi.org/10.26794/3033-7097-2025-1-3-19-29
