
Digital Solutions and Artificial Intelligence Technologies


Modern Tokenization Methods for Text Processing in the Financial Domain

https://doi.org/10.26794/3033-7097-2025-1-3-19-29

Abstract

The paper considers tokenization as a key step in textual data processing, with particular attention to the financial domain. Current tokenization techniques are analyzed, with examples from recent research, and their impact on the performance of NLP models is assessed. The study shows that subword tokenization algorithms (BPE, WordPiece, Unigram) have become the standard for language models owing to their flexibility and text compression efficiency. The limitations of these approaches are discussed: BPE and WordPiece show a tendency toward over-segmentation, Unigram requires a more complex training procedure, and character-level tokenization produces excessively long sequences. Special attention is paid to the limit that language models place on input sequence length and to the problem of processing long texts; three approaches are considered: partitioning the text into chunks, hierarchical processing, and extrapolation of pretrained Transformer models to longer input sequences. For financial data, the use of domain-specific tokenizers or additional training on specialized corpora is recommended, as confirmed by the successful experience of BloombergGPT. In conclusion, the importance of tokenization for financial analytics is emphasized, since the quality of text processing there directly affects decision-making. Tokenization methods continue to develop in parallel with improvements in NLP models, which makes this stage of text processing a critical component of modern analytical systems.
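
To make the two central ideas of the abstract concrete — subword tokenization and chunk-based handling of long documents — the sketch below trains a small BPE tokenizer and splits a long token sequence into overlapping windows. It is only an illustration, not code from the paper: it assumes the open-source Hugging Face `tokenizers` library, and the toy corpus, vocabulary size, and window parameters are arbitrary choices made for the example.

```python
# Minimal sketch: (1) train a subword (BPE) tokenizer, (2) partition a long
# token sequence into overlapping chunks that fit a model's input limit.
# Corpus, vocab_size, max_len and stride are illustrative assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tiny stand-in for a domain-specific financial corpus (news, filings, reports).
corpus = [
    "The central bank raised the key interest rate by 25 basis points.",
    "Quarterly revenue grew 12% year over year, beating analyst forecasts.",
    "Bond yields fell as investors priced in a slower pace of tightening.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("Yields on ten-year bonds rose after the rate decision.")
print(encoding.tokens)  # subword pieces produced by the learned merges


def chunk_ids(ids, max_len=512, stride=128):
    """Split a long token-id sequence into overlapping windows of at most
    max_len tokens (the text-partitioning approach to long documents)."""
    step = max_len - stride
    return [ids[start:start + max_len]
            for start in range(0, max(len(ids) - stride, 1), step)]


long_ids = tokenizer.encode(" ".join(corpus * 200)).ids
print([len(c) for c in chunk_ids(long_ids)])  # each chunk fits the model limit
```

Overlapping windows are a common way to keep some shared context between neighbouring chunks, so that sentences are not cut off at hard window boundaries when a long document is fed to a fixed-length model.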

About the Authors

E. F. Boltachev
Financial University under the Government of the Russian Federation
Russian Federation

Eldar F. Boltachev — PhD (Tech.), Assoc. Prof., Department of Artificial Intelligence, Faculty of Information Technology and Big Data Analysis; Center for Digital Transformation and Artificial Intelligence

Moscow



M. P. Farhadov
Trapeznikov Institute of Control Sciences of the Russian Academy of Sciences
Russian Federation

Mais P. Farhadov — PhD (Tech.), Senior Researcher, Head of the “Ergatic Systems” Laboratory

Moscow



A. I. Tyulyakov
Financial University under the Government of the Russian Federation
Russian Federation

Alexander I. Tyulyakov — student of the Faculty of Information Technology and Big Data Analysis

Moscow



References

1. Pankratova M. D., Skovpen T. N. NLP Models Using Neural Networks for Sentiment Analysis of News. Analytical Technologies in the Social Sphere: Theory and Practice. 2023:97–107. URL: https://www.elibrary.ru/ctabku

2. Markov A. K., Semenochkin D. O., Kravets A. G., Yanovsky T. A. Comparative Analysis of Applied Natural Language Processing Technologies for Improving the Quality of Digital Document Classification. International Journal of Information Technologies. 2024;12(3):66–77. URL: https://www.elibrary.ru/tubosi

3. Araci D. FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models. arXiv preprint arXiv:1908.10063. 2019:7. DOI: 10.48550/arXiv.1908.10063

4. Jaiswal A., Milios E. Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT. arXiv preprint arXiv:2310.20558. 2023:13. DOI: 10.48550/arXiv.2310.20558

5. Condevaux C., Harispe S. LSG Attention: Extrapolation of Pretrained Transformers to Long Sequences. Advances in Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Computer Science. 2023;13935:443–454. DOI: 10.48550/arXiv.2210.15497

6. Sennrich R., Haddow B., Birch A. Neural Machine Translation of Rare Words with Subword Units. Proceedings of ACL. 2016:1715–1725. DOI: 10.48550/arXiv.1508.07909

7. Bostrom K., Durrett G. Byte Pair Encoding is Suboptimal for Language Model Pretraining. Proceedings of EMNLP. 2020:461–466. DOI: 10.18653/v1/2020.findings-emnlp.414

8. Wu S., et al. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564. 2023:63. DOI: 10.48550/arXiv.2303.17564

9. Yang Z., et al. Hierarchical Attention Networks for Document Classification. Proceedings of NAACL. 2016:1480–1489. DOI: 10.18653/v1/N16-1174

10. Beltagy I., Peters M. E., Cohan A. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150. 2020:12. DOI: 10.48550/arXiv.2004.05150

11. Zaheer M., et al. Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems (NeurIPS). 2020;33:17283–17297. DOI: 10.48550/arXiv.2007.14062

12. Dai Z., et al. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of ACL. 2019:2978–2988. DOI: 10.48550/arXiv.1901.02860




For citations:


Boltachev E.F., Farhadov M.P., Tyulyakov A.I. Modern Tokenization Methods for Text Processing in the Financial Domain. Digital Solutions and Artificial Intelligence Technologies. 2025;1(3):19-29. (In Russ.) https://doi.org/10.26794/3033-7097-2025-1-3-19-29



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 3033-7097 (Online)