Modern Tokenization Methods for Text Processing in the Financial Domain
https://doi.org/10.26794/3033-7097-2025-1-3-19-29
Abstract
The paper discusses tokenization as a key step in textual data processing, particularly in the financial domain. Current tokenization techniques are analyzed with examples from recent research, together with their impact on the performance of NLP models. The study shows that subword tokenization algorithms (BPE, WordPiece, Unigram) have become the standard for language models owing to their flexibility and text compression efficiency. We discuss the limitations of these approaches (BPE and WordPiece tend to over-segment text, Unigram requires a more complex training procedure, and character-level tokenization produces excessively long sequences), the resulting constraints on input sequence length in language models, and methods to overcome them, including text partitioning, hierarchical processing, and extrapolation of pre-trained Transformer models to long inputs. For financial data, it is recommended to use domain-specific tokenizers or to perform additional training on specialized corpora, which is confirmed by the successful experience of BloombergGPT. Special attention is paid to the problem of processing long texts, for which three approaches are considered: text partitioning, hierarchical processing, and extrapolation of transformer models. In conclusion, the importance of tokenization for financial analytics is emphasized, where the quality of text processing directly affects decision-making. Tokenization methods continue to develop in parallel with improvements in NLP models, which makes this stage of text processing a critical component of modern analytical systems.
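The two Python sketches below are illustrative only and are not taken from the paper. The first follows the BPE merge-learning procedure described by Sennrich et al. [6] on a hypothetical toy corpus; the second shows the simplest form of text partitioning, splitting a token sequence into overlapping windows that fit a fixed context limit, in the spirit of chunking approaches for long documents [4]. All function names, parameters, and sample data are assumptions introduced for illustration.

```python
# Illustrative sketch: learning BPE merge rules on a tiny, hypothetical
# financial-text corpus, following the procedure of Sennrich et al. [6].
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs across the word vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge the chosen symbol pair into a single symbol in every word."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is represented as space-separated characters plus an
# end-of-word marker, so merges never cross word boundaries.
corpus = "interest rate rises rate cut interest income".split()
vocab = Counter(" ".join(list(w)) + " </w>" for w in corpus)

merges = []
for _ in range(10):                      # merge budget ~ target vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)                            # learned merge rules (symbol pairs)


# Illustrative sketch (hypothetical helper): text partitioning for long inputs,
# yielding overlapping token windows no longer than a model's context limit.
def chunk_tokens(token_ids, max_len=512, stride=64):
    """Yield overlapping windows of at most max_len tokens."""
    step = max_len - stride
    for start in range(0, max(len(token_ids) - stride, 1), step):
        yield token_ids[start:start + max_len]

windows = list(chunk_tokens(list(range(1300)), max_len=512, stride=64))
print([len(w) for w in windows])         # [512, 512, 404]
```

In a chunking pipeline each window is classified separately and the per-window predictions are then aggregated (for example by averaging), which is one way to apply a fixed-context model such as BERT to long financial documents [4].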
About the Authors
E. F. Boltachev
Russian Federation
Eldar F. Boltachev — PhD (Tech.), Assoc. Prof., Department of Artificial Intelligence, Faculty of Information Technology and Big Data Analysis; Center for Digital Transformation and Artificial Intelligence
Moscow
M. P. Farhadov
Russian Federation
Mais P. Farhadov — PhD (Tech.), Senior Researcher, Head of the “Ergatic Systems” Laboratory
Moscow
A. I. Tyulyakov
Russian Federation
Alexander I. Tyulyakov — student of the Faculty of Information Technology and Big Data Analysis
Moscow
References
1. Pankratova M. D., Skovpen T. N. NLP Models Using Neural Networks for Sentiment Analysis of News. Analytical Technologies in the Social Sphere: Theory and Practice. 2023:97–107. URL: https://www.elibrary.ru/ctabku
2. Markov A. K., Semenochkin D. O., Kravets A. G., Yanovsky T. A. Comparative Analysis of Applied Natural Language Processing Technologies for Improving the Quality of Digital Document Classification. International Journal of Information Technologies. 2024;12(3):66–77. URL: https://www.elibrary.ru/tubosi
3. Araci D. FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models. arXiv preprint arXiv:1908.10063. 2019:7. DOI: 10.48550/arXiv.1908.10063
4. Jaiswal A., Milios E. Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT. arXiv preprint arXiv:2310.20558. 2023:13. DOI: 10.48550/arXiv.2310.20558
5. Condevaux C., Harispe S. LSG Attention: Extrapolation of Pretrained Transformers to Long Sequences. Advances in Knowledge Discovery and Data Mining (PAKDD). 2023;13935 LNCS:443–454. DOI: 10.48550/arXiv.2210.15497
6. Sennrich R., Haddow B., Birch A. Neural Machine Translation of Rare Words with Subword Units. Proceedings of ACL. 2016:1715–1725. DOI: 10.48550/arXiv.1508.07909
7. Bostrom K., Durrett G. Byte Pair Encoding is Suboptimal for Language Model Pretraining. Proceedings of EMNLP. 2020:461–466. DOI: 10.18653/v1/2020.findings-emnlp.414
8. Wu S., et al. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564. 2023:63. DOI: 10.48550/arXiv.2303.17564
9. Yang Z., et al. Hierarchical Attention Networks for Document Classification. Proceedings of NAACL. 2016:1480–1489. DOI: 10.18653/v1/N16-1174
10. Beltagy I., Peters M. E., Cohan A. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150. 2020:12. DOI: 10.48550/arXiv.2004.05150
11. Zaheer M., et al. Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems (NeurIPS). 2020;33:17283–17297. DOI: 10.48550/arXiv.2007.14062
12. Dai Z., et al. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of ACL. 2019:2978–2988. DOI: 10.48550/arXiv.1901.02860
For citations:
Boltachev E.F., Farhadov M.P., Tyulyakov A.I. Modern Tokenization Methods for Text Processing in the Financial Domain. Digital Solutions and Artificial Intelligence Technologies. 2025;1(3):19-29. (In Russ.) https://doi.org/10.26794/3033-7097-2025-1-3-19-29
