Thesis Details
Improving Subword Tokenization Methods for Multilingual Models
Thesis title in Czech: Vylepšení metod tokenizace pro vícejazyčné modely
Title in English: Improving Subword Tokenization Methods for Multilingual Models
Keywords: natural language processing|multilingual language models|subword tokenization|NLP
Keywords in English: natural language processing|multilingual language models|subword tokenization|NLP
Academic year of announcement: 2022/2023
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: Ing. Tomasz Limisiewicz, Ph.D.
Author: Mgr. Jiří Balhar - assigned and confirmed by the Study Department
Date of registration: 03.03.2023
Date of assignment: 15.06.2023
Date of confirmation by the Study Department: 11.07.2023
Date and time of defence: 05.09.2023 09:00
Date of electronic submission: 20.07.2023
Date of printed submission: 24.07.2023
Date of defence: 05.09.2023
Opponents: Mgr. Martin Popel, Ph.D.

Guidelines
Tokenization of input text is a crucial preprocessing step for any natural language processing (NLP) task. In the context of large transformer language models in particular, subword tokenization methods such as BPE [1], WordPiece [2], or Unigram LM [3] are used to tackle the out-of-vocabulary (OOV) problem and to curb vocabulary size by splitting less frequent words into their constituent subword units. While much of the research effort on large language models has gone into scaling the models, gathering more training data, or increasing the number of covered languages, only lately has attention turned to improving tokenization methods. Recent works have investigated scaling vocabulary size to allocate enough capacity to every language [4] and have introduced novel tokenization approaches suited to multilingual data [5,6].
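As a brief illustration of the mechanism described above, the following sketch trains a small BPE model and shows a previously unseen word being decomposed into known subword units rather than mapped to an unknown token. It assumes the Hugging Face tokenizers library; the toy corpus and vocabulary size are placeholders, not part of the assignment.

    # Minimal BPE illustration; library choice and toy data are assumptions.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    corpus = ["low lower lowest", "new newer newest", "wide wider widest"]
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=60, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(corpus, trainer)

    # "slower" never occurs in the corpus, yet it is split into known
    # subwords (e.g. something like ['s', 'lower'], depending on the
    # learned merges) instead of becoming [UNK].
    print(tokenizer.encode("slower").tokens)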

The thesis aims to find an optimal tokenization method for multilingual language models. It will summarize current research on tokenization [7], with a special focus on the multilingual setting. Then, building on the findings of [8], the thesis will analyze the allocation and overlap of lexical units under current methods and propose an improvement guided by these metrics, possibly in line with recent research that clusters the training corpora by language [5]. The improvements will then be validated on multilingual downstream tasks.
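To give a sense of the kind of overlap metric involved, the following is a hedged sketch, not the exact measures defined in [8]: vocabulary overlap between two languages can be quantified, for instance, as the Jaccard similarity of the subword types a shared tokenizer actually emits on each language's corpus. The tokenize callable is a hypothetical stand-in for any trained subword tokenizer.

    # Illustrative only: Jaccard similarity is one possible choice of
    # overlap measure, not necessarily the one used in [8].
    def used_vocabulary(tokenize, corpus):
        """Subword types the tokenizer emits on a corpus (list of strings)."""
        return {token for line in corpus for token in tokenize(line)}

    def vocabulary_overlap(tokenize, corpus_a, corpus_b):
        """Jaccard index of subword types used on two corpora, in [0, 1]."""
        used_a = used_vocabulary(tokenize, corpus_a)
        used_b = used_vocabulary(tokenize, corpus_b)
        return len(used_a & used_b) / len(used_a | used_b)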
References
[1] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
[2] Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152.
[3] Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
[4] Bo Zheng, Li Dong, Shaohan Huang, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, and Furu Wei. 2021. Allocating large vocabulary capacity for cross-lingual language model pre-training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3203–3215, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.257.
[5] Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, and Jason Riesa. 2020. Improving multilingual models with language-clustered vocabularies. arXiv, October 24, 2020. http://arxiv.org/abs/2010.12777.
[6] Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, and Madian Khabsa. 2023. XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models. arXiv, January 25, 2023. http://arxiv.org/abs/2301.10472.
[7] Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, et al. 2021. Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP. arXiv, December 20, 2021. https://www.semanticscholar.org/paper/Between-words-and-characters%3A-A-Brief-History-of-in-Mielke-Alyafeai/d617f51833860dc50d202af7f80be71304b2e994.
[8] Tomasz Limisiewicz, Jiří Balhar, and David Mareček. 2023. Tokenization impacts multilingual language modeling: Assessing vocabulary allocation and overlap across languages. Under review for ACL 2023, January 2023.