Preprocessing of Subword Encoding for NMT
Thesis title in Czech: | Předzpracování podslovních jednotek pro neuronový strojový překlad |
---|---|
Thesis title in English: | Preprocessing of Subword Encoding for NMT |
Keywords in Czech: | neuronový strojový překlad, segmentace na podslovní jednotky, Byte Pair Encoding, tokenizace |
Keywords in English: | neural machine translation, subword segmentation, Byte Pair Encoding, tokenization |
Academic year of topic announcement: | 2022/2023 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | Mgr. Martin Popel, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Department |
Date of registration: | 28.02.2023 |
Date of assignment: | 20.03.2023 |
Date of confirmation by the Study Department: | 28.03.2023 |
Date and time of defence: | 12.06.2024 09:00 |
Date of electronic submission: | 03.05.2024 |
Opponents: | Abishek Stephen J. |
Guidelines |
Subword encoding has become an integral part of current neural machine translation approaches. However, classical subword encoding approaches, such as BPE or WordPiece, are prone to producing divergent encodings even for minor differences in the input data, for instance differences in casing or the presence or absence of diacritics. The goal of the thesis is to design, implement and evaluate an alternative subword segmentation algorithm with the following properties. First, it should be almost lossless (the original text can be reconstructed from the encoded sequence, with only a few exceptions such as whitespace normalization). Second, the length of the encoded sequence should be minimized (given a fixed subword vocabulary size). Third, it should maximize the number of subword indices shared by similar strings (strings that differ only in casing, diacritics, etc.). Optionally, it could also aim to maximize the subword overlap across related languages.
If the research outcomes are positive, we expect them to be incorporated into the Charles translator for Ukraine, the Czech-Ukrainian translator developed by UFAL. |
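The first and third properties can be illustrated with a minimal Python sketch. It is not the algorithm the thesis is expected to deliver, only one possible reversible preprocessing step under stated assumptions: casing is factored out into a hypothetical marker symbol `<U>`, and diacritics are split off as separate combining-mark symbols via Unicode NFD decomposition. The names `encode`, `decode` and `skeleton` are illustrative, and the round trip is assumed to hold only for NFC-normalized input. The sketch does not address the second property; on its own it lengthens the sequence, and a subword vocabulary would still have to be learned on top of it.

```python
import unicodedata

UPPER = "<U>"  # hypothetical marker symbol: the next letter is uppercase


def encode(text: str) -> list[str]:
    """Split text into symbols with casing and diacritics factored out."""
    symbols = []
    for ch in unicodedata.normalize("NFD", text):
        low = ch.lower()
        # Factor out case only when lowercasing is exactly reversible.
        if ch.isupper() and len(low) == 1 and low.upper() == ch:
            symbols.append(UPPER)
            symbols.append(low)
        else:
            symbols.append(ch)
    return symbols


def decode(symbols: list[str]) -> str:
    """Invert encode(); assumes the original text was NFC-normalized."""
    out = []
    upper_next = False
    for sym in symbols:
        if sym == UPPER:
            upper_next = True
        elif upper_next:
            out.append(sym.upper())
            upper_next = False
        else:
            out.append(sym)
    return unicodedata.normalize("NFC", "".join(out))


def skeleton(symbols: list[str]) -> list[str]:
    """Drop case markers and combining marks, leaving the shared base symbols."""
    return [s for s in symbols if s != UPPER and not unicodedata.combining(s)]


if __name__ == "__main__":
    word = unicodedata.normalize("NFC", "Příliš")
    encoded = encode(word)
    print(encoded)
    print(decode(encoded) == word)
    print(skeleton(encoded) == skeleton(encode("prilis")))
```

With this representation, variants of a word that differ only in casing or diacritics share the same base symbols, so a subword model trained on the symbol sequences has a chance to assign them overlapping units, while the marker and combining-mark symbols keep the encoding reversible.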
References |
Alexandre Berard, Ioan Calapodescu, and Claude Roux. 2019. Naver Labs Europe’s Systems for the WMT19 Machine Translation Robustness Task. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 526–532, Florence, Italy. Association for Computational Linguistics. https://aclanthology.org/W19-5361/
Rexline S. J., Robert L. 2011. Substitution Coder – A Reversible Data Transform for Lossless Text Compression. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6173125
Thierry Etchegoyhen and Harritxu Gete. 2020. To Case or not to case: Evaluating Casing Methods for Neural Machine Translation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3752–3760, Marseille, France. European Language Resources Association. https://aclanthology.org/2020.lrec-1.463.pdf |