Thesis (Selection of subject) (version: 368)
Thesis details
Preprocessing of Subword Encoding for NMT
Thesis title in Czech: Předzpracování podslovních jednotek pro neuronový strojový překlad
Thesis title in English: Preprocessing of Subword Encoding for NMT
Key words: neuronový strojový překlad|segmentace na podslovní jednotky|Byte Pair Encoding|tokenizace
English key words: neural machine translation|subword segmentation|Byte Pair Encoding|tokenization
Academic year of topic announcement: 2022/2023
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: Mgr. Martin Popel, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 28.02.2023
Date of assignment: 20.03.2023
Confirmed by Study dept. on: 28.03.2023
Date and time of defence: 12.06.2024 09:00
Date of electronic submission: 03.05.2024
Date of submission of printed version: 02.05.2024
Date the defence took place: 12.06.2024
Opponents: Abishek Stephen J.
 
Guidelines
Subword encoding has become an integral part of current neural machine translation approaches. However, classical subword encoding approaches, such as BPE or WordPiece, are prone to producing divergent encodings even for minor differences in the input data, for instance in casing (uppercase vs. lowercase) or in the presence or absence of diacritics. The goal of the thesis is to design, implement and evaluate an alternative subword segmentation algorithm with the following properties. First, it should be almost lossless (the original text can be reconstructed from the encoded sequence, with only a few exceptions such as whitespace normalization). Second, the length of the encoded sequence should be minimized (given a fixed subword vocabulary size). Third, it should maximize the number of common subword indices for similar strings (differing in casing, diacritics etc.). Optionally, it could also try to maximize the subword overlap across related languages.
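
To make the first and third properties concrete, the following is a minimal Python sketch of one possible preprocessing step, in the spirit of the casing methods discussed in the references below. It is not the method designed in the thesis. It factors casing out into separate marker tokens before any subword segmentation is applied, so that "Praha", "PRAHA" and "praha" all share the same lowercase token. The marker strings `<T>` and `<U>` and the function names are hypothetical, and the sketch assumes that the markers never occur in the input text. Diacritics are not handled here.

```python
# Hypothetical marker tokens (an assumption of this sketch, not taken from
# the thesis); the input text is assumed never to contain them.
TITLE = "<T>"   # next token had its first character uppercased
UPPER = "<U>"   # next token was entirely uppercase


def encode_word(word):
    """Return [word] or [marker, lowercased word], whichever is reversible."""
    lower = word.lower()
    if word == lower:
        return [word]
    # Use a marker only if applying it to the lowercase form gives back
    # exactly the original word; otherwise keep the word unchanged.
    if len(word) > 1 and lower.upper() == word:
        return [UPPER, lower]
    if lower[:1].upper() + lower[1:] == word:
        return [TITLE, lower]
    return [word]  # mixed case such as "iPhone" is left as is


def encode(text):
    """Split on single spaces and factor casing out into marker tokens."""
    tokens = []
    for word in text.split(" "):
        tokens.extend(encode_word(word))
    return tokens


def decode(tokens):
    """Invert encode(): apply each marker to the token that follows it."""
    words = []
    pending = None
    for tok in tokens:
        if tok in (TITLE, UPPER):
            pending = tok
            continue
        if pending == UPPER:
            tok = tok.upper()
        elif pending == TITLE:
            tok = tok[:1].upper() + tok[1:]
        pending = None
        words.append(tok)
    return " ".join(words)


if __name__ == "__main__":
    text = "PRAHA Praha praha iPhone"
    tokens = encode(text)
    print(tokens)
    print(decode(tokens) == text)
```

The trade-off between the properties is visible even in this sketch: each marker adds one token, so the encoded sequence gets longer (working against the second property) in exchange for more shared tokens between differently cased variants of the same word.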

If the research outcomes are positive, we expect them to be incorporated into the Charles translator for Ukraine, the Czech-Ukrainian translator developed by UFAL.
References
Alexandre Berard, Ioan Calapodescu, and Claude Roux. 2019. Naver Labs Europe’s Systems for the WMT19 Machine Translation Robustness Task. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 526–532, Florence, Italy. Association for Computational Linguistics. https://aclanthology.org/W19-5361/

Rexline S. J., Robert L. 2011. Substitution Coder – A Reversible Data Transform for Lossless Text Compression
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6173125

Thierry Etchegoyhen and Harritxu Gete. 2020. To Case or not to case: Evaluating Casing Methods for Neural Machine Translation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3752–3760, Marseille, France. European Language Resources Association. https://aclanthology.org/2020.lrec-1.463.pdf
 