Thesis (Selection of subject) (version: 390)
Thesis details
Improving Czech-Ukrainian Machine Translation
Thesis title in Czech: Zlepšování česko-ukrajinského strojového překladu
Thesis title in English: Improving Czech-Ukrainian Machine Translation
Key words: neuronový strojový překlad|překlad pojmenovaných entit|ukrajinský překlad|předběžné trénování se šumem
English key words: neural machine translation|named entity translation|Ukrainian translation|noise pretraining
Academic year of topic announcement: 2022/2023
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: Mgr. Martin Popel, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 11.04.2023
Date of assignment: 15.04.2023
Confirmed by Study dept. on: 02.05.2023
Date and time of defence: 03.06.2025 09:00
Date of electronic submission: 30.04.2025
Date of submission of printed version: 30.04.2025
Date of proceeded defence: 03.06.2025
Opponents: RNDr. David Mareček, Ph.D.
Guidelines
Current state-of-the-art Neural Machine Translation systems segment both the input texts and the translations into subword units (tokens), so that simply concatenating the tokens reconstructs the original text. Only very light pre-processing is used, e.g., substituting spaces with underscores or deleting space tokens between two alphanumeric tokens. Even minor differences in orthography or morphology may result in completely different subword IDs, which exacerbates the data-sparseness problem and cannot easily be compensated for by training the neural models.
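
The effect can be illustrated with a minimal Python sketch. It is only a toy: the greedy longest-match segmenter, the function names and the tiny vocabulary are assumptions made for illustration, not the segmentation used by any particular translation system.

```python
def segment(text, vocab):
    """Toy greedy longest-match subword segmentation (illustrative only)."""
    # Light pre-processing: spaces become underscores so that plain
    # concatenation of the tokens can restore the text.
    text = text.replace(" ", "_")
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def detokenize(tokens):
    """Concatenate the tokens and turn underscores back into spaces."""
    return "".join(tokens).replace("_", " ")


# Hypothetical vocabulary: one spelling variant is a single token,
# the other has to be split into smaller pieces.
TOY_VOCAB = {"_", "мій", "учитель", "вчи", "тель"}

a = segment("мій учитель", TOY_VOCAB)  # ['мій', '_', 'учитель']
b = segment("мій вчитель", TOY_VOCAB)  # ['мій', '_', 'вчи', 'тель']
```

Both token sequences concatenate back to their original strings, yet the two spelling variants of the same word share no subword unit beyond the space marker. Text that already contains an underscore would not survive this toy round trip.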

The goal of this thesis is to develop tools for pre-processing the input texts that simplify or normalize language phenomena which are not important for the translation or which can be easily reconstructed. For example, the Ukrainian prepositions and prefixes "u" and "v" (i.e. "у" and "в" in the Cyrillic script) have the same meaning in most words, and the choice between them is based on orthoepic rules (e.g. "учитель" vs "вчитель"). Normalizing these variants as a pre-processing step for Ukrainian-to-Czech translation (in both the training and the test data) could improve the translation quality. It could also improve the quality of Czech-to-Ukrainian translation, where the tool would need to post-process the simplified translations so that they follow the Ukrainian orthoepic rules. The tool could also be used for correcting (post-editing) Ukrainian texts that do not adhere to the rules. Part of the thesis is to find other phenomena where simplification/normalization helps to improve the translation quality, and to properly evaluate the effects of the different pre- and post-processing options. A possible enhancement is to optionally generate word-stress accents in the Ukrainian translation.
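
A minimal Python sketch of the pre- and post-processing idea follows. It covers only the standalone preposition, not prefixes (which would need a lexicon to tell alternating words from non-alternating ones). The function names are hypothetical, and the restoration rule is a deliberately simplified assumption: "в" next to a vowel, "у" otherwise. The full orthoepic rules have further cases that are not modelled here.

```python
VOWELS = set("аеєиіїоуюя")


def normalize(sentence):
    """Pre-processing: map the standalone preposition 'у' to 'в'."""
    mapping = {"у": "в", "У": "В"}
    return " ".join(mapping.get(tok, tok) for tok in sentence.split(" "))


def restore(sentence):
    """Post-processing: choose 'у' or 'в' with a simplified euphony rule.

    Assumption (not the full rule set): use 'в' if the previous token ends
    in a vowel or the next token starts with one, otherwise use 'у'.
    """
    tokens = sentence.split(" ")
    out = []
    for k, tok in enumerate(tokens):
        if tok in ("у", "У", "в", "В"):
            prev_vowel = k > 0 and tokens[k - 1][-1:].lower() in VOWELS
            next_vowel = (k + 1 < len(tokens)
                          and tokens[k + 1][:1].lower() in VOWELS)
            form = "в" if (prev_vowel or next_vowel) else "у"
            # Keep the capitalization of the original token.
            tok = form.upper() if tok.isupper() else form
        out.append(tok)
    return " ".join(out)


normalized = normalize("Він живе у місті")  # 'Він живе в місті'
corrected = restore("Він був в місті")      # 'Він був у місті'
```

In this sketch, `normalize` would be applied to the Ukrainian side of the training and test data, and `restore` to Ukrainian output or to Ukrainian text that needs correcting. Punctuation attached to the preceding token counts as a non-vowel context, so "у" is chosen after a comma unless the next word starts with a vowel.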

If the research outcomes are positive, we expect them to be incorporated into the Charles translator for Ukraine, the Czech-Ukrainian translator developed by UFAL.
References
Jakub Náplava, Martin Popel, Milan Straka, Jana Straková (2021): Understanding Model Robustness to User-generated Noisy Texts. In: Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021), pp. 340-350, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-954085-90-9. https://aclanthology.org/2021.wnut-1.38/

Mikhail Korobov (2015): Morphological Analyzer and Generator for Russian and Ukrainian Languages. In: Khachay, M., Konstantinova, N., Panchenko, A., Ignatov, D., Labunets, V. (eds): Analysis of Images, Social Networks and Texts (AIST 2015), Communications in Computer and Information Science, vol. 542, Springer, Cham. https://doi.org/10.1007/978-3-319-26123-2_31

Martin Popel, Jindřich Libovický, Jindřich Helcl (2022): CUNI Systems for the WMT 22 Czech-Ukrainian Translation Task. In: Proceedings of the Seventh Conference on Machine Translation, pp. 352-357, Association for Computational Linguistics, Stroudsburg, PA, USA. https://statmt.org/wmt22/pdf/2022.wmt-1.30.pdf
 