Improving Czech-Ukrainian Machine Translation
Thesis title in Czech: | Zlepšování česko-ukrajinského strojového překladu |
---|---|
Thesis title in English: | Improving Czech-Ukrainian Machine Translation |
Key words: | neuronový strojový překlad|překlad pojmenovaných entit|ukrajinský překlad|předběžné trénování se šumem |
English key words: | neural machine translation|named entity translation|Ukrainian translation|noise pretraining |
Academic year of topic announcement: | 2022/2023 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | Mgr. Martin Popel, Ph.D. |
Author: | hidden |
Date of registration: | 11.04.2023 |
Date of assignment: | 15.04.2023 |
Confirmed by Study dept. on: | 02.05.2023 |
Date and time of defence: | 03.06.2025 09:00 |
Date of electronic submission: | 30.04.2025 |
Date of submission of printed version: | 30.04.2025 |
Date of proceeded defence: | 03.06.2025 |
Opponents: | RNDr. David Mareček, Ph.D. |
Guidelines |
Current state-of-the-art Neural Machine Translation systems segment both the input texts and the translations into subword units (tokens), so that a simple concatenation of the tokens reconstructs the original text. Only very light pre-processing is used, e.g., substituting spaces with underscores or deleting space tokens between two alphanumeric tokens. Even minor differences in orthography or morphology may result in completely different subword IDs, which exacerbates the data-sparseness problem and cannot easily be compensated for during the training of neural models.
The goal of this thesis is to develop tools for pre-processing the input texts that simplify or normalize language phenomena which are not important for the translation or which can be easily reconstructed. For example, the Ukrainian prepositions and prefixes "u" and "v" (i.e. "у" and "в" in the Cyrillic script) have the same meaning in most words, and the choice between them is based on orthoepic rules (e.g. "учитель" vs "вчитель"). Normalizing these variants as a pre-processing step for Ukrainian-to-Czech translation (for both the training and the test data) could improve the translation quality. It could also improve the Czech-to-Ukrainian translation quality, in which case the tool would need to post-process the simplified translations so that they follow the Ukrainian orthoepic rules. The tool could also be used for correcting (post-editing) Ukrainian texts that do not adhere to the rules. Part of the thesis is to find other phenomena where simplification/normalization helps to improve the translation quality and to properly evaluate the effects of different pre- and post-processing options. A possible enhancement is to optionally generate word-stress accents in the Ukrainian translation. If the research outcomes are positive, we expect them to be incorporated into the Charles translator for Ukraine, the Czech-Ukrainian translator developed by UFAL. |
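The "у"/"в" normalization and its reversal described in the guidelines can be illustrated with a minimal sketch. It is not the tool the thesis is expected to deliver: the names `CANONICAL`, `VARIANTS`, `normalize` and `restore` are hypothetical, the two-entry lexicon is a toy assumption, and the restoration rule is a deliberately simplified version of the orthoepic rules (it ignores, for example, the special treatment of a following "в" or "ф" and of all-caps words).

```python
import re

VOWELS = set("аеєиіїоуюя")

# Toy lexicon (assumption): "у" variant -> canonical "в" variant.
# A real tool would need a curated list, because some pairs differ in meaning.
CANONICAL = {"у": "в", "учитель": "вчитель"}
# Reverse mapping: canonical "в" variant -> "у" variant.
VARIANTS = {v: u for u, v in CANONICAL.items()}


def _match_case(src: str, dst: str) -> str:
    """Capitalize dst if src starts with an uppercase letter."""
    return dst.capitalize() if src[:1].isupper() else dst


def normalize(text: str) -> str:
    """Pre-processing: map every known 'у' variant to its canonical 'в' form."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        canon = CANONICAL.get(word.lower())
        return _match_case(word, canon) if canon else word
    return re.sub(r"\w+", repl, text)


def restore(text: str) -> str:
    """Post-processing: choose 'у' or 'в' from the context (simplified rule).

    The 'у' variant is used when the following sound is a consonant and the
    preceding word does not end in a vowel (this includes the start of the
    text and positions after punctuation); otherwise 'в' is kept.
    """
    parts = re.findall(r"\w+|\W+", text)  # words and separators alternate
    out = []
    for i, part in enumerate(parts):
        low = part.lower()
        if low in VARIANTS:
            prev_is_vowel = (
                i >= 2
                and parts[i - 1].isspace()
                and out[i - 2][-1].lower() in VOWELS
            )
            if len(low) > 1:
                nxt = low[1]  # sound after the prefix
            elif i + 2 < len(parts) and parts[i + 1].isspace():
                nxt = parts[i + 2][0].lower()  # first sound of the next word
            else:
                nxt = ""
            next_is_consonant = nxt.isalpha() and nxt not in VOWELS
            if next_is_consonant and not prev_is_vowel:
                part = _match_case(part, VARIANTS[low])
        out.append(part)
    return "".join(out)
```

In such a setup, `normalize` would be applied to Ukrainian training and test data before subword segmentation, and `restore` to the Ukrainian output of a Czech-to-Ukrainian model trained on normalized targets.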
References |
Jakub Náplava, Martin Popel, Milan Straka, Jana Straková (2021): Understanding Model Robustness to User-generated Noisy Texts. In: Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021), pp. 340-350, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-954085-90-9. https://aclanthology.org/2021.wnut-1.38/
Korobov, M. (2015). Morphological Analyzer and Generator for Russian and Ukrainian Languages. In: Khachay, M., Konstantinova, N., Panchenko, A., Ignatov, D., Labunets, V. (eds) Analysis of Images, Social Networks and Texts. AIST 2015. Communications in Computer and Information Science, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-319-26123-2_31
Martin Popel, Jindřich Libovický, Jindřich Helcl (2022): CUNI Systems for the WMT 22 Czech-Ukrainian Translation Task. In: Proceedings of the Seventh Conference on Machine Translation, pp. 352-357, Association for Computational Linguistics, Stroudsburg, PA, USA. https://statmt.org/wmt22/pdf/2022.wmt-1.30.pdf |