Thesis topics (Thesis selection)


Improving Czech-Ukrainian Machine Translation

Thesis title in Czech:	Zlepšování česko-ukrajinského strojového překladu
Thesis title in English:	Improving Czech-Ukrainian Machine Translation
Keywords (Czech):	strojový překlad, morfologická normalizace, ukrajinština
Keywords (English):	machine translation, morphological normalization, Ukrainian
Academic year of topic announcement:	2022/2023
Thesis type:	diploma thesis
Thesis language:
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	Mgr. Martin Popel, Ph.D.
Author:	hidden - assigned and confirmed by the Study Department
Date of registration:	11.04.2023
Date of assignment:	15.04.2023
Date of confirmation by the Study Department:	02.05.2023

Guidelines

The current best neural machine translation (NMT) systems segment both the input texts and the translations into subword units (tokens), so that a simple concatenation of the tokens reconstructs the original text. Only very light pre-processing is used, e.g., substituting spaces with underscores or deleting space tokens between two alphanumeric tokens. Even minor differences in orthography or morphology may result in completely different subword IDs, which exacerbates the data-sparseness problem and cannot easily be compensated for by training the neural models.
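The effect can be illustrated with a toy example. The sketch below is not a real subword algorithm such as BPE; it assumes a hypothetical hand-written vocabulary (`TOY_VOCAB`) and a greedy longest-match segmenter (`segment`), both invented for illustration. It shows that two spellings of the same Ukrainian word ("вчитель" and "учитель", both meaning "teacher") can be split into entirely different tokens, while concatenating the tokens still reconstructs the original text.

```python
# Hypothetical toy vocabulary; a real system learns tens of thousands of units.
TOY_VOCAB = {"_", "добрий", "вчитель", "у", "чи", "тель"}


def segment(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation (a stand-in for a real subword model)."""
    marked = text.replace(" ", "_")  # light pre-processing: spaces become underscores
    tokens = []
    i = 0
    while i < len(marked):
        # Longest vocabulary entry starting at position i;
        # fall back to a single character if nothing matches.
        candidates = [v for v in vocab if marked.startswith(v, i)]
        match = max(candidates, key=len, default=marked[i])
        tokens.append(match)
        i += len(match)
    return tokens


def detokenize(tokens):
    """Simple concatenation reconstructs the original text."""
    return "".join(tokens).replace("_", " ")


a = segment("добрий вчитель")  # the variant that happens to be in the vocabulary
b = segment("добрий учитель")  # the other spelling falls apart into smaller units
print(a)
print(b)
```

In this toy setting the first sentence yields three tokens and the second five, with no shared token for the noun, although the two sentences mean the same thing.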

The goal of this thesis is to develop tools for pre-processing the input texts that simplify or normalize language phenomena that are not important for the translation or that can be easily reconstructed. For example, the Ukrainian prepositions and prefixes "u" and "v" (i.e. "у" and "в" in the Cyrillic script) have the same meaning in most words, and the choice between them is based on orthoepic rules (e.g. "учитель" vs "вчитель"). Normalizing these variants as a pre-processing step for Ukrainian-to-Czech translation (in both the training and the test data) could improve translation quality. It could also improve Czech-to-Ukrainian translation quality; in that direction, the tool would need to post-process the simplified translations so that they follow the Ukrainian orthoepic rules. The tool could also be used for correcting (post-editing) Ukrainian texts that do not adhere to the rules. Part of the thesis is to find other phenomena where simplification or normalization helps to improve translation quality, and to properly evaluate the effects of different pre- and post-processing options. A possible enhancement is to optionally generate word-stress accents in the Ukrainian translation.
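A minimal sketch of the "у"/"в" normalization idea follows. Everything in it is an assumption made for illustration: the tiny lexicon `VARIANT_STEMS`, the function names `normalize` and `restore`, and above all the restoration rule, which is deliberately oversimplified (it chooses "в" after a word ending in a vowel and "у" otherwise, ignoring the following sound and other conditions of the real rules). A practical tool would likely need a morphological lexicon to decide which words actually alternate.

```python
import re

VOWELS = set("аеєиіїоуюя")

# Hypothetical mini-lexicon: remainders of words whose initial у-/в- alternates
# (учитель/вчитель, уже/вже, учора/вчора). Real coverage needs a proper lexicon.
VARIANT_STEMS = {"читель", "же", "чора"}

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters


def _is_variant(word):
    """True for the preposition у/в or a lexicon word starting with у-/в-."""
    lower = word.lower()
    return lower[0] in "ув" and (len(lower) == 1 or lower[1:] in VARIANT_STEMS)


def _with_initial(word, letter):
    """Replace the first letter, preserving its case."""
    first = letter.upper() if word[0].isupper() else letter
    return first + word[1:]


def normalize(text):
    """Pre-processing: map both variants to the canonical 'у' form."""
    def repl(match):
        word = match.group(0)
        return _with_initial(word, "у") if _is_variant(word) else word
    return WORD_RE.sub(repl, text)


def restore(text):
    """Post-processing with a simplified rule: 'в' after a vowel, else 'у'."""
    out, last_end, prev_word = [], 0, None
    for match in WORD_RE.finditer(text):
        gap = text[last_end:match.start()]
        out.append(gap)
        if gap.strip():  # punctuation in between: ignore the previous word
            prev_word = None
        word = match.group(0)
        if _is_variant(word):
            after_vowel = prev_word is not None and prev_word[-1].lower() in VOWELS
            word = _with_initial(word, "в" if after_vowel else "у")
        out.append(word)
        prev_word, last_end = word, match.end()
    out.append(text[last_end:])
    return "".join(out)


print(normalize("вона вчитель, а він учитель"))
print(restore("вона учитель"))
```

Note that `restore(normalize(text))` does not necessarily return the original text: input that does not follow the rule is rewritten, which is the behavior needed for the post-editing use case mentioned above.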

If the research outcomes are positive, we expect them to be incorporated into the Charles translator for Ukraine, the Czech-Ukrainian translator developed by UFAL.

References

Jakub Náplava, Martin Popel, Milan Straka, Jana Straková (2021): Understanding Model Robustness to User-generated Noisy Texts. In: Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021), pp. 340-350, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-954085-90-9. https://aclanthology.org/2021.wnut-1.38/

Korobov, M. (2015). Morphological Analyzer and Generator for Russian and Ukrainian Languages. In: Khachay, M., Konstantinova, N., Panchenko, A., Ignatov, D., Labunets, V. (eds) Analysis of Images, Social Networks and Texts. AIST 2015. Communications in Computer and Information Science, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-319-26123-2_31

Martin Popel, Jindřich Libovický, Jindřich Helcl (2022): CUNI Systems for the WMT 22 Czech-Ukrainian Translation Task. In: Proceedings of the Seventh Conference on Machine Translation, pp. 352-357, Association for Computational Linguistics, Stroudsburg, PA, USA. https://statmt.org/wmt22/pdf/2022.wmt-1.30.pdf