Identification of typical features of machine translation
Thesis title in Czech: | Identifikace typických rysů strojového překladu |
---|---|
Thesis title in English: | Identification of typical features of machine translation |
Key words: | strojový překlad|neuronové sítě|deep learning|strojové učení|NLP|zpracování přirozeného jazyka |
English key words: | machine translation|neural networks|deep learning|machine learning|natural language processing |
Academic year of topic announcement: | 2022/2023 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | Mgr. Jindřich Libovický, Ph.D. |
Author: | hidden |
Date of registration: | 24.10.2022 |
Date of assignment: | 24.10.2022 |
Confirmed by Study dept. on: | 27.03.2023 |
Date and time of defence: | 03.06.2025 09:00 |
Date of electronic submission: | 29.04.2025 |
Date of submission of printed version: | 30.04.2025 |
Date of proceeded defence: | 03.06.2025 |
Opponents: | Mgr. Martin Popel, Ph.D. |
Guidelines |
In some domains and under limited circumstances, machine translation reaches an output quality at which human evaluators can hardly distinguish human translation from machine translation. On the other hand, training a machine learning model that distinguishes authentic text from generated text is relatively simple. Recently, interpretable text classifiers have been developed that can indicate which parts of a sentence their decisions were based on. This will be the starting point of the thesis.
The thesis will proceed in two steps. The first step is to develop a classifier that distinguishes between human and machine translation when the machine translation is of high quality. The models will probably be based on pre-trained multilingual Transformer models, such as XLM-R. The second step is to use model interpretability methods (such as integrated gradient saliency) to analyze which features allow the models to distinguish generated text from authentic text. The analysis will lead to hypotheses on why this distinction might be difficult for human evaluators. |
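As a rough illustration of the integrated gradient saliency mentioned above, the following sketch computes attributions for a toy logistic classifier. It is not the thesis model: the classifier, its weights, the input features, and the all-zero baseline are hypothetical stand-ins chosen so that the example runs without any external library. With a model such as XLM-R, the same path integral would typically be taken over input embeddings instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy classifier (assumption): probability that the input is machine-translated."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def model_gradient(x, weights, bias):
    """Analytic gradient of the toy model's output with respect to each input feature."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum.

    The attribution of feature i is (x_i - baseline_i) times the average
    gradient along the straight path from the baseline to the input.
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th interval in [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_gradient(point, weights, bias)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical weights and features, for illustration only.
weights = [1.5, -2.0, 0.0]
bias = 0.1
x = [1.0, 0.5, 3.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
output_difference = model(x, weights, bias) - model(baseline, weights, bias)
print(attributions, sum(attributions), output_difference)
```

The attributions should sum approximately to the difference between the model output at the input and at the baseline (the completeness axiom of Sundararajan et al.), which gives a simple sanity check for any implementation.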
References |
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Sundararajan, M., Taly, A., & Yan, Q. (2017, July). Axiomatic attribution for deep networks. In International conference on machine learning (pp. 3319-3328). PMLR.
Ippolito, D., Duckworth, D., Callison-Burch, C., & Eck, D. (2020, July). Automatic Detection of Generated Text is Easiest when Humans are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 1808-1822).
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2020, July). Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8440-8451). |