Thesis (Selection of subject)

Identification of typical features of machine translation

Thesis title in Czech:	Identifikace typických rysů strojového překladu
Thesis title in English:	Identification of typical features of machine translation
Key words:	strojový překlad|neuronové sítě|deep learning|strojové učení|NLP|zpracování přirozeného jazyka
English key words:	machine translation|neural networks|deep learning|machine learning|natural language processing
Academic year of topic announcement:	2022/2023
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	Mgr. Jindřich Libovický, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	24.10.2022
Date of assignment:	24.10.2022
Confirmed by Study dept. on:	27.03.2023
Date and time of defence:	03.06.2025 09:00
Date of electronic submission:	29.04.2025
Date of submission of printed version:	30.04.2025
Date of proceeded defence:	03.06.2025
Opponents:	Mgr. Martin Popel, Ph.D.

Guidelines

In some domains and under limited circumstances, machine translation reaches such output quality that human evaluators can hardly distinguish human translation from machine translation. On the other hand, training a machine learning model that distinguishes authentic text from generated text is relatively simple. Recently, interpretable text classifiers have been developed that can indicate which parts of a sentence their decisions were based on. This will be the starting point of the thesis.

The thesis will proceed in two steps. The first step is to develop a classifier that distinguishes between human and machine translation when high-quality machine translation is used. The models will probably be based on pre-trained multilingual Transformer models, such as XLM-R. The second step is to use model interpretability methods (such as integrated gradient saliency) to analyze which features allow the models to distinguish generated text from authentic text. The analysis will lead to hypotheses about why this distinction might be difficult for human evaluators.
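
To make the second step concrete, the following is a minimal sketch of integrated gradients (Sundararajan et al., 2017) applied to a toy logistic classifier. It is not the thesis implementation: the three-feature model, its weights, and the all-zero baseline are hypothetical stand-ins for a fine-tuned Transformer classifier and its input embeddings, and the path integral is approximated with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Toy score: probability that input x is machine-translated."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def gradient(weights, bias, x):
    """Analytic gradient of predict() with respect to the input x."""
    p = predict(weights, bias, x)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(weights, bias, x, baseline, steps=200):
    """Approximate integrated gradients along the straight path
    from baseline to x using a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = gradient(weights, bias, point)
        totals = [t + g for t, g in zip(totals, grad)]
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Hypothetical model parameters and input features (assumptions for illustration).
WEIGHTS = [1.5, -2.0, 0.0]
BIAS = 0.1
x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(WEIGHTS, BIAS, x, baseline)

# Completeness axiom: attributions sum to the change in model output.
output_change = predict(WEIGHTS, BIAS, x) - predict(WEIGHTS, BIAS, baseline)
print(attributions, sum(attributions), output_change)
```

In this sketch the feature with zero weight receives zero attribution, and the attributions sum (up to numerical error) to the difference between the model output at the input and at the baseline. With a Transformer classifier, the same procedure would typically be applied to token embeddings, with gradients obtained by automatic differentiation rather than analytically.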

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

Sundararajan, M., Taly, A., & Yan, Q. (2017, July). Axiomatic attribution for deep networks. In International conference on machine learning (pp. 3319-3328). PMLR.

Ippolito, D., Duckworth, D., Callison-Burch, C., & Eck, D. (2020, July). Automatic Detection of Generated Text is Easiest when Humans are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 1808-1822).

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2020, July). Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8440-8451).