Thesis (Selection of subject)

Towards Machine Translation Based on Monolingual Texts

Thesis title in Czech:	Strojový překlad na základě jednojazyčných textů
Thesis title in English:	Towards Machine Translation Based on Monolingual Texts
Key words:	strojový překlad|neřízené učení|hluboké neuronové sítě|nízkozdrojové jazyky|zpracování přirozeného jazyka
English key words:	machine translation|unsupervised learning|deep neural networks|low-resource languages|natural language processing
Academic year of topic announcement:	2017/2018
Thesis type:	dissertation
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Ondřej Bojar, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	27.09.2018
Date of assignment:	27.09.2018
Confirmed by Study dept. on:	29.10.2018
Date and time of defence:	09.02.2024 09:00
Date of electronic submission:	12.11.2023
Date of submission of printed version:	14.11.2023
Date defence took place:	09.02.2024
Opponents:	Dr. Cristina Espana-Bonet
	RNDr. Martin Čmejrek, Ph.D.

Guidelines

The current state of the art in data-driven machine translation (both classical statistical MT and neural MT, NMT) relies heavily on parallel data, i.e. texts that have previously been translated by humans. This type of resource is usually not a natural by-product of other activities, except in limited domains such as nationwide regulations in countries with more than one official language. Exceptionally, some news agencies provide stories in two or more languages, but in general, constructing parallel corpora is rather expensive.

Comparable corpora are a different kind of resource. They contain different texts in various languages, but owing to some shared property, the texts in all the languages are known to describe the same events, things, etc. While a comparable corpus may contain no strictly parallel sentences, it can be expected to contain at least translations of individual words or phrases. Examples of comparable corpora include Wikipedia or top news articles published around the same date. The literature offers many methods for extracting translation units from such text collections or for enriching them with additional statistics.

The aim of the thesis is to investigate methods of training MT systems using monolingual texts. The goal can be approached from a wide range of angles, e.g. by focusing on methods for finding and obtaining well-matching sources on the web, by relying on very large existing collections of text (e.g. CommonCrawl) and devising methods that extract the best-matching sentence pairs, or by designing novel ways of NMT training that benefit from monolingual data. The developed methods should be as language-independent as possible, but during development the main focus will be on a few language pairs, e.g. Czech-Japanese.

An inherent part of the thesis is a careful evaluation in a range of settings: the mentioned Czech-Japanese pair could serve as a rather low-resource language pair, while English-Czech, English-French or another relevant pair will also be tested to check the utility of the methods for languages where large parallel data are readily available. The main domain will be news texts, but it would also be interesting to apply the system in a cross-domain setting or to consider existing domain-adaptation techniques.

References

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. CoRR, abs/1710.11041, 2017.

Jakub Kúdela, Irena Holubová, and Ondřej Bojar. Extracting parallel paragraphs from Common Crawl. The Prague Bulletin of Mathematical Linguistics, (107):36–59, 2017.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 130–140, Stroudsburg, PA, USA. Association for Computational Linguistics, 2012.

Philipp Koehn and Kevin Knight. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition - Volume 9, ULA '02, pages 9–16, Stroudsburg, PA, USA. Association for Computational Linguistics, 2002.

Matthew Garvey Snover. Improving Statistical Machine Translation Using Comparable Corpora. Dissertation thesis, University of Maryland, 2010.

Ivan Vulić and Marie-Francine Moens. Detecting highly confident word translations from comparable corpora without any prior knowledge. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 449–459, Stroudsburg, PA, USA. Association for Computational Linguistics, 2012.